CN112562659B

CN112562659B - Speech recognition method, device, electronic equipment and storage medium

Info

Publication number: CN112562659B
Application number: CN202011460228.0A
Authority: CN
Inventors: 高建清; 万根顺
Original assignee: Iflytek Shanghai Technology Co ltd
Current assignee: Iflytek Shanghai Technology Co ltd
Priority date: 2020-12-11
Filing date: 2020-12-11
Publication date: 2024-04-09
Anticipated expiration: 2040-12-11
Also published as: CN112562659A

Abstract

The embodiment of the invention provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, wherein the method comprises the following steps: determining voice data to be recognized; performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data; the scene associated text is determined based on application record data of a plurality of associated users. According to the voice recognition method, the voice recognition device, the electronic equipment and the storage medium, application record data of different users in the same voice recognition scene among different applications are obtained, the scene association text is extracted by utilizing the similarity of the attention points among the associated users, an auxiliary text with high association degree with the current scene is provided for the voice data to be recognized, and the accuracy of a voice recognition result obtained based on the scene association text is improved.

Description

Speech recognition method, device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of speech signal processing technologies, and in particular, to a speech recognition method, apparatus, electronic device, and storage medium.

Background

With the continuous development of artificial intelligence technology, the voice recognition technology is widely applied to meeting, interview, teaching, lecture and other scenes.

Existing speech recognition techniques typically pre-acquire corpus that may be relevant to the current use scenario prior to performing speech recognition to assist in performing speech recognition. However, if theme change occurs during actual voice collection and voice recognition, or the corpus obtained in advance is wrong, the accuracy of voice recognition is lowered.

Disclosure of Invention

The embodiment of the invention provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, which are used for solving the defect of poor voice recognition accuracy in the prior art.

The embodiment of the invention provides a voice recognition method, which comprises the following steps:

determining voice data to be recognized;

performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data;

the scene associated text is determined based on application record data of a plurality of associated users.

According to an embodiment of the present invention, the voice recognition method performs voice recognition on the voice data based on the scene associated text corresponding to the voice data, to obtain a voice recognition result of the voice data, including:

Decoding acoustic hidden layer features of the voice data based on scene associated text corresponding to the voice data to obtain the probability of each candidate word segmentation of each period of the voice data;

the speech recognition result is determined based on the probability of each candidate word segment for each period of the speech data.

According to the voice recognition method of one embodiment of the invention, the scene-related text comprises a hotword;

decoding the acoustic hidden layer feature of the voice data based on the scene associated text corresponding to the voice data to obtain the probability of each candidate word of each period of the voice data, including:

and correcting the probability of each candidate word segmentation of the voice data in each period based on the hot word or the hot word and the excitation coefficient thereof, and determining the voice recognition result based on the corrected probability of each candidate word segmentation in each period.

According to a voice recognition method of one embodiment of the present invention, the hotword is determined based on the steps of:

determining a first duration range of historical voice data of the voice data;

screening query keywords input in the first duration range from application use data of the plurality of associated users;

And selecting at least a preset number of query keywords input by users, and/or selecting the query keywords which are input by each user and are associated with the current scene as the hotwords.

According to the voice recognition method of the embodiment of the invention, the hot words appearing in the query keywords of at least two users, the hot words with repeated words or similar words in the query keywords of any user and the excitation coefficients of other hot words are sequentially decreased, and the higher the frequency of occurrence of any hot word in the query keywords of different users, the higher the excitation coefficient of any hot word.

According to the voice recognition method of one embodiment of the invention, the scene-related text comprises history extension texts corresponding to each history voice segment of the voice data;

and decoding the acoustic hidden layer characteristics of the voice data based on the universal corpus and the historical expanded text corresponding to each historical voice fragment to obtain the probability of each candidate word segmentation of the voice data in each period.

According to an embodiment of the present invention, the method for speech recognition decodes acoustic hidden layer features of the speech data based on a general corpus and a history expanded text corresponding to each history speech segment to obtain probabilities of each candidate word segment of the speech data, including:

decoding acoustic hidden layer features of any period of the voice data based on a general corpus and historical expanded texts corresponding to each historical voice segment respectively to obtain candidate probabilities of any candidate word segmentation of any period of the general corpus and each historical voice segment;

determining the probability of any candidate word segmentation based on the candidate probability of any candidate word segmentation corresponding to the general corpus and each historical voice segment and the weight corresponding to the general corpus and each historical voice segment;

wherein the more recent the voice data, the greater the weight corresponding to the historical voice segment.

According to an embodiment of the present invention, the method for speech recognition decodes acoustic hidden layer features of any period of the speech data based on the universal corpus and the history expanded text corresponding to each history speech segment, to obtain candidate probabilities of any candidate word segmentation of any period corresponding to the universal corpus and each history speech segment, includes:

And determining the candidate probability of any candidate word corresponding to any historical voice segment based on each type of historical expanded text corresponding to the any historical voice segment and the importance coefficient corresponding to the historical expanded text.

According to the voice recognition method of the embodiment of the invention, the history extension texts of all types comprise at least one of browsing content extension texts, hotword inquiry extension texts and preset extension texts;

the browsing content extension text corresponding to any historical voice fragment is acquired based on the following steps:

determining a second duration range of the any one of the historical speech segments;

screening browsing content in the second duration range from application record data of the plurality of associated users;

and selecting at least one of browsing contents associated with the hotword, browsing contents associated with at least two users and browsing contents associated with the current scene as browsing content extension text corresponding to any one of the historical voice fragments.

The embodiment of the invention also provides a voice recognition device, which comprises:

a voice data determining unit for determining voice data to be recognized;

The voice recognition unit is used for carrying out voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data;

The embodiment of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of any one of the voice recognition methods when executing the program.

The embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech recognition method as described in any of the above.

According to the voice recognition method, the voice recognition device, the electronic equipment and the storage medium, application record data of different users in the same voice recognition scene among different applications are obtained, the scene association text is extracted by utilizing the similarity of the attention points among the associated users, an auxiliary text with high association degree with the current scene is provided for the voice data to be recognized, and the accuracy of a voice recognition result obtained based on the scene association text is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a voice recognition method according to another embodiment of the present invention;

FIG. 3 is a schematic flow chart of a hotword determining method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of query keywords provided in an embodiment of the present invention;

fig. 5 is a schematic flow chart of a decoding method according to an embodiment of the present invention;

fig. 6 is a flowchart illustrating a method for determining extended text of browsing content according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of browsing content according to an embodiment of the present invention;

FIG. 8 is a flowchart of a voice recognition method according to another embodiment of the present invention;

fig. 9 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In this regard, the embodiment of the invention provides a voice recognition method. Fig. 1 is a flow chart of a voice recognition method according to an embodiment of the present invention, as shown in fig. 1, the method includes:

step 110, determining voice data to be recognized;

step 120, performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data;

the context associated text is determined based on application record data of a plurality of associated users.

Here, the plurality of associated users are a plurality of intelligent terminal users associated in the same speech recognition scene. For example, in a conference scenario, multiple associated users may be multiple participants of the conference, in a lecture scenario, multiple associated users may be multiple listeners of the lecture, etc. Because the multiple associated users are in the same speech recognition scene, the multiple associated users use application record data generated by different applications on the mobile terminal in the speech recognition scene, such as data obtained by inquiring or browsing by each user by using applications such as search engines, entertainment shopping or life services in the conference or speech process, the association degree with the current speech recognition scene is usually large, and even if theme change occurs currently, the theme change is reflected in the application record data of the multiple associated users, so that the acquired scene association text can be adjusted accordingly, and the high association between the scene association text and the current scene is ensured.

Therefore, the text with the greater degree of association with the current scene can be mined as scene association text based on the application record data of a plurality of associated users. According to the similarity of the attention points of different users in the same speech recognition scene, the correlation between the application record data provided by different users is utilized to mutually confirm the correlation degree between the application record data provided by each user and the current speech recognition scene, so that texts more correlated with the current speech recognition scene are acquired, the recognition accuracy is improved, irrelevant text contents are removed, and false triggering of speech recognition is relieved. In addition, the scene associated text is acquired from the application record data of a plurality of associated users, so that the user bias caused by the mode of acquiring the associated text from the application record data of a single user can be overcome, and the association degree of the scene associated text and the current scene can be improved.

Here, a sharing mechanism may be first established between a plurality of associated users to acquire application record data of each user. For example, sharing advice may be initiated by any user, sending and accepting sharing messages through channels of intercommunication within an existing local area network. When other users confirm to participate in sharing, time-synchronized proofreading and confirmation can be performed synchronously. For example, the time of the intelligent terminal equipment used by the user initiating the sharing suggestion can be taken as a reference, and the intelligent terminal equipment of other users records the time deviation of the intelligent terminal equipment from the initiator, so that the time synchronization is realized.

And then, based on the scene associated text corresponding to the voice data, assisting in voice recognition to obtain a voice recognition result of the voice data to be recognized. For example, semantic information of voice data to be recognized can be determined in an assisted manner based on scene associated text, and a language expression mode which is more matched with the current context can be provided, so that ambiguity caused by homonyms or near-homonyms and the like is eliminated, recognition results which are more in accordance with the current scene language expression specification are obtained, and the accuracy of voice recognition is improved.

According to the method provided by the embodiment of the invention, the scene association text is extracted by acquiring the application record data of different users in the same voice recognition scene among different applications and utilizing the similarity of the focus points among the associated users, so that the auxiliary text with high association degree with the current scene is provided for the voice data to be recognized, and the accuracy of the voice recognition result obtained based on the scene association text is improved.

Based on the foregoing embodiments, fig. 2 is a schematic flow chart of a voice recognition method according to another embodiment of the present invention, as shown in fig. 2, step 120 includes:

step 121, decoding the acoustic hidden layer feature of the voice data based on the scene associated text corresponding to the voice data to obtain the probability of each candidate word segmentation of each period of the voice data;

Step 122, determining a speech recognition result based on the probability of each candidate word for each period of the speech data.

The context-associated text corresponding to the voice data to be recognized can provide a language expression mode which is more matched with the context of the current scene, so that correct words can be selected from words with the same or similar voice, and recognition results which are more matched with the language expression specification of the current scene can be obtained.

Thus, in the process of decoding the acoustic hidden layer feature of the speech data, for any period of the speech data, for example, a pronunciation process corresponding to a word or a word, the probability of candidate word segmentation that may be expressed by the period may be determined in combination with the phoneme information and the scene-related text contained in the acoustic hidden layer feature of the period. Wherein the acoustic hidden layer features of the speech data may be used to determine the acoustic state and phonemes to which the speech data corresponds. Then, based on the probability of each candidate word segment of each period of the voice data, the word segment corresponding to each period is determined, thereby combining the voice recognition results forming the whole voice data.

Based on any of the above embodiments, the scene associated text includes a hotword;

Step 121 includes:

the probability of each candidate word of each period of the speech data is corrected based on the hot word or based on the hot word and its excitation coefficient, and the speech recognition result is determined based on the corrected probability of each candidate word of each period.

Here, the scene-related text may include keywords frequently appearing in application record data of a plurality of associated users, i.e., hotwords. Since hotwords frequently occur at multiple associated users in the current speech recognition scenario, it can be inferred that the likelihood of occurrence of the hotword in the speech data is also greater. Thus, the probability of each candidate word segment for each period of the speech data can be corrected based on the hotword. For example, for any period, the probability of the candidate word being a hot word may be increased by a preset value, thereby increasing the probability that the candidate word being a hot word is selected as the word corresponding to the period.

Furthermore, multiple hotwords may be included in the scene-associated text, and the importance of different hotwords may not be consistent. For example, a hot word having a high number of occurrences is more important, or a hot word having a high importance is found in application record data of a plurality of users, and attention of a plurality of users is paid to the hot word. Therefore, when the hot word excitation is carried out, the hot words with different importance can be distinguished, and different excitation coefficients are set for the hot words with different importance, so that the effect of the hot word excitation is improved, and the accuracy of voice recognition is further improved. Wherein the excitation coefficient of the hot word with higher importance is higher, and the increased value is also higher when the probability of candidate word segmentation is corrected. Then, the probability of each candidate word segment of the speech data for each period is corrected based on the hot word and its excitation coefficient. For example, for any period, when correcting the probability of a candidate word therein as a hot word, a preset value may be multiplied by the excitation coefficient of the hot word and then added to the probability of the candidate word.

According to the method provided by the embodiment of the invention, the scene associated text comprises the obtained hot words, so that the possibility that the candidate word serving as the hot word is selected as the corresponding word of any period is improved in a hot word excitation mode, and the accuracy of voice recognition is improved.

Based on any of the above embodiments, fig. 3 is a schematic flow chart of a hotword determining method according to an embodiment of the present invention, as shown in fig. 3, where the method includes:

step 310, a first duration range of historical voice data for the voice data is determined.

Here, the first duration range may be a period of time for which the historical voice data is continuous from the start of collection to the end of collection, and the historical voice data may be one or more sentences preceding the current voice data. The historical voice data and the corresponding first duration range thereof can be intercepted according to the boundary information of the historical voice recognition result obtained before the current voice data.

Step 320, filtering query keywords entered within a first duration range from application usage data of a plurality of associated users.

Here, the query keyword in the application usage data of the plurality of associated users and the time for performing the keyword search may be first acquired, for example, by using an input method function of the intelligent terminal, an input record generated by a pinyin input method, a voice input method, or a handwriting input method is used as the query keyword, and the time for generating the query keyword is used as the time for performing the keyword search. Then, according to the time of keyword searching, query keywords input in a first duration range are acquired. Fig. 4 is a schematic diagram of a query keyword provided in an embodiment of the present invention, as shown in fig. 4, it is assumed that a first duration range is T0 to T1, where T0 may represent a start of historical voice data, and T1 may represent an end of historical voice data, and may also represent a start of current voice data. The query keywords obtained by screening in the first duration range are shown in fig. 4, wherein U1K1 represents the 1 st query keyword input by the user 1 in the application, and UNKM represents the mth query keyword input by the user N in the application.

Step 330, selecting at least a preset number of query keywords input by users, and/or selecting the query keywords input by each user and associated with the current scene as hotwords.

Here, if a plurality of different users, for example, more than 2 users, all input the same query keyword, the keyword may be considered as a hotword; for the query keywords only appearing at a certain user, the correlation between the query keywords and the current existing voice recognition result or the pre-acquired text related to the current scene can be calculated based on the TF-IDF strategy, and a threshold value is set to select the query keywords with larger correlation degree with the current scene as hotwords. If any query keyword has other query keywords that are the same or similar in the query keywords of any user, a relatively low threshold may be set, and for other query keywords, a relatively high threshold may be set. And then, based on the hot words corresponding to the historical voice data, carrying out duplication elimination on the hot words obtained in the step, and combining the hot words corresponding to the historical voice data with the duplicated hot words obtained in the step to obtain the hot words corresponding to the current voice data. In addition, the hotword corresponding to the current voice data may be immediately validated, or may be revalidated at the end time of the first duration range, which is not particularly limited in the embodiment of the present invention.

Before the query keyword input method, the query keyword input device and the query keyword input system, the query keyword input device can be used for repeatedly inputting the same query keyword by any user in different applications or inputting the same or similar query keywords after multiple times of input or automatic correction of an engine in the same application, so that the query keywords of the same user can be de-duplicated. First, for any user i, the query keywords that it inputs are Ki1 to KiM. If the M query keywords have identical words, deleting the rest repeated query keywords; if the M query keywords are query keywords which are considered to be similar in pronunciation based on a pinyin restoration scheme or query keywords which are considered to be similar in font based on an existing font similarity checking scheme, calculating the correlation between the query keywords which are similar in pronunciation or font similarity and the current existing speech recognition result or the pre-collected text related to the current scene based on the TF-IDF strategy, and setting a threshold value to select the query keywords which are more related to the current scene from the query keywords which are similar in pronunciation or font similarity. If the correlation corresponding to the query keywords with similar pronunciation or similar font does not reach the threshold value, only the last query keyword is reserved.

According to the method provided by the embodiment of the invention, the association degree of the hot words and the current voice recognition scene is improved by selecting the query keywords input by a plurality of users and/or the query keywords which are input by each user and are associated with the current scene as the hot words, so that the accuracy of voice recognition is further improved.

Based on any of the above embodiments, the excitation coefficients of the hotwords appearing in the query keywords of at least two users, the hotwords having the repeated or similar words in the query keywords of any user, and other hotwords decrease in order, and the higher the frequency of occurrence of any hotword in the query keywords of different users, the greater the excitation coefficient thereof.

Here, considering that one hotword appears in query keywords of a plurality of users, it indicates that the hotword is focused by a plurality of users, so that the probability that the hotword appears in voice data in the voice recognition scene is also greater, and therefore the importance of the hotword is greater than that of the hotword that appears in query keywords of only one user. Meanwhile, if the frequency of occurrence of any hotword in the query keywords of different users is higher, the hotword is focused by more users, and the importance of the hotword is higher. In addition, if any hot word has a duplicate word or a similar word in the query keyword of any user, it indicates that the hot word is important for the user, so that the importance of the hot word is greater than that of other hot words.

Therefore, when the excitation coefficients of the hotwords are set, the hotwords appearing in the query keywords of at least two users, the hotwords with repeated words or similar words in the query keywords of any one user, and the excitation coefficients of other hotwords are sequentially decreased, and the higher the frequency of occurrence of any one hotword in the query keywords of different users, the higher the excitation coefficient thereof.

Wherein, for the hot words appearing in the query keywords of at least two users, the excitation coefficient thereof can be set to 1+ (the number of times appearing between different users/the number of users); for the hot words with repeated words or similar words in the query keywords of any user, the excitation coefficient of the hot words can be set to be 1+1/the number of users, and the excitation coefficients of other hot words can be set to be 1.

According to the method provided by the embodiment of the invention, when the excitation coefficients are set for the hot words, the hot words which appear in the query keywords of at least two users, the hot words which have repeated words or similar words in the query keywords of any user, and the excitation coefficients of other hot words are gradually decreased, and the higher the frequency of occurrence of any hot word in the query keywords of different users is, the larger the excitation coefficient is, so that the hot words with different importance are distinguished, and the accuracy of voice recognition is further improved.

Based on any of the above embodiments, the scene-related text includes history extension text corresponding to each history voice clip of the voice data;

step 121 includes:

and decoding the acoustic hidden layer characteristics of the voice data based on the universal corpus and the historical expanded text corresponding to each historical voice fragment to obtain the probability of each candidate word of each period of the voice data.

Here, the scene-related text may further include a history extension text corresponding to each history voice clip of the current voice data. The history expanded text corresponding to any history voice segment can be text obtained or expanded from application record data of a plurality of associated users in a period from the beginning of collection to the ending of collection of the history voice segment. For example, content that the user browses over this period of time, or other text related to content that the user browses over this period of time or the content of the query. Because each history extension text is generated in the collection process of each history voice fragment of the current voice data, the association degree of each history extension text and the current voice recognition scene is higher, a language expression mode which is more matched with the context of the current scene can be provided for the current voice data, correct words can be selected from words with the same or similar multi-pronunciation, and recognition results which are more in accordance with the language expression specification of the current scene can be obtained.

Therefore, when the acoustic hidden layer features of the current voice data are decoded, the general corpus and the historical expansion text corresponding to each historical voice segment can be combined to be used as a corpus referenced when the probability of each candidate word segmentation corresponding to each time period of the voice data is calculated by a language model. For example, when the language model is a statistical language model, in order to calculate the n-gram probability of each candidate word segment corresponding to any period, the probability of each candidate word segment appearing therein may be counted according to a corpus composed of a generic corpus and a history expanded text corresponding to each history speech segment. Wherein, the more the candidate word is in accordance with the language expression mode given by the corpus, the higher the probability.

According to the method provided by the embodiment of the invention, the scene associated text comprises the history extension text corresponding to each history voice fragment, so that the probability of candidate word segmentation according with the language expression mode given by each history extension text is improved by updating the language model corpus, and the recognition result more according with the current scene language expression specification is obtained, thereby improving the accuracy of voice recognition.

Based on any of the above embodiments, fig. 5 is a flow chart of a decoding method provided by the embodiment of the present invention, as shown in fig. 5, decoding acoustic hidden layer features of speech data based on a general corpus and a history expanded text corresponding to each history speech segment to obtain probabilities of each candidate word segment of the speech data, including:

And 1211, decoding the acoustic hidden layer characteristics of any period of the voice data based on the universal corpus and the historical expanded text corresponding to each historical voice segment respectively to obtain the candidate probability of any candidate word of the period corresponding to the universal corpus and each historical voice segment.

And respectively taking the general corpus and the historical expanded text corresponding to each historical voice segment as the corpus of the language model, decoding the acoustic hidden layer characteristics of any period of the voice data, and calculating to obtain the candidate probability of any candidate word in the period corresponding to the general corpus and each historical voice segment.

Taking a ternary language model (Trigram language model) as an example, the general corpus and the historical extension text corresponding to each historical voice segment can be respectively used as the corpus of the language model, and the candidate probability P of any candidate word segmentation in any period of the general corpus and each historical voice segment can be calculated _t (w _x |w _x-2 w _x-1 )、P _Pi (w _x |w _x-2 w _x-1 )、P _P(i-1) (w _x |w _x-2 w _x-1 )、…、P _P1 (w _x |w _x-2 w _x-1 ). Wherein, the historical voice fragments have i segments and w _x-2 And w _x-1 For word segmentation corresponding to the first two time periods of any time period, P _t (w _x |w _x-2 w _x-1 ) For the candidate probability of any candidate word segmentation in any period of the corresponding general corpus, P _Pi (w _x |w _x-2 w _x-1 )、P _P(i-1) (w _x |w _x-2 w _x-1 )、…、P _P1 (w _x |w _x-2 w _x-1 ) Candidate probabilities of any candidate word segment for any period of time corresponding to each historical speech segment.

Step 1212, determining a probability of the candidate word segmentation based on the candidate probabilities of the candidate word segmentation corresponding to the generic corpus and each of the historical speech segments and weights corresponding to the generic corpus and each of the historical speech segments;

wherein, the more recent the voice data, the greater the corresponding weight of the historical voice segment.

Here, the candidate probabilities of any candidate word segment corresponding to the generic corpus and each of the historical speech segments may be weighted and summed to obtain the probability of the candidate word segment. The closer any historical voice segment is to the current voice data, the greater the correlation degree between the historical expansion text generated in the collection process of the historical voice segment and the current voice data is, the more accurate the probability of any candidate word segmentation calculated according to the correlation degree is, and therefore the greater the weight is. When weights are set for each historical voice segment, a basic weight and a forgetting coefficient can be preset, n times of the forgetting coefficient is multiplied on the basis of the basic weight, wherein n=1. Since the sum of all weights is 1 in the weighted summation, the difference between 1 and the sum of the weights of each historical speech segment can be used as the weight of the generic corpus. For example, the probability of the candidate word may be determined using the following formula:

P _new (w _x |w _x-2 w _x-1 )

＝(1-α)βP _Pi (w _x |w _x-2 w _x-1 )

+(1-α)β ² P _P(i-1) (w _x |w _x-2 w _x-1 )+…

+(1-α)β ⁱ P _P1 (wx|w _x-2 w _x-1 )+[1-(1-α)β

-(1-α)β ² -…-(1-α)β ⁱ ]P _t (w _x |w _x-2 w _x-1 )

Wherein P is _new (w _x |w _x-2 w _x-1 ) The probability of the candidate word segmentation is 1-alpha as basic weight, and beta as forgetting coefficient.

According to the method provided by the embodiment of the invention, the general corpus and the historical extension texts corresponding to each historical voice segment are respectively used as the corpus of the language model, the candidate probability of any candidate word of the corresponding general corpus and each historical voice segment is calculated, the probability of the candidate word is determined based on the candidate probability of the candidate word of the corresponding general corpus and each historical voice segment and the weight corresponding to the general corpus and each historical voice segment, the importance of the historical extension texts corresponding to each historical voice segment is distinguished, and the historical extension texts corresponding to the historical voice segments which are closer to the current voice data are highlighted, so that the accuracy of voice recognition is improved.

Based on any of the above embodiments, step 1211 includes:

and determining the candidate probability of the candidate word segmentation corresponding to the historical voice segment based on the historical expanded text of each type corresponding to any historical voice segment and the importance coefficient corresponding to the historical expanded text.

Here, in order to enrich the history extension text, different types of history extension text may be acquired from different approaches. For example, content related to the current scene that the user browsed during the collection of the historical speech segments, or other text related to content that the user browsed or queried during this time period, may be obtained. The degree of association between the different types of history extension texts and the current scene is different, so that the functions of the different types of history extension texts are correspondingly different when the voice recognition is performed.

In order to embody that each type of history extension text plays a different role in the decoding process, a corresponding importance coefficient may be set for each type of history extension text. Wherein, the more relevant the history extension text of any type is to the current scene, the higher the importance coefficient thereof is. And respectively taking each type of history expanded text corresponding to any history voice fragment as a corpus of a language model, calculating candidate probabilities of the candidate word segmentation corresponding to each type of history expanded text, and then carrying out weighted summation based on importance coefficients corresponding to each type of history expanded text to obtain the candidate probabilities of the candidate word segmentation corresponding to the history voice fragment.

Based on any of the above embodiments, each type of history extension text includes at least one of a browsing content extension text, a hotword query extension text, and a preset extension text.

The method comprises the steps that a browsing content expansion text is a text which is acquired from browsing data of a plurality of associated users and is associated with a current scene, a hotword inquiry expansion text is a text which is obtained by carrying out keyword inquiry in a pre-acquired corpus based on the existing hotword, a preset expansion text is a text which is obtained by carrying out text similarity calculation on the browsing content expansion text and/or the hotword inquiry expansion text and is associated with the browsing content expansion text and/or the hotword inquiry expansion text to a larger degree after the text similarity calculation is carried out in the pre-acquired corpus. Here, considering that the browsing content expanded text is obtained from browsing contents of a plurality of associated users, the association degree of the browsing content expanded text with the current scene is higher, and therefore, the importance coefficient of the browsing content expanded text is higher than that of the hotword query expanded text and the preset expanded text. For example, the importance coefficients of the hotword query expanded text and the preset expanded text may be set to 1, and the importance coefficient of the browsing content expanded text may be set to be higher.

Fig. 6 is a flowchart of a method for determining extended text of browsing content according to an embodiment of the present invention, as shown in fig. 6, where the method includes:

step 610, a second duration range of the historical speech segment is determined.

Here, the second duration range may be a period of time for which the historical speech segment lasts from a start of acquisition to an end of acquisition. The method comprises the steps of obtaining current voice data, obtaining historical voice recognition results, and intercepting each historical voice fragment and a corresponding first duration range according to the segmentation information of the historical voice recognition results obtained before the current voice data.

Step 620, filtering the browsed content in the second duration range from the application record data of the plurality of associated users.

Here, the browsing contents in the application usage data of a plurality of associated users and the time of generating the browsing contents may be acquired first. For example, the text content of the web page corresponding to the browsed web address of each user in different applications can be obtained, or the interface browsed by each user is automatically screen-captured, the text content in the interface is obtained based on the existing optical character recognition method, and the time for generating the browsed content is recorded. Then, according to the time of generating the browsing content, the browsing content in the second duration range is acquired. Fig. 7 is a schematic diagram of browsing content provided in the embodiment of the present invention, as shown in fig. 7, it is assumed that the second duration range is P0 to P1, where P0 may represent the beginning of the historical speech segment, T1 may represent the ending of the historical speech segment, and may also represent the beginning of the next historical speech segment. The browsed content obtained by screening in the second duration range is shown in fig. 7, where U1H1 represents the 1 st browsed content of the user 1 in the application, and UNHL represents the L-th browsed content of the user N in the application.

And 630, selecting at least one of browsing contents associated with the hotword, browsing contents associated with at least two users and browsing contents associated with the current scene as browsing content extension text corresponding to the historical voice fragment.

Here, the browsing content associated with the hotword may be selected as the browsing content extension text. For example, for any user, based on the hotword obtained from the collection process of the historical voice fragment, the correlation between the hotword and each browsing content of the user can be calculated by using the existing TF-IDF strategy, and the browsing content with higher correlation is screened out as the browsing content expansion text. For the rest browsing contents, the relevance measurement can be carried out on the browsing contents from different users, and if the relevance is strong, the browsing contents can be used as browsing content extension text. For the browsing content expanded text obtained in the two modes, for convenience of description, the browsing content expanded text can be called as important browsing content expanded text among users, and the importance coefficient can be set to be 1+the selected text quantity/total text quantity so as to emphasize browsing content focused by different users. The selected text quantity is the extended text quantity of the browsing contents screened in the two modes in all the browsing contents, and the total text quantity is the quantity of all the browsing contents.

For the rest browsing content, the relevance measurement between the browsing content and the current existing voice recognition result can be calculated, and if the relevance is strong, the browsing content can be selected as the browsing content expansion text so as to ensure the strong relevance between the browsing content expansion text and the current scene. For the browsing content extension text obtained in this way, for convenience of description, it may be referred to as an important browsing content extension text within the user, and the importance coefficient thereof may be set to 1.

On the basis, when the candidate probability of any candidate word of any period corresponding to any historical voice fragment is determined based on each type of historical expanded text corresponding to any historical voice fragment and the corresponding importance coefficient thereof, the candidate probability of the candidate word of the corresponding user important browsing content expanded text, the hot word query expanded text and the preset expanded text can be calculated based on the corpus of language models respectively, and then weighted summation is carried out based on the importance coefficients corresponding to the user important browsing content expanded text, the hot word query expanded text and the preset expanded text, so that the candidate probability of the candidate word of the corresponding historical voice fragment is obtained. For example, the candidate probabilities for the candidate segmentations corresponding to the historical speech segments may be determined using the following formula:

P _Pi (w _x |w _x-2 w _x-1 )

＝U _Ei P _Ei (w _x |w _x-2 w _x-1 )+U _Ii P _Ii (w _x |w _x-2 w _x-1 )

+U _Si P _Si (w _x |w _x-2 w _x-1 )+U _Bi P _B (w _x |w _x-2 w _x-1 )

Wherein P is _Pi (w _x |w _x-2 w _x-1 ) Candidate probabilities of the candidate word segments corresponding to the historical speech segment i; p (P) _Ei (w _x |w _x- ₂ w _x-1 ) Expanding the candidate probability of the candidate word of the text for the important browsing content among the users, U _Ei As its importance coefficient; p (P) _Ii (w _x |w _x-2 w _x-1 ) Expanding the candidate probability of the candidate word of the text for the important browsing content in the corresponding user, U _Ii As its importance coefficient; p (P) _Si (w _x |w _x-2 w _x-1 ) Expanding the candidate probability of the candidate word of the text for the corresponding hotword query, U _Si As its importance coefficient; p (P) _B (w _x |w _x-2 w _x-1 ) Candidate word segmentation candidates for the candidate corresponding to the preset expanded textProbability, U _Bi Is the importance coefficient thereof.

Based on any of the above embodiments, fig. 8 is a flowchart of a voice recognition method according to another embodiment of the present invention, as shown in fig. 8, where the method includes:

at step 810, a sharing mechanism is established between a plurality of associated users. Information sharing among a plurality of associated users is performed so as to acquire application record data of each user.

Step 820, obtain application record data generated by each user using different applications. For example, query keywords input by each user through different applications such as search engine class, entertainment shopping class or life service class and browsing contents for browsing the query results are obtained.

In step 830, based on the application record data of the plurality of associated users, the hotword corresponding to the voice data to be recognized and the effective time thereof are determined. The hotword may be generated by using the hotword determining method provided in any of the above embodiments, which is not described herein. In addition, the effective time of each hotword is the end time of the first duration range.

In step 840, based on the application record data of the plurality of associated users, a history extension text corresponding to each history voice segment of the voice data to be recognized and the effective time thereof are determined. The history extension text comprises browsing content extension text, hotword inquiry extension text and preset extension text. The extended text of the browsing content may be generated by using the method for determining the extended text of the browsing content provided in any of the above embodiments, which is not described herein. The history expanded text corresponding to any one history voice segment is valid within the duration range of the next voice segment of the history voice segment.

Step 850, performing voice recognition on the voice data based on the hotword and the history expanded text corresponding to the voice data to be recognized, thereby obtaining a voice recognition result of the voice data.

The following describes a voice recognition device provided by an embodiment of the present invention, and the voice recognition device described below and the voice recognition method described above may be referred to correspondingly.

Based on any of the above embodiments, fig. 9 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention, and as shown in fig. 9, the device includes a voice data determining unit 910 and a voice recognition unit 920.

Wherein the voice data determining unit 910 is configured to determine voice data to be recognized;

the voice recognition unit 920 is configured to perform voice recognition on the voice data based on the scene-related text corresponding to the voice data, so as to obtain a voice recognition result of the voice data;

According to the device provided by the embodiment of the invention, the scene association text is extracted by acquiring the application record data of different users in the same voice recognition scene among different applications and utilizing the similarity of the focus points among the associated users, so that the auxiliary text with high association degree with the current scene is provided for the voice data to be recognized, and the accuracy of the voice recognition result obtained based on the scene association text is improved.

Based on any of the above embodiments, the voice recognition unit 920 includes:

the decoding unit is used for decoding the acoustic hidden layer characteristics of the voice data based on the scene associated text corresponding to the voice data to obtain the probability of each candidate word segmentation of each period of the voice data;

And a speech recognition result determining unit for determining a speech recognition result based on the probability of each candidate word for each period of the speech data.

the decoding unit includes:

the hot word excitation unit is used for correcting the probability of each candidate word of each period of voice data based on the hot word or based on the hot word and the excitation coefficient thereof, and determining a voice recognition result based on the corrected probability of each candidate word of each period.

According to the device provided by the embodiment of the invention, the scene associated text comprises the obtained hot words, so that the possibility that the candidate word serving as the hot word is selected as the corresponding word of any period is improved in a hot word excitation mode, and the accuracy of voice recognition is improved.

Based on any of the above embodiments, the apparatus further includes a hotword determining unit configured to:

determining a first duration range of historical voice data of the voice data;

screening query keywords input in a first duration range from application use data of a plurality of associated users;

and selecting at least a preset number of query keywords input by users, and/or selecting the query keywords input by each user and associated with the current scene as hotwords.

According to the device provided by the embodiment of the invention, the query keywords input by a plurality of users and/or the query keywords which are input by each user and are associated with the current scene are selected as the hot words, so that the association degree of the hot words and the current speech recognition scene is improved, and the accuracy of speech recognition is further improved.

According to the device provided by the embodiment of the invention, when the excitation coefficients are set for the hot words, the hot words which appear in the query keywords of at least two users, the hot words which have repeated words or similar words in the query keywords of any user, and the excitation coefficients of other hot words are gradually decreased, and the higher the frequency of occurrence of any hot word in the query keywords of different users is, the larger the excitation coefficient is, so that the hot words with different importance are distinguished, and the accuracy of voice recognition is further improved.

the decoding unit includes:

and the probability calculation unit is used for decoding the acoustic hidden layer characteristics of the voice data based on the universal corpus and the historical expanded text corresponding to each historical voice fragment to obtain the probability of each candidate word segmentation of the voice data in each period.

According to the device provided by the embodiment of the invention, the scene associated text comprises the history expanded text corresponding to each history voice fragment, so that the probability of candidate word segmentation according with the language expression mode given by each history expanded text is improved by updating the language model corpus, and the recognition result more according with the current scene language expression specification is obtained, thereby improving the accuracy of voice recognition.

Based on any of the above embodiments, the probability calculation unit includes:

the candidate probability calculation unit is used for decoding the acoustic hidden layer characteristics of any period of the voice data based on the universal corpus and the historical extension texts corresponding to the historical voice fragments respectively to obtain candidate probabilities of any candidate word segmentation of the period corresponding to the universal corpus and the historical voice fragments;

The probability determining unit is used for determining the probability of the candidate word segmentation based on the candidate probability of the candidate word segmentation corresponding to the general corpus and each historical voice segment and the weight corresponding to the general corpus and each historical voice segment;

According to the device provided by the embodiment of the invention, the general corpus and the historical extension texts corresponding to each historical voice segment are respectively used as the corpus of the language model, the candidate probability of any candidate word of the corresponding general corpus and each historical voice segment is calculated, the probability of the candidate word is determined based on the candidate probability of the candidate word of the corresponding general corpus and each historical voice segment and the weight corresponding to the general corpus and each historical voice segment, the importance of the historical extension texts corresponding to each historical voice segment is distinguished, the historical extension texts corresponding to the historical voice segments which are closer to the current voice data are highlighted, and the accuracy of voice recognition is improved.

Based on any of the above embodiments, the candidate probability calculation unit is configured to:

The apparatus further comprises a browse content extension text determination unit for:

determining a second duration range of the historical speech segment;

screening browsing contents in a second duration range from application record data of a plurality of associated users;

at least one of browsing content associated with the hotword, browsing content associated with at least two users and browsing content associated with the current scene is selected as browsing content extension text corresponding to the historical voice segment.

Fig. 10 illustrates a physical structure diagram of an electronic device, as shown in fig. 10, which may include: a processor 1010, a communication interface (Communications Interface) 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may invoke logic instructions in memory 1030 to perform a speech recognition method comprising: determining voice data to be recognized; performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data; the scene associated text is determined based on application record data of a plurality of associated users.

Further, the logic instructions in the memory 1030 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

In another aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the speech recognition method provided by the above-described method embodiments, the method comprising: determining voice data to be recognized; performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data; the scene associated text is determined based on application record data of a plurality of associated users.

In yet another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the speech recognition method provided by the above embodiments, the method comprising: determining voice data to be recognized; performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data; the scene associated text is determined based on application record data of a plurality of associated users.

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of speech recognition, comprising:

determining voice data to be recognized;

the scene association text is determined based on application record data of a plurality of associated users;

the step of performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data comprises the following steps:

decoding acoustic hidden layer features of the voice data based on scene associated text corresponding to the voice data to obtain the probability of each candidate word segmentation of each period of the voice data; the scene associated text is used for selecting correct words from words with the same or similar pronunciation;

2. The method of claim 1, wherein the scene-related text comprises a hotword;

3. The method of claim 2, wherein the hotword is determined based on the steps of:

determining a first duration range of historical voice data of the voice data;

4. A speech recognition method according to claim 2 or 3, wherein the excitation coefficients of the hotwords appearing in the query keywords of at least two users, the hotwords having the repeated or similar words in the query keywords of any one user, and the other hotwords are successively decreased, and the higher the frequency of occurrence of any one hotword in the query keywords of a different user, the higher the excitation coefficient thereof.

5. The method of claim 1, wherein the scene-related text comprises history extension text corresponding to each history voice segment of the voice data;

decoding acoustic hidden layer features of the voice data based on a general corpus and the historical expanded text corresponding to each historical voice segment to obtain the probability of each candidate word segmentation of the voice data in each period;

the history expanded text corresponding to any history voice segment is text obtained or expanded from application record data of a plurality of associated users in a period from the beginning of collection to the ending of collection of any history voice segment.

6. The method according to claim 5, wherein decoding the acoustic hidden layer feature of the speech data based on the general corpus and the history expanded text corresponding to each of the history speech segments to obtain the probability of each candidate word segment of the speech data for each period, comprises:

7. The method for recognizing speech according to claim 6, wherein the decoding the acoustic hidden layer feature of any period of the speech data based on the universal corpus and the history expanded text corresponding to each of the history speech segments, respectively, to obtain candidate probabilities of any candidate word segment of any period corresponding to the universal corpus and each of the history speech segments, comprises:

8. The method of claim 7, wherein each type of history extension text comprises at least one of a browsing content extension text, a hotword query extension text, and a preset extension text;

9. A speech recognition apparatus, comprising:

a voice data determining unit for determining voice data to be recognized;

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the speech recognition method according to any one of claims 1 to 8 when the program is executed.

11. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 8.