WO2024099055A1 - Speech recognition method, apparatus, and electronic device - Google Patents

Speech recognition method, apparatus, and electronic device

Info

Publication number
WO2024099055A1
WO2024099055A1 PCT/CN2023/125743 CN2023125743W WO2024099055A1 WO 2024099055 A1 WO2024099055 A1 WO 2024099055A1 CN 2023125743 W CN2023125743 W CN 2023125743W WO 2024099055 A1 WO2024099055 A1 WO 2024099055A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
sample
speech
subsequent
Prior art date
Application number
PCT/CN2023/125743
Other languages
English (en)
French (fr)
Inventor
屠明
柳璐
夏瑞
李鑫
黄传增
王雨轩
Original Assignee
脸萌有限公司
抖音视界有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 脸萌有限公司, 抖音视界有限公司
Publication of WO2024099055A1

Definitions

  • the embodiments of the present disclosure relate to the field of speech processing technology, and more particularly to a speech recognition method, device, and electronic device.
  • Speech recognition technology can convert speech information into text information.
  • an electronic device can convert a speech into text through automatic speech recognition technology and display the text corresponding to the speech.
  • a language model can be added to the speech recognition model.
  • the language model can predict the next text corresponding to the text associated with the current speech.
  • the language model can assist the speech recognition model in recognizing the next speech.
  • long-tail words (i.e., words with a low frequency of use)
  • the present disclosure provides a speech recognition method, device and electronic device, which are used to solve the technical problem of low accuracy of speech recognition in the prior art.
  • the present disclosure provides a speech recognition method, the speech recognition method comprising:
  • the first set including multiple text identifiers and a text feature corresponding to each text identifier in the text identifiers, the text feature being a feature associated with multiple subsequent texts of the text corresponding to the text identifier, the text feature being associated with frequencies of the multiple subsequent texts of the text in a text set, the first set being determined based on the text set;
  • the present disclosure provides a speech recognition device, comprising a first acquisition module, a second acquisition module, a third acquisition module and a determination module, wherein:
  • the first acquisition module is used to acquire a first voice
  • the second acquisition module is used to acquire a first text corresponding to a previous segment of speech of the first speech
  • the third acquisition module is used to acquire a first set, wherein the first set includes a plurality of text identifiers and a text feature corresponding to each of the plurality of text identifiers, wherein the text feature is a feature associated with multiple subsequent texts of the text corresponding to the text identifier, and the text feature is associated with the frequency of the multiple subsequent texts of the text in the text set, and the first set is determined based on the text set;
  • the determination module is used to determine text content associated with the first speech based on the first text and the first set.
  • an embodiment of the present disclosure provides an electronic device including: a processor and a memory;
  • the memory stores computer-executable instructions
  • the processor executes the computer-executable instructions stored in the memory, so that the at least one processor executes the speech recognition method as described in the first aspect and various possible aspects of the first aspect.
  • an embodiment of the present disclosure provides a computer-readable storage medium in which computer-executable instructions are stored.
  • when a processor executes the computer-executable instructions, the speech recognition method as described in the first aspect and its various possible implementations is implemented.
  • the present disclosure provides a computer program product, including a computer program.
  • the computer program, when executed by a processor, implements the speech recognition method as described in the first aspect and its various possible implementations.
  • the present disclosure provides a speech recognition method, device and electronic device.
  • the electronic device can obtain a first speech and obtain a first text corresponding to the previous speech of the first speech.
  • the electronic device can obtain a first set, wherein the first set includes multiple text identifiers and text features corresponding to each of the multiple text identifiers.
  • the text features are features associated with multiple subsequent texts of the text corresponding to the text identifier.
  • the text features are associated with the frequencies of the multiple subsequent texts of the text in the text set.
  • the first set is determined based on the text set. Based on the first text and the first set, the text content associated with the first speech is determined.
  • because the text features in the first set are associated with the frequencies of the multiple subsequent texts of the text in the text set, the text features in the first set can integrate more features of low-frequency words.
  • the electronic device can obtain more context information through the first set, and then accurately recognize the first speech, thereby improving the accuracy of speech recognition.
  • FIG1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • FIG2 is a flow chart of a speech recognition method provided by an embodiment of the present disclosure.
  • FIG3A is a schematic diagram of a process for obtaining a first text provided by an embodiment of the present disclosure
  • FIG3B is a schematic diagram of another process of obtaining a first text provided by an embodiment of the present disclosure.
  • FIG4 is a schematic diagram of a process for obtaining a first text feature provided by an embodiment of the present disclosure
  • FIG5 is a schematic diagram of a process for determining a second text provided by an embodiment of the present disclosure
  • FIG6 is a schematic diagram of a method for obtaining a first set provided by an embodiment of the present disclosure
  • FIG7 is a schematic diagram of a process for determining an initial set provided by an embodiment of the present disclosure.
  • FIG8 is a schematic diagram of a process for updating a first text feature provided by an embodiment of the present disclosure
  • FIG9 is a process diagram of a speech recognition method provided by an embodiment of the present disclosure.
  • FIG10 is a schematic diagram of the structure of a speech recognition device provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present disclosure.
  • An electronic device is a device with a wireless transceiver function. Electronic devices can be deployed on land, indoors or outdoors, and can be handheld, wearable, or vehicle-mounted.
  • the electronic device can be a mobile phone, a tablet computer, a computer with wireless transceiver function, a virtual reality (VR) electronic device, an augmented reality (AR) electronic device, a wireless terminal in industrial control, a vehicle-mounted electronic device, a wireless terminal in self-driving, a wireless electronic device in remote medical, a wireless electronic device in smart grid, a wireless electronic device in transportation safety, a wireless electronic device in smart city, a wireless electronic device in smart home, a wearable electronic device, etc.
  • the electronic devices involved in the embodiments of the present disclosure can also be called terminals, user equipment (UE), access electronic devices, vehicle-mounted terminals, industrial control terminals, UE units, UE stations, mobile stations, remote stations, remote electronic devices, mobile devices, UE electronic devices, wireless communication devices, UE agents, UE apparatuses, etc.
  • An electronic device may also be fixed or mobile.
  • when performing speech recognition, a language model can be added to the speech recognition model. The language model can predict the next segment of text following the text associated with the current speech, and can thereby assist the speech recognition model in recognizing the next segment of speech. For example, when the speech recognition model obtains a text, the language model can predict the next text, and the speech recognition model recognizes the next segment of speech using both the collected speech and the next text predicted by the language model.
  • the samples in the training sets of speech recognition models and language models are usually samples obtained from the Internet, and there are fewer long-tail words (words with less frequent use) in the training sets obtained from the Internet.
  • the speech recognition model and language model cannot learn more long-tail word information during the training process, which makes the speech recognition model and language model have low recognition accuracy for long-tail words, resulting in low speech recognition accuracy.
  • an electronic device obtains a first speech and obtains a first text corresponding to the previous speech of the first speech. The electronic device can obtain a first set, wherein the first set includes multiple text identifiers and text features corresponding to each of the multiple text identifiers. The text features are features associated with the subsequent texts of the text corresponding to the text identifier: the lower the frequency of a subsequent text in the text set, the more features of that subsequent text are integrated in the text features. The electronic device can determine the second text that follows the first text based on the first text and the first set, and determine the text content associated with the first speech based on the second text and the first speech.
  • the electronic device can obtain more context information of the first text through the first text features, improve the prediction accuracy of the next text (that is, the predicted text of the first speech), and then accurately recognize the first speech in combination with the next text, thereby improving the accuracy of speech recognition.
  • FIG1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure.
  • Referring to FIG. 1, the application scenario includes a language model, a speech recognition model, and a first set.
  • the first text of the previous speech of the first speech is input into the language model.
  • the language model can obtain the context information associated with the first text in the first set based on the identifier of the first text, and then predict the next text of the first text based on the first text and the context information associated with the first text.
  • the speech recognition model obtains the first speech
  • the speech recognition model can determine the text content of the first speech based on the next text and the first speech. In this way, when predicting the next text through the first text, the text prediction model can obtain the context information associated with the first text in the first set. Therefore, the text prediction model can accurately predict the next text corresponding to the previous speech of the first speech, thereby assisting the speech recognition model to accurately recognize the first speech and improve the accuracy of speech recognition.
  • FIG. 1 is only an example of an application scenario of the embodiment of the present disclosure, and is not intended to limit the application scenario of the embodiment of the present disclosure.
  • FIG2 is a flow chart of a speech recognition method provided by an embodiment of the present disclosure. Referring to FIG2 , the method may include:
  • the execution subject of the embodiment of the present disclosure may be an electronic device, or a speech recognition device arranged in the electronic device.
  • the speech recognition device may be implemented by software, or by a combination of software and hardware, which is not limited in the embodiment of the present disclosure.
  • the first voice may be any voice acquired by the electronic device.
  • the first voice may be a real-time voice of a user acquired by the electronic device, or the first voice may be a stored voice acquired in the memory of the electronic device, or the electronic device may acquire the first voice through other electronic devices, which is not limited in the embodiments of the present disclosure.
  • the electronic device may receive a voice sent by other electronic devices and determine the voice as the first voice.
  • the first speech can be a speech of any length.
  • the speech may include 1 sound, 2 sounds or 3 sounds, etc.
  • the first speech may also be 1 second of speech, 2 seconds of speech or 3 seconds of speech, etc., which is not limited in the embodiments of the present disclosure.
  • S202 Obtain a first text corresponding to a previous segment of speech of the first speech.
  • the first text may be a text character.
  • the first text may be a text consisting of 1 character, or the first text may be a text consisting of 2 characters, which is not limited in the embodiment of the present disclosure.
  • the previous speech of the first speech may include one or more Chinese characters or words, which is not limited in the embodiment of the present disclosure.
  • the electronic device when the electronic device acquires the first voice, it can acquire the previous voice of the first voice based on the first voice, and the electronic device can recognize the previous voice of the first voice through ASR technology to obtain the first text. For example, if the previous voice of the first voice acquired by the electronic device is the voice "hello", the electronic device can convert the previous voice of the first voice through ASR technology to obtain the text "hello".
  • the electronic device may determine the input text as the first text. For example, if the user inputs the text "you” to the electronic device, the electronic device may determine the text "you” as the first text; if the user inputs the text "hello” to the electronic device, the electronic device may determine the text "hello” as the first text; if the user inputs the text "the weather is great today” to the electronic device, the electronic device may determine the text "the weather is great today” as the first text.
  • the electronic device can determine the last M characters of the input text as the first text (M is greater than 1 and less than the number of characters in the input text), which can improve the accuracy of text prediction. For example, if the text input by the user to the electronic device is "The weather is really good today", the electronic device can determine the text "good" as the first text, or it can determine the text "really good" as the first text; the embodiments of the present disclosure are not limited in this respect.
  • FIG3A is a schematic diagram of a process of obtaining a first text provided by an embodiment of the present disclosure.
  • an electronic device is included.
  • the display page of the electronic device is a chat page between user B (the user of the electronic device) and user A.
  • User A sends the text message "It is a nice day today", and user B replies with the text message "Yes". When user B then enters the text "I am Let's go", the electronic device determines that the first text is "Let's go".
  • FIG3B is another schematic diagram of a process for obtaining a first text provided by an embodiment of the present disclosure.
  • the last voice message sent by the user to the electronic device is “The weather is so nice today”.
  • After the electronic device receives the voice message, it can convert the voice message into the text "The weather is so nice today" through ASR technology, and determine that the first text is the last 4 characters of the text, "The weather is so nice". In this way, the electronic device can obtain the first text through the user's voice, thereby improving the flexibility and efficiency of text acquisition.
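The truncation to the last M characters can be sketched as follows. This is a minimal illustration, assuming the input text is available as a Python string; the function name `last_m_characters` and the sample string are hypothetical and do not come from the disclosure.

```python
def last_m_characters(text: str, m: int) -> str:
    """Return the last m characters of text as the candidate first text.

    Enforces the constraint stated in the description: 1 < m < len(text).
    """
    if not 1 < m < len(text):
        raise ValueError("m must be greater than 1 and less than the text length")
    return text[-m:]


# Hypothetical input; any character sequence works the same way.
first_text = last_m_characters("speech recognition", 11)
print(first_text)  # recognition
```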
  • the first set includes multiple text identifiers and text features corresponding to each of the multiple text identifiers.
  • the first set may include multiple correspondences, each of which includes a text identifier and a text feature.
  • the first set may be a dictionary, and the dictionary may include multiple key-value pairs.
  • the text identifier may be a key value of the text.
  • the text identifier may be a key value associated with a text character.
  • the text identifier of the text is the key value corresponding to the character
  • the text identifier corresponding to the text may be determined based on the two characters.
  • an electronic device may obtain multiple texts in a text set, and then determine the text identifier corresponding to each text, and add the text identifier to the first set during the construction of the first set.
  • the electronic device can determine the text identifier of the text based on the following feasible implementation method: determine the character identifier corresponding to each character in the text and the number of texts in the first set.
  • the character identifier can be the serial number of the text in the first set, and the number of texts can be the total number of texts in the first set. For example, if the first set can include 10,000 texts, the number of texts is 10,000.
  • for example, when the first set includes 10,000 texts, if the serial number of text A is 2000 and the serial number of text B is 3000, then the character identifier of text A is 2000 and the character identifier of text B is 3000.
  • the text identifier is determined based on the character identifier and the number of texts. For example, if the character identifier corresponding to the text is 2000 and the number of texts is 10000, then the text identifier corresponding to the text is 2000 mod 10000, where mod is a modulus operator. This method can be used to determine the text identifiers corresponding to multiple texts.
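The modulo mapping in the example above can be written as a one-line function. This is a sketch only; the function name `text_identifier` is hypothetical, and the inputs are assumed to be plain integers.

```python
def text_identifier(character_id: int, num_texts: int) -> int:
    """Map a character identifier to a text identifier (key) via modulo."""
    if num_texts <= 0:
        raise ValueError("num_texts must be positive")
    return character_id % num_texts


print(text_identifier(2000, 10000))  # 2000
print(text_identifier(3000, 10000))  # 3000
```

Because the result is always in the range 0 to `num_texts - 1`, it can be used directly as a row index or dictionary key for the first set.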
  • the text feature is a feature associated with multiple subsequent texts of the text corresponding to the text identifier.
  • the subsequent text is each subsequent text corresponding to the text in the text set.
  • the subsequent texts of the text "I” are the text "they” and the text "eat”.
  • the electronic device can obtain the character features corresponding to the text "they” and the character features corresponding to the text "eat”, and based on the two character features, determine the text features corresponding to the text "I”.
  • the first set is determined based on a text set
  • the text set may be a training sample set, or any set including multiple texts, which is not limited in the embodiments of the present disclosure.
  • the first set includes text identifiers corresponding to the first text. Therefore, when predicting the next paragraph of the first text, the electronic device can obtain information about subsequent texts associated with the first text in the first set, thereby assisting the electronic device in predicting the next paragraph of text and improving the accuracy of text prediction.
  • the text feature is associated with the frequency of various subsequent texts of the text in the text set. For example, if the frequency of the subsequent text in the text set is high, the electronic device can reduce the proportion of the feature corresponding to the subsequent text in the text feature; if the frequency of the subsequent text in the text set is low, the electronic device can increase the proportion of the feature corresponding to the subsequent text in the text feature. In this way, each text feature in the first set can incorporate more low-frequency words, thereby improving the accuracy of text prediction.
  • the frequency of the subsequent text in the text set is the frequency of the combination of the text and the subsequent text in the text set.
  • one of the subsequent texts of the text "I” is " ⁇ ". If the phrase " ⁇ " appears 1000 times in the text set, then the frequency corresponding to the subsequent text " ⁇ ” is determined to be 1000 times. If the text " ⁇ " appears in the text set, although the text " ⁇ ” also appears in the phrase, the text “ ⁇ ” has nothing to do with the text "I”. For the phrase " ⁇ ", it will not affect the frequency of the subsequent text " ⁇ ” of the text "I”.
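The frequency rule above can be illustrated with a small sketch. It counts only occurrences where a token directly follows the given text, then derives weights that give rarer successors a larger share. The inverse-frequency weighting, the function names, and the tokenized sample data are assumptions made for illustration; the disclosure does not specify an exact weighting formula.

```python
from collections import Counter


def successor_frequencies(sentences, text):
    """Count how often each token directly follows `text` in the text set."""
    counts = Counter()
    for tokens in sentences:
        for current, following in zip(tokens, tokens[1:]):
            if current == text:
                counts[following] += 1
    return counts


def inverse_frequency_weights(counts):
    """Give rarer successors a larger share; the weights sum to 1.

    Assumed weighting scheme, not taken from the disclosure.
    """
    inverse = {token: 1.0 / n for token, n in counts.items()}
    total = sum(inverse.values())
    return {token: w / total for token, w in inverse.items()}


# Hypothetical tokenized text set.
text_set = [
    ["I", "eat"], ["I", "eat"], ["I", "eat"],
    ["I", "go"],
    ["you", "eat"],  # "eat" does not follow "I" here, so it is not counted
]
counts = successor_frequencies(text_set, "I")
weights = inverse_frequency_weights(counts)
print(counts["eat"], counts["go"])  # 3 1
print(weights)
```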
  • S204 Determine text content associated with the first voice based on the first text and the first set.
  • the electronic device may determine the text content associated with the first voice based on the following feasible implementation method: based on the first text and the first set, determine the next second text of the first text; based on the second text and the first voice, determine the text content associated with the first voice.
  • the second text may be the next text of the first text. For example, if a text is "we”, if the first text is "I”, then the second text is "we”, and in the disclosed embodiment, the electronic device can predict the next text to be "we” based on the text "I”.
  • the second text is the next segment of text predicted by the electronic device based on the first text, and the second text is associated with the first speech.
  • the electronic device determines the next second text of the first text based on the first text and the first set, specifically by obtaining a first identifier of the first text, obtaining first text features associated with multiple subsequent texts of the first text in the first set based on the first identifier, and determining the second text based on the first text and the first text features.
  • the method for obtaining the first identifier of the first text is the same as step S203, and will not be repeated in detail in the embodiment of the present disclosure.
  • the first text feature may be features corresponding to multiple subsequent texts of the first text.
  • the first set may include identifiers corresponding to the first text
  • the first set may contain text features corresponding to multiple subsequent texts of the first text.
  • first text features associated with multiple subsequent texts of the first text are obtained in the first set, specifically: a target identifier identical to the first identifier is determined from multiple text identifiers in the first set, and the text feature corresponding to the target identifier is determined as the first text feature.
  • the first set may include multiple indexes (an index may be a key-value pair, i.e., a text identifier included in the first set and the text feature corresponding to that text identifier). Each index may indicate a correspondence between a key value and a text feature. After the electronic device obtains the key value corresponding to the first text, it obtains the text feature corresponding to that key value from the first set and determines that text feature as the first text feature.
  • for example, if the first set includes the index key value A-text feature A and the index key value B-text feature B, then when the key value of the first text obtained by the electronic device is key value A, the first text feature associated with multiple subsequent texts of the first text is text feature A; when the key value of the first text is key value B, the first text feature is text feature B.
  • FIG4 is a schematic diagram of a process for obtaining a first text feature provided by an embodiment of the present disclosure.
  • the first set includes n key values and text features corresponding to each key value.
  • the first set may include key value 1-text feature 1, key value 2-text feature 2, ..., key value n-text feature n, wherein the text feature is a feature associated with multiple subsequent texts of the text corresponding to the key value.
  • the electronic device can obtain text feature 2 corresponding to key value 2 in the first set, and determine text feature 2 as the first text feature associated with multiple subsequent texts of the first text. In this way, the electronic device can accurately obtain the context information of the first text in the first set, thereby improving the prediction accuracy of the next paragraph of the first text.
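The lookup by key value can be sketched with an ordinary dictionary. The feature vectors, the function name `lookup_first_text_feature`, and the choice to return `None` for a missing key are illustrative assumptions, not details from the disclosure.

```python
from typing import Dict, List, Optional

# Hypothetical first set: text identifier (key value) -> text feature (vector).
first_set: Dict[int, List[float]] = {
    1: [0.1, 0.2],
    2: [0.3, 0.4],
    3: [0.5, 0.6],
}


def lookup_first_text_feature(
    first_set: Dict[int, List[float]], first_identifier: int
) -> Optional[List[float]]:
    """Return the feature whose key equals the first identifier, or None."""
    return first_set.get(first_identifier)


print(lookup_first_text_feature(first_set, 2))  # [0.3, 0.4]
```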
  • the electronic device determines the second text based on the first text and the first text feature, specifically: based on the first text, determines the context feature associated with the first text. For example, the electronic device can obtain the context feature associated with the first text based on the feature extraction model. For example, the sentence containing the first text includes character A, character B and character C. If the first text is character C, the electronic device can obtain the context feature associated with character C based on the feature extraction model, and the context feature includes information of character A, character B and character C.
  • the electronic device may determine the second text based on the following feasible implementation: fuse the first text feature and the context feature to obtain a fused feature, and determine the second text based on the fused feature.
  • the electronic device may fuse the first text feature and the context feature through an attention mechanism to obtain a fused feature, and then use a text prediction module (which may be the convolution layer in an existing text prediction model, such as the output layer of the text prediction model) to process the fused feature to obtain the second text.
  • the electronic device can obtain an accurate second text and improve the accuracy of text prediction.
  • FIG5 is a schematic diagram of a process for determining a second text provided by an embodiment of the present disclosure. Please refer to FIG5, which includes character A, character B, and character C.
  • Character A, character B, and character C are three characters in a sentence. Character A, character B, and character C are sampled to obtain sampling information, and the context feature corresponding to character C is obtained through the sampling information of character A, the sampling information of character B, and the sampling information of character C.
  • the context feature corresponding to character C and the first text feature corresponding to character C are processed by the attention mechanism to obtain the fused feature corresponding to character C, and the fused feature is processed by the text prediction module to obtain the next character of character C.
  • the electronic device when performing text prediction, not only obtains the context feature including the context information of character C, but also can obtain multiple subsequent character information of character C in the first set, thereby improving the accuracy of text prediction.
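One possible way to fuse the context feature with the first text feature is scaled dot-product attention, sketched below in plain Python. The disclosure only states that an attention mechanism is used; the specific form here (context feature as query, stored successor vectors as keys and values, attended vector added to the context feature) and the name `fuse_with_attention` are assumptions for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_attention(context: List[float], memory: List[List[float]]) -> List[float]:
    """Attend over memory rows with the context feature as the query.

    The attended vector is added to the context feature; this residual-style
    fusion is a choice made for this sketch only.
    """
    scale = math.sqrt(len(context))
    scores = [sum(c * m for c, m in zip(context, row)) / scale for row in memory]
    weights = softmax(scores)
    attended = [
        sum(w * row[i] for w, row in zip(weights, memory))
        for i in range(len(context))
    ]
    return [c + a for c, a in zip(context, attended)]


context_feature = [1.0, 0.0]                   # hypothetical context feature
first_text_feature = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical stored successor vectors
fused = fuse_with_attention(context_feature, first_text_feature)
print(fused)
```

The fused vector would then be passed to the text prediction module, which is not modeled here.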
  • the electronic device determines the text content associated with the first voice based on the second text and the first voice, specifically by performing text recognition processing on the first voice to obtain the third text.
  • the electronic device can recognize the first voice through ASR technology to obtain the third text.
  • if the second text and the third text are the same, the electronic device can determine the second text or the third text as the text content associated with the first voice; if the second text and the third text are different, the electronic device can determine the text content associated with the first voice based on the distribution probability of the second text and the distribution probability of the third text. For example, if the distribution probability of the text recognized by ASR is 70%, and the distribution probability of the text predicted by the method of the embodiment of the present disclosure is 30%, the electronic device can randomly assign the second text or the third text to the first voice based on the above distribution probability.
  • the third text is determined to be text content associated with the first voice
  • the embodiment of the present disclosure can also determine the text content of the first speech based on the second text and the third text in other ways (such as modifying the third text based on the second text, etc.), and the embodiment of the present disclosure is not limited to this. In this way, since the second text is predicted based on more long-tail word information, the second text can assist the speech recognition model in recognizing the first speech, thereby improving the accuracy of speech recognition.
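The probability-based selection between the predicted text and the ASR text can be sketched as weighted random sampling. The function name `choose_text`, the candidate strings, and the use of a seeded `random.Random` for reproducibility are assumptions for illustration; the 30%/70% split follows the example in the description.

```python
import random
from typing import Optional


def choose_text(
    second_text: str,
    third_text: str,
    p_second: float,
    p_third: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the final text; sample by probability when the candidates differ."""
    if second_text == third_text:
        return third_text
    rng = rng or random.Random()
    return rng.choices([second_text, third_text], weights=[p_second, p_third], k=1)[0]


# Hypothetical candidates: predicted text vs. ASR text.
result = choose_text("go", "goal", p_second=0.3, p_third=0.7, rng=random.Random(0))
print(result)
```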
  • the second text of the embodiment of the present disclosure may also include multiple text information, and the text content of the first voice is determined through multiple text information and the third text.
  • the second text may include multiple subsequent texts of the first text, and then the text content of the first voice is determined through multiple subsequent texts and the third text.
  • the first text of the previous voice corresponding to the first voice is "we”
  • the third text of the first voice is "go”
  • the next second text of the first text may include the text "go”
  • the electronic device can determine that the text content of the first voice is "go” based on the multiple text information included in the second text.
  • the text content corresponding to the first voice is a long-tail word, even if the ASR cannot accurately recognize the text of the first voice, the text content recognized by the ASR can be corrected through the second text, thereby improving the accuracy of text recognition.
  • the disclosed embodiment provides a method for speech recognition, wherein an electronic device obtains a first speech and obtains a first text corresponding to a previous segment of speech of the first speech, the electronic device can obtain a first set, and based on the first text and the first set, determine a second segment of text that is the next segment of the first text, and the electronic device determines the text content associated with the first speech based on the second text and the first speech.
  • the first set can include more long-tail word information, and the electronic device can obtain more contextual information of the first text through the first text features, thereby improving the prediction accuracy of the next segment of text, and then accurately recognize the first speech in combination with the next segment of text, thereby improving the accuracy of speech recognition.
  • the above-mentioned speech recognition method further includes a method for obtaining a first set.
  • the method for obtaining the first set is described below in conjunction with FIG. 6.
  • FIG6 is a schematic diagram of a method for obtaining a first set provided by an embodiment of the present disclosure. Referring to FIG6 , the method flow includes:
  • S601 Obtain sample identifiers of multiple sample texts in a text set and sample text features corresponding to subsequent texts of the sample texts.
  • the electronic device needs to determine the size of the first set.
  • the electronic device can construct the first set as a matrix of size U × M × d_emb, where U is the size of the set (for example, if U is 10,000, the first set can store text features corresponding to 10,000 texts) and M is a preset hyperparameter indicating the number of vectors that can be allocated to each entry in the set, where an entry is used to store the text features of subsequent texts. The larger M is, the more text features of subsequent texts can be stored.
  • d_emb is the dimension of the text feature.
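  • A minimal sketch of allocating a container with the U × M × d_emb shape described above, using plain nested lists. The function name `allocate_first_set`, the zero initialization, and the small sizes are assumptions for illustration; the source does not specify how the matrix is stored or initialized.

```python
def allocate_first_set(u, m, d_emb):
    """Return a U x M x d_emb nested list filled with zeros.

    u: number of entries (the size of the set).
    m: number of vectors allocated to each entry.
    d_emb: dimension of each text feature vector.
    """
    # Build each row independently so rows never share the same list object.
    return [[[0.0] * d_emb for _ in range(m)] for _ in range(u)]


# Tiny illustrative sizes; the text mentions U = 10,000 as an example.
first_set = allocate_first_set(u=4, m=2, d_emb=3)
```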
  • the text set may include multiple sample texts, and the electronic device may obtain sample identifiers corresponding to the multiple sample texts and sample text features corresponding to subsequent texts of the sample texts.
  • the sample identifier may be a key value corresponding to the sample text.
  • For example, if the sample text is character A, the sample identifier may be the result of a mod operation between the character sequence number of character A and the number of texts in the text set. It should be noted that the method for obtaining the key value has been described in detail in step S202, and the embodiments of the present disclosure will not repeat it here.
  • the sample text feature may be a feature corresponding to the subsequent text of the sample text. For example, if the sentence is "we”, and the sample text is "I”, then the subsequent text of the sample text is "we”, and the sample text feature is the feature corresponding to "we” (embedding).
  • the electronic device can obtain the features corresponding to all subsequent texts of the sample text in the text set, and then obtain the sample text features. For example, for the text "I”, the electronic device can obtain the subsequent texts of the text "I” in the text set. If the text set includes the texts “we”, “I eat”, “I go” and “I come”, the electronic device determines that the subsequent texts of the text "I” include “we”, “eat”, “go” and “come”, and the electronic device can obtain the features corresponding to each subsequent text.
  • the text prediction model is trained, so that the features corresponding to each subsequent text can be obtained in the convolution layer in the text prediction model.
  • the parameters of the feature extraction network include the embedding corresponding to each text, and the electronic device can obtain the embedding of each subsequent text based on the parameters of the feature extraction network. In this way, the efficiency of feature acquisition can be improved.
  • S602 Determine an initial set based on the sample identifier and the sample text feature.
  • the initial set includes multiple sample identifiers and sample text features corresponding to each sample identifier.
  • the electronic device can establish an index of the set, and the initial set can be obtained after the index is established.
  • the electronic device can obtain the key value corresponding to each sample text, and associate the key value with the features of multiple subsequent texts of the sample text to obtain a key-value pair, so that each key value can be associated with an M × d_emb matrix, which can store the subsequent text information of the text corresponding to the key value.
  • FIG. 7 is a schematic diagram of a process for determining an initial set provided by an embodiment of the present disclosure.
  • Taking character 1 as an example to illustrate how the content associated with character 1 in the initial set is obtained: referring to FIG. 7, the text set includes character 1, character 2, character 3, ..., character n, where character 2, character 3, ..., character n are subsequent texts of character 1.
  • the electronic device can construct n key-value pairs corresponding to character 1, wherein the n key-value pairs may include key value A-features of character 2, key value A-features of character 3, ..., key value A-features of character n, thereby obtaining an initial set.
  • the matrix corresponding to key value A includes features of character 2, features of character 3, ..., features of character n.
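  • The construction of the initial set can be sketched as follows, assuming single-character texts, a lookup table `char_index` of character sequence numbers, and a table `embedding` of already-trained character features. These names, and the choice to keep only the first M distinct subsequent characters per key, are assumptions for illustration rather than details given in the source.

```python
def build_initial_set(sentences, char_index, embedding, num_entries, m):
    """Map each key value to the features of the characters that follow its text.

    sentences: iterable of strings (the text set).
    char_index: dict of character -> sequence number (assumed lookup table).
    embedding: dict of character -> feature vector (assumed trained features).
    num_entries: modulus used to turn a sequence number into a key value.
    m: maximum number of feature vectors stored per key value.
    """
    initial = {}
    followers_seen = {}
    for sentence in sentences:
        # Walk over every adjacent (text, subsequent text) pair in the sentence.
        for current, following in zip(sentence, sentence[1:]):
            key = char_index[current] % num_entries
            followers = followers_seen.setdefault(key, [])
            if following in followers or len(followers) >= m:
                continue  # already stored, or the entry is full
            followers.append(following)
            initial.setdefault(key, []).append(list(embedding[following]))
    return initial


char_index = {"a": 0, "b": 1, "c": 2, "d": 3}
embedding = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [1.0, 1.0]}
text_set = ["ab", "ac", "ad", "ab"]
initial_set = build_initial_set(text_set, char_index, embedding, num_entries=10, m=4)
```

  • Here "a" plays the role of character 1 and "b", "c", "d" the roles of its subsequent characters, so the entry for the key of "a" holds their three feature vectors.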
  • S603 Based on the multiple sample texts, update the features of the multiple sample texts in the initial set to obtain a first set.
  • the electronic device may update the multiple sample text features in the initial set so that the sample text features include more low-frequency word information, thereby improving the accuracy of text prediction.
  • multiple sample text features in the initial set are updated, specifically: a first sample identifier corresponding to the first sample text and a sample subsequent text of the first sample text are obtained.
  • multiple sentences in a text set may include multiple sample subsequent texts corresponding to the first sample text.
  • each sentence in the text set needs to be used to update the initial set. Therefore, the electronic device can obtain any sample subsequent text of the first sample text, or determine the subsequent text of the first sample text in the sentence currently used for updating as the sample subsequent text.
  • the embodiments of the present disclosure are not limited to this.
  • the first sample text feature is determined in the initial set. For example, referring to FIG7 , if the first sample text is character 1, and the first sample identifier corresponding to the first sample text is key value A, then the first sample text feature is the matrix corresponding to key value A, that is, the first sample text feature includes the feature of character 2, the feature of character 3, ..., the feature of character n.
  • the first frequency with which the combination of the first sample text and the sample subsequent text occurs in the text set, and the subsequent text feature corresponding to the sample subsequent text, are obtained. For example, referring to FIG. 7, if the first sample text is character 1 and the sample subsequent text is character 2, the electronic device can determine the feature of character 2 as the subsequent text feature, and the electronic device can obtain the number of occurrences of the combination of character 1-character 2 in the text set. If character 1-character 2 appears 100 times, the electronic device determines the first frequency to be 100 times.
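  • Counting how often a text and its subsequent text occur together can be sketched with adjacent-pair counts. The function name `pair_frequencies` and the toy sentences are illustrative assumptions; single-character texts are assumed.

```python
from collections import Counter


def pair_frequencies(sentences):
    """Count every adjacent (text, subsequent text) pair in the text set."""
    counts = Counter()
    for sentence in sentences:
        # zip pairs each character with the character that follows it.
        counts.update(zip(sentence, sentence[1:]))
    return counts


counts = pair_frequencies(["ab", "cb", "abd", "ae"])
# The frequency for "b" as the subsequent text of "a" only counts "ab" pairs;
# the pair "cb" contributes to a different key.
first_frequency = counts[("a", "b")]
```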
  • the first sample text feature is updated based on the first frequency and the subsequent text feature.
  • the electronic device may update the first sample text feature based on the following feasible implementation: based on the first frequency, determine the update ratio of the first sample text feature, and based on the update ratio and the subsequent text feature, randomly update the vector in the first sample text feature.
  • the electronic device may randomly update the vector in the first sample text feature based on the following formula:
  • where X_{k+1} follows a Bernoulli distribution; the vector before the update and the updated vector are both 1 × d_emb vectors, each representing the m-th row of the entry in the initial dictionary indexed by the current key value i; E_{k+1} represents the subsequent text feature; and the smoothing coefficient represents the smoothness of the update.
  • the electronic device can randomly update 50% of the vectors in the first text feature based on the subsequent text feature.
  • For example, the first text feature includes vector A and vector B. If P_{k+1} is 0.5, the electronic device can update either vector A or vector B based on the subsequent text feature.
  • The smoothing coefficient is between 0 and 1. If it is 1, vector A is not updated. If it is 0, vector A is replaced with the subsequent text feature. In practice, it can be 0.5. In this way, when vector A is updated, part of the information of vector A and part of the information of the subsequent text feature are superimposed, thereby improving the effect of feature fusion and the accuracy of text prediction.
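  • One form of the update that is consistent with the definitions above can be written as follows. The symbols $D_{i,m}^{k}$ (the m-th row under key value i before the update), $D_{i,m}^{k+1}$ (the updated row), and $\gamma$ (the smoothing coefficient) are assumed names chosen for illustration, and treating $P_{k+1}$ as the Bernoulli parameter is likewise an assumption drawn from the surrounding example.

```latex
X_{k+1} \sim \mathrm{Bernoulli}(P_{k+1}), \qquad
D_{i,m}^{k+1} = X_{k+1}\left(\gamma\, D_{i,m}^{k} + (1-\gamma)\, E_{k+1}\right)
              + \left(1 - X_{k+1}\right) D_{i,m}^{k}
```

  • Under this form, $\gamma = 1$ leaves the row unchanged and $\gamma = 0$ replaces it with $E_{k+1}$ whenever $X_{k+1} = 1$, which matches the two boundary cases described above.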
  • the first set is obtained after the initial set is updated by each sample text in the text set. For example, if the text set includes 10,000 texts, the first set corresponding to the initial set is obtained after the initial set is updated 10,000 times by 10,000 texts.
  • FIG8 is a schematic diagram of a process for updating a first text feature provided by an embodiment of the present disclosure.
  • the first sample text feature includes the feature of character 2 and the feature of character 3 (indicating that the subsequent characters of character 1 in the text set are only character 2 and character 3), and the sentence includes character 1 (first sample text) and character 2 (sample subsequent text).
  • the electronic device determines that character 2 appears 100 times after character 1.
  • the electronic device determines that the first frequency of character 2 is 100. Therefore, the electronic device determines that the update ratio corresponding to the first sample text feature is 50%.
  • the electronic device uses the feature of character 2 (the subsequent text feature) to update the feature of character 2 or the feature of character 3 in the first sample text feature. In this way, when the frequency with which the sample subsequent text appears is low, the electronic device updates the first sample text feature at a higher ratio, so that the first sample text feature can include more information about the sample subsequent text, thereby improving the accuracy of text prediction.
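  • The random update can be sketched as a Bernoulli-gated moving average over the rows of the first sample text feature. The function name `update_rows` and the parameter names `p_update` and `gamma` are assumptions; the source does not state how the first frequency is mapped to the update ratio, so the ratio is simply passed in.

```python
import random


def update_rows(rows, next_feature, p_update, gamma=0.5, rng=None):
    """Return a copy of rows in which each row is updated with probability p_update.

    rows: list of feature vectors stored under one key value.
    next_feature: the subsequent text feature.
    p_update: update ratio (probability that a given row is updated).
    gamma: smoothing coefficient in [0, 1]; 1 keeps the row, 0 replaces it.
    rng: optional random.Random instance for reproducible draws.
    """
    rng = rng or random.Random()
    updated = []
    for row in rows:
        if rng.random() < p_update:  # one Bernoulli draw per row
            row = [gamma * old + (1.0 - gamma) * new
                   for old, new in zip(row, next_feature)]
        updated.append(list(row))
    return updated


rows = [[0.0, 0.0], [2.0, 2.0]]
blended = update_rows(rows, [1.0, 1.0], p_update=1.0, gamma=0.5)
untouched = update_rows(rows, [1.0, 1.0], p_update=0.0, gamma=0.5)
```

  • With `p_update=0.5`, on average half of the rows are blended with the subsequent text feature, which corresponds to the 50% example above.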
  • the disclosed embodiment provides a method for obtaining a first set, obtaining sample identifiers of multiple sample texts in a text set and sample text features corresponding to subsequent texts of the sample texts, determining an initial set based on the sample identifiers and the sample text features, wherein the initial set includes multiple sample identifiers and sample text features corresponding to each sample identifier, and updating multiple sample text features in the initial set based on multiple sample texts to obtain a first set.
  • the text features in the first set may include more low-frequency word information, and when the electronic device predicts the next paragraph of the text, the electronic device can not only obtain context information in the text, but also obtain more context information corresponding to the text in the first set, thereby improving the accuracy of text prediction and the accuracy of speech recognition.
  • FIG9 is a process diagram of a speech recognition method provided by an embodiment of the present disclosure.
  • the first set includes a key-value pair of key value 1-text feature 1, a key-value pair of key value 2-text feature 2, ..., a key-value pair of key value n-text feature n.
  • When a user inputs voice A "hello" and voice B "beautiful" to an electronic device, the electronic device can convert voice A into the text "hello" through ASR technology and determine the last character in that text, "good", as the first text, where voice A is the previous voice of voice B.
  • the electronic device obtains the key value of the text "good" as key value 1. Therefore, the electronic device can obtain text feature 1 corresponding to key value 1 in the first set based on key value 1, and determine text feature 1 as the first text feature corresponding to the text "good" (the text feature may include features of multiple subsequent texts of the text "good").
  • the electronic device samples the text "you" and the text "good", and based on the sampling information of the text "you" and the sampling information of the text "good", obtains the contextual features associated with the text "good".
  • the electronic device processes the contextual features and text feature 1 through the attention mechanism to obtain fused features.
  • the electronic device can perform text prediction based on the fused features, and then obtain the next character of the text "good". If the text recognized from voice B, the voice following voice A, is the same as the predicted next character, the electronic device can determine that this character is the text content of voice B. In this way, when the electronic device performs text prediction, since the electronic device can not only obtain the context information of the first text based on the sentence, but also obtain more context information of the first text based on the first set, the electronic device can accurately predict the next paragraph of the first text, improve the accuracy of text prediction, and then accurately recognize the next paragraph of speech in combination with the next paragraph of text, thereby improving the accuracy of speech recognition.
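  • The text only states that the contextual features and the text feature are processed through an attention mechanism. As one possible reading, the sketch below uses dot-product attention with the contextual feature as the query and the rows of the text feature as keys and values, then adds the attended vector to the contextual feature. The function name `attend`, the dot-product scoring, and the additive fusion are all assumptions.

```python
import math


def attend(context, rows):
    """Fuse a contextual feature with the rows of a text feature.

    context: contextual feature vector (used as the query).
    rows: list of vectors stored for the first text (used as keys and values).
    Returns (attention weights, fused feature).
    """
    # Dot-product score between the query and each stored row.
    scores = [sum(c * r for c, r in zip(context, row)) for row in rows]
    # Softmax, shifted by the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the rows, then additive fusion with the query.
    attended = [sum(w * row[j] for w, row in zip(weights, rows))
                for j in range(len(context))]
    fused = [c + a for c, a in zip(context, attended)]
    return weights, fused


weights, fused = attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
```

  • The fused vector would then be passed to the prediction step that outputs the next character; that step is not shown here.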
  • FIG10 is a schematic diagram of the structure of a speech recognition device provided by an embodiment of the present disclosure.
  • the speech recognition device 100 includes a first acquisition module 101, a second acquisition module 102, a third acquisition module 103 and a determination module 104, wherein:
  • the first acquisition module 101 is used to acquire a first voice
  • the second acquisition module 102 is used to acquire a first text corresponding to a previous segment of speech of the first speech;
  • the third acquisition module 103 is used to acquire a first set, the first set including a plurality of text identifiers and a text feature corresponding to each of the plurality of text identifiers, the text feature being a feature associated with a plurality of subsequent texts of the text corresponding to the text identifier, the text feature being associated with a frequency of the plurality of subsequent texts of the text in the text set, the first set being determined based on the text set;
  • the determination module 104 is configured to determine text content associated with the first speech based on the first text and the first set.
  • the determining module 104 is specifically configured to:
  • the determining module 104 is specifically configured to:
  • the second text is determined based on the first text and the first text feature.
  • the determining module 104 is specifically configured to:
  • the text feature corresponding to the target identifier is determined as the first text feature.
  • the determining module 104 is specifically configured to:
  • the second text is determined based on the context feature and the first text feature.
  • the determining module 104 is specifically configured to:
  • the second text is determined.
  • the determining module 104 is specifically configured to:
  • the third acquisition module 103 is specifically used for:
  • Based on the sample identifier and the sample text feature, determining an initial set, wherein the initial set includes a plurality of sample identifiers and a sample text feature corresponding to each sample identifier;
  • multiple sample text features in the initial set are updated to obtain the first set.
  • the third acquisition module 103 is specifically used for:
  • the first sample text feature is updated based on the first frequency and the subsequent text feature.
  • the third acquisition module 103 is specifically used for:
  • the vector in the first sample text feature is randomly updated.
  • the speech recognition device provided in the embodiment of the present disclosure can be used to execute the technical solution of the above-mentioned method embodiment. Its implementation principle and technical effect are similar, and this embodiment will not be repeated here.
  • FIG11 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 1100 may be an electronic device or an electronic device.
  • the electronic device may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Devices, PADs), portable multimedia players (Portable Media Players, PMPs), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the electronic device shown in FIG11 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • an electronic device 1100 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 1101, which may be configured to store data stored in a read-only memory (ROM).
  • the processing device 1101, the ROM 1102 and the RAM 1103 are connected to each other via a bus 1104.
  • An input/output (I/O) interface 1105 is also connected to the bus 1104.
  • the following devices may be connected to the I/O interface 1105: input devices 1106 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 1107 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 1108 including, for example, a magnetic tape, a hard disk, etc.; and communication devices 1109.
  • the communication device 1109 may allow the electronic device 1100 to communicate wirelessly or wired with other devices to exchange data.
  • Although FIG. 11 shows an electronic device 1100 having various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices may alternatively be implemented or provided.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication device 1109, or installed from the storage device 1108, or installed from the ROM 1102.
  • the processing device 1101 the above-mentioned functions defined in the method of the embodiment of the present disclosure are executed.
  • the computer-readable medium of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries a computer-readable program code. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • an embodiment of the present disclosure further includes a speech system, wherein the speech system includes a speech-to-text device and the first dictionary described in any of the above embodiments, wherein the speech-to-text device is used to convert speech information into text information.
  • an embodiment of the present disclosure further includes a computer-readable storage medium, in which computer-executable instructions are stored.
  • When a processor executes the computer-executable instructions, the method described in any of the above embodiments is implemented.
  • an embodiment of the present disclosure further includes a computer program product, including a computer program, and when the computer program is executed by a processor, the method described in any of the above embodiments is implemented.
  • the electronic device in the embodiment of the present disclosure may include the above-mentioned voice system, computer-readable storage medium or computer program product.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.
  • the computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device executes the method shown in the above-mentioned embodiments.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or electronic device.
  • the remote computer may be connected to the user's computer via any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
  • each square box in the flow chart or block diagram can represent a module, a program segment or a part of a code, and the module, the program segment or a part of the code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the square box can also occur in a sequence different from that marked in the accompanying drawings. For example, two square boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each square box in the block diagram and/or flow chart, and the combination of the square boxes in the block diagram and/or flow chart can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or hardware.
  • the name of a unit does not limit the unit itself in some cases.
  • the first acquisition unit may also be described as a "unit for acquiring at least two Internet Protocol addresses".
  • FPGA Field Programmable Gate Array
  • ASIC Application Specific Integrated Circuit
  • ASSP Application Specific Standard Product
  • SOC System on Chip
  • CPLD Complex Programmable Logic Device
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, device, or equipment.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or equipment, or any suitable combination of the foregoing.
  • a more specific example of a machine-readable storage medium may include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a prompt message is sent to the user to clearly prompt the user that the operation requested to be performed will require obtaining and using the user's personal information.
  • the user can autonomously choose whether to provide personal information to software or hardware such as a terminal device, application, electronic device, or storage medium that performs the operation of the technical solution of the present disclosure according to the prompt message.
  • the method of actively requesting and sending prompt information to the user may be, for example, a pop-up window, in which the prompt information may be presented in text form.
  • the pop-up window may also carry a selection control for the user to choose "agree” or "disagree” to provide personal information to the terminal device.
  • the data involved in this technical solution shall comply with the requirements of the relevant laws and regulations.
  • the data may include information, parameters and messages, such as flow switching indication information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)

Abstract

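A speech recognition method, apparatus, and electronic device. The method includes: obtaining a first speech (S201); obtaining a first text corresponding to the previous segment of speech of the first speech (S202); obtaining a first set, where the first set includes a plurality of text identifiers and a text feature corresponding to each of the plurality of text identifiers, the text feature is a feature associated with multiple subsequent texts of the text corresponding to the text identifier, the text feature is associated with the frequencies of the multiple subsequent texts of the text in a text set, and the first set is determined based on the text set (S203); and determining, based on the first text and the first set, text content associated with the first speech (S204). The method can improve the accuracy of speech recognition.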

Description

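Speech recognition method, apparatus, and electronic device
This application claims priority to Chinese patent application No. 202211407243.8, filed with the China Patent Office on November 10, 2022 and entitled "Speech recognition method, apparatus, and electronic device", the entire contents of which are incorporated herein by reference.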
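Technical Field
The embodiments of the present disclosure relate to the field of speech processing technology, and in particular to a speech recognition method, apparatus, and electronic device.
Background
Speech recognition technology can convert speech information into text information. For example, an electronic device can convert a segment of speech into text through automatic speech recognition technology and display the text corresponding to the speech.
At present, when performing speech recognition, a language model can be added to the speech recognition model. The language model can predict the next segment of text following the text associated with the current speech, and can thereby assist the speech recognition model in recognizing the next segment of speech. However, the training samples of the speech recognition model and the language model contain few long-tail words (that is, words with a low frequency of use), so both models recognize long-tail words with low accuracy, which lowers the accuracy of speech recognition.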
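Summary
The present disclosure provides a speech recognition method, apparatus, and electronic device, which are used to solve the technical problem of low speech recognition accuracy in the prior art.
In a first aspect, the present disclosure provides a speech recognition method, including:
obtaining a first speech;
obtaining a first text corresponding to the previous segment of speech of the first speech;
obtaining a first set, where the first set includes a plurality of text identifiers and a text feature corresponding to each of the plurality of text identifiers, the text feature is a feature associated with multiple subsequent texts of the text corresponding to the text identifier, the text feature is associated with the frequencies of the multiple subsequent texts of the text in a text set, and the first set is determined based on the text set;
determining, based on the first text and the first set, text content associated with the first speech.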
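In a second aspect, the present disclosure provides a speech recognition apparatus, including a first acquisition module, a second acquisition module, a third acquisition module, and a determination module, where:
the first acquisition module is used to obtain a first speech;
the second acquisition module is used to obtain a first text corresponding to the previous segment of speech of the first speech;
the third acquisition module is used to obtain a first set, where the first set includes a plurality of text identifiers and a text feature corresponding to each of the plurality of text identifiers, the text feature is a feature associated with multiple subsequent texts of the text corresponding to the text identifier, the text feature is associated with the frequencies of the multiple subsequent texts of the text in a text set, and the first set is determined based on the text set;
the determination module is used to determine, based on the first text and the first set, text content associated with the first speech.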
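In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor and a memory;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the speech recognition method according to the first aspect and its various possible implementations.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions; when a processor executes the computer-executable instructions, the speech recognition method according to the first aspect and its various possible implementations is implemented.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product, including a computer program; when the computer program is executed by a processor, the speech recognition method according to the first aspect and its various possible implementations is implemented.
The present disclosure provides a speech recognition method, apparatus, and electronic device. The electronic device can obtain a first speech and a first text corresponding to the previous segment of speech of the first speech, and can obtain a first set, where the first set includes a plurality of text identifiers and a text feature corresponding to each of the plurality of text identifiers, the text feature is a feature associated with multiple subsequent texts of the text corresponding to the text identifier, the text feature is associated with the frequencies of the multiple subsequent texts of the text in a text set, and the first set is determined based on the text set; based on the first text and the first set, the text content associated with the first speech is determined. In the above method, because the text features in the first set are associated with the frequencies of the multiple subsequent texts of a text in the text set, the text features in the first set can fuse the features of more low-frequency words, and the electronic device can obtain more contextual information through the first set and thereby recognize the first speech accurately, improving the accuracy of speech recognition.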
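Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 3A is a schematic diagram of a process for obtaining a first text provided by an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of another process for obtaining a first text provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a process for obtaining a first text feature provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a process for determining a second text provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a method for obtaining a first set provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a process for determining an initial set provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a process for updating a first text feature provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the process of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of the structure of a speech recognition apparatus provided by an embodiment of the present disclosure; and
FIG. 11 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.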
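Detailed Description
Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
For ease of understanding, the concepts involved in the embodiments of the present disclosure are explained below.
Electronic device: a device with wireless transceiver functions. An electronic device can be deployed on land, including indoors or outdoors, and may be handheld, wearable, or vehicle-mounted. The electronic device may be a mobile phone, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) electronic device, an augmented reality (AR) electronic device, a wireless terminal in industrial control, a vehicle-mounted electronic device, a wireless terminal in self driving, a wireless electronic device in remote medical, a wireless electronic device in a smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, a wireless electronic device in a smart home, a wearable electronic device, and so on. The electronic device involved in the embodiments of the present disclosure may also be referred to as a terminal, user equipment (UE), an access electronic device, a vehicle-mounted terminal, an industrial control terminal, a UE unit, a UE station, a mobile station, a remote station, a remote electronic device, a mobile device, a UE electronic device, a wireless communication device, a UE agent, or a UE apparatus. The electronic device may be fixed or mobile.
In the related art, when performing speech recognition, a language model can be added to the speech recognition model. The language model can predict the next segment of text following the text associated with the current speech and thereby assist the speech recognition model in recognizing the next segment of speech. For example, when the speech recognition model obtains a text, the language model can predict the next text after that text, and the speech recognition model predicts the next segment of speech from the collected next segment of speech and the next text predicted by the language model.
However, the samples in the training sets of the speech recognition model and the language model are usually obtained from the network, and training sets obtained from the network contain few long-tail words (words with a low frequency of use). The speech recognition model and the language model therefore cannot learn much long-tail word information during training, so both recognize long-tail words with low accuracy, which lowers the accuracy of speech recognition.
To solve the above technical problem, an embodiment of the present disclosure provides a speech recognition method. The electronic device obtains a first speech and obtains a first text corresponding to the previous segment of speech of the first speech. The electronic device can obtain a first set, where the first set includes a plurality of text identifiers and a text feature corresponding to each of the plurality of text identifiers; the text feature is a feature associated with the subsequent texts corresponding to the text identifier, and the lower the frequency of a subsequent text in the text set, the more of that subsequent text's feature is fused into the text feature. The electronic device can determine, based on the first text and the first set, a second text that is the next segment after the first text, and determine, based on the second text and the first speech, the text content associated with the first speech. In this way, because the text features in the first set are associated with the frequencies of the multiple subsequent texts of a text in the text set, the first set can include more long-tail word information, and the electronic device can obtain more contextual information about the first text through the first text feature, which improves the prediction accuracy of the next segment of text (that is, the predicted text of the first speech); the first speech can then be recognized accurately in combination with the next segment of text, improving the accuracy of speech recognition.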
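The application scenario of the embodiments of the present disclosure is described below with reference to FIG. 1.
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure. Referring to FIG. 1, the scenario includes a language model, a speech recognition model, and a first set. The first text of the previous segment of speech of the first speech is input into the language model. Based on the identifier of the first text, the language model can obtain the contextual information associated with the first text from the first set, and then predict the next segment of text after the first text based on the first text and that contextual information. When the speech recognition model obtains the first speech, it can determine the text content of the first speech based on the next segment of text and the first speech. In this way, when the next segment of text is predicted from the first text, the text prediction model can obtain the contextual information associated with the first text from the first set, so the text prediction model can accurately predict the next segment of text corresponding to the previous segment of speech of the first speech and thereby assist the speech recognition model in accurately recognizing the first speech, improving the accuracy of speech recognition.
It should be noted that FIG. 1 merely illustrates one application scenario of the embodiments of the present disclosure by way of example and does not limit the application scenarios of the embodiments of the present disclosure.
The technical solutions of the present disclosure, and how they solve the above technical problem, are described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present disclosure are described below with reference to the drawings.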
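FIG. 2 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure. Referring to FIG. 2, the method may include:
S201. Obtain a first speech.
The execution subject of the embodiments of the present disclosure may be an electronic device or a speech recognition apparatus provided in an electronic device. The speech recognition apparatus can be implemented in software or in a combination of software and hardware, which is not limited in the embodiments of the present disclosure.
Optionally, the first speech may be any speech obtained by the electronic device. For example, the first speech may be real-time speech of a user collected by the electronic device, or stored speech obtained from the memory of the electronic device; the electronic device may also obtain the first speech through another electronic device, which is not limited in the embodiments of the present disclosure. For example, the electronic device can receive speech sent by another electronic device and determine that speech as the first speech.
It should be noted that the first speech may be speech of any length. For example, the first speech may include 1 sound, 2 sounds, or 3 sounds, or may be 1 second, 2 seconds, or 3 seconds of speech, which is not limited in the embodiments of the present disclosure.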
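S202. Acquire a first text corresponding to the speech segment preceding the first speech.
Optionally, the first text may consist of text characters. For example, the first text may be a text of 1 character or a text of 2 characters, which is not limited in the embodiments of the present disclosure. Optionally, the speech segment preceding the first speech may contain one or more Chinese characters or words, which is not limited in the embodiments of the present disclosure.
Optionally, upon acquiring the first speech, the electronic device may obtain the speech segment preceding the first speech and recognize it using automatic speech recognition (ASR) to obtain the first text. For example, if the speech segment preceding the first speech is the speech "你好" ("hello"), the electronic device may convert it using ASR to obtain the text "你好".
Optionally, the electronic device may determine input text as the first text. For example, if the user inputs the text "你" ("you"), the electronic device may determine "你" as the first text; if the user inputs "你好", the electronic device may determine "你好" as the first text; if the user inputs "今天天气真好" ("the weather is really nice today"), the electronic device may determine "今天天气真好" as the first text.
It should be noted that when the text contains many characters, the electronic device may determine its last M characters as the first text (M is at least 1 and less than the number of characters in the text), which can improve the accuracy of text prediction. For example, if the text input by the user is "今天天气真好", the electronic device may determine "好" (M = 1) or "真好" (M = 2) as the first text, which is not limited in the embodiments of the present disclosure.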
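The process of acquiring the first text is described below with reference to FIG. 3A and FIG. 3B.
FIG. 3A is a schematic diagram of a process of acquiring the first text provided by an embodiment of the present disclosure. Referring to FIG. 3A, an electronic device is shown. The display page of the electronic device is a chat page between user B (the user of the electronic device) and user A. User A sends the text "今天天气真好", user B replies with the text "是的" ("yes"), and user B then types the text "我们去" ("let's go") on the keyboard. The electronic device determines that the first text is "我们去".
FIG. 3B is a schematic diagram of another process of acquiring the first text provided by an embodiment of the present disclosure. Referring to FIG. 3B, an electronic device and a user are shown. The previous speech segment sent by the user to the electronic device is "今天天气真好". After receiving the speech, the electronic device may convert it into the text "今天天气真好" using ASR and determine the last 4 characters of that text, "天气真好", as the first text. In this way, the electronic device can obtain the first text from the user's speech, which improves the flexibility and efficiency of text acquisition.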
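The truncation used in FIG. 3B can be sketched as follows. This is a minimal illustration rather than the disclosed implementation; the function name `select_first_text` and the choice to keep short texts whole are assumptions made for the example.

```python
def select_first_text(recognized_text: str, m: int) -> str:
    """Return the last m characters of the recognized text as the first text.

    Hypothetical helper: the disclosure only states that the last M
    characters may be used; keeping short texts whole is an assumption.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if len(recognized_text) <= m:
        return recognized_text
    return recognized_text[-m:]


# FIG. 3B example: ASR output is a 6-character text, M = 4.
print(select_first_text("今天天气真好", 4))  # 天气真好
```

A larger M gives the language model more history to condition on, while the first-set lookup described later would still be keyed on an identifier derived from this text.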
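S203. Acquire a first set.
Optionally, the first set includes multiple text identifiers and a text feature corresponding to each of the multiple text identifiers. For example, the first set may include multiple correspondences, each of which contains one text identifier and one text feature. For example, the first set may be a dictionary that contains multiple key-value pairs.
Optionally, the text identifier may be the key of a text. For example, the text identifier may be the key associated with the characters of the text: if the text contains 1 character, its text identifier is the key corresponding to that character; if the text contains 2 characters, its text identifier may be determined based on the 2 characters. For example, the electronic device may obtain multiple texts from the text collection, determine the text identifier of each text, and add the text identifiers to the first set while the first set is being constructed.
Optionally, for any text in the first set, the electronic device may determine the text identifier of that text in the following feasible way: determine the character identifier corresponding to the text and the number of texts in the first set. Optionally, the character identifier may be the ordinal of the text in the first set, and the number of texts may be the total number of texts in the first set. For example, if the first set contains 10000 texts, the number of texts is 10000. If, among these 10000 texts, text A has ordinal 2000 and text B has ordinal 3000, then the character identifier of text A is 2000 and the character identifier of text B is 3000.
Optionally, the text identifier is determined based on the character identifier and the number of texts. For example, if the character identifier of a text is 2000 and the number of texts is 10000, the text identifier of that text is 2000 mod 10000 (that is, 2000), where mod is the modulo operator. The text identifiers of multiple texts can be determined in this way.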
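The modulo rule above can be illustrated with a short sketch. The function name `text_identifier` is hypothetical; the only part taken from the text is that the identifier is the character identifier modulo the number of texts.

```python
def text_identifier(char_id: int, num_texts: int) -> int:
    """Map a character identifier (ordinal) to a key in [0, num_texts)."""
    if num_texts <= 0:
        raise ValueError("num_texts must be positive")
    return char_id % num_texts


# Example from the text: ordinal 2000 with 10000 texts.
print(text_identifier(2000, 10000))   # 2000
# An ordinal beyond the table size wraps around onto an existing key.
print(text_identifier(12000, 10000))  # 2000
```

The second call shows a consequence of the modulo: two different ordinals can share one key, so the number of texts bounds the size of the set regardless of how large the ordinals grow.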
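Optionally, the text feature is a feature associated with the various subsequent texts of the text that the text identifier corresponds to. For example, when constructing the first set, the electronic device also needs to obtain the text features of the subsequent texts associated with each text. For any text, its subsequent texts are all texts that follow it in the text collection. For example, for the text "我" ("I"), if the text collection contains the sentences "我们一起去郊游" ("let's go on an outing together") and "我吃过了" ("I have eaten"), the subsequent texts of "我" are the text "们" and the text "吃". The electronic device may obtain the character feature of "们" and the character feature of "吃", and determine the text feature of "我" based on these two character features.
It should be noted that the first set is determined based on the text collection. The text collection may be a training sample set or any set containing multiple texts, which is not limited in the embodiments of the present disclosure.
It should be noted that the first set contains the text identifier of the first text. Therefore, when predicting the next text segment of the first text, the electronic device can obtain information about the subsequent texts associated with the first text from the first set, which assists the prediction and improves its accuracy.
Optionally, the text feature is associated with the frequencies of the various subsequent texts in the text collection. For example, if a subsequent text has a high frequency in the text collection, the electronic device may reduce the proportion of that subsequent text's feature in the text feature; if a subsequent text has a low frequency, the electronic device may increase the proportion of its feature. In this way, every text feature in the first set can incorporate more low-frequency words, which improves the accuracy of text prediction.
It should be noted that the frequency of a subsequent text in the text collection is the frequency with which the combination of the text and the subsequent text appears in the text collection. For example, one subsequent text of "我" is "们". If the phrase "我们" ("we") appears 1000 times in the text collection, the frequency of the subsequent text "们" is 1000. If the text "他们" ("they") also appears in the text collection, that phrase contains "们" as well, but this "们" is unrelated to "我" and does not affect the frequency of "们" as a subsequent text of "我".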
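The frequency definition above counts the pair (text, subsequent text), not the subsequent text alone. A minimal sketch, assuming character-level texts and a hypothetical helper name `successor_counts`:

```python
from collections import Counter


def successor_counts(sentences):
    """Count how often each (character, next character) pair occurs.

    Character-level pairs are an assumption of this sketch; the
    disclosure also allows multi-character texts.
    """
    counts = Counter()
    for sentence in sentences:
        for current, following in zip(sentence, sentence[1:]):
            counts[(current, following)] += 1
    return counts


corpus = ["我们一起去郊游", "我吃过了", "他们来了"]
counts = successor_counts(corpus)
print(counts[("我", "们")])  # 1: "他们" does not contribute to this pair
print(counts[("他", "们")])  # 1
print(counts[("我", "吃")])  # 1
```

Keeping the count per pair is what lets the later update step treat "们" after "我" and "们" after "他" as separate events.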
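S204. Determine the text content associated with the first speech based on the first text and the first set.
Optionally, the electronic device may determine the text content associated with the first speech in the following feasible way: determine, based on the first text and the first set, a second text that follows the first text; and determine, based on the second text and the first speech, the text content associated with the first speech.
Optionally, the second text may be the next text segment of the first text. For example, given the text "我们", if the first text is "我", the second text is "们". In the embodiments of the present disclosure, the electronic device may predict, based on the text "我", that the next text is "们".
It should be noted that the first text is the text corresponding to the speech segment preceding the first speech, and the second text is the next text segment predicted by the electronic device from the first text. The second text is therefore associated with the first speech.
Optionally, the electronic device determines the second text that follows the first text, based on the first text and the first set, as follows: obtain a first identifier of the first text; based on the first identifier, obtain from the first set a first text feature associated with the various subsequent texts of the first text; and determine the second text based on the first text and the first text feature. It should be noted that the method of obtaining the first identifier of the first text is the same as in step S203 and is not repeated here.
Optionally, the first text feature may be the feature corresponding to the various subsequent texts of the first text. For example, because the first set may contain the identifier of the first text, the first set holds the text feature corresponding to the various subsequent texts of the first text.
Optionally, obtaining the first text feature from the first set based on the first identifier is specifically: determine, among the multiple text identifiers in the first set, a target identifier that is the same as the first identifier, and determine the text feature corresponding to the target identifier as the first text feature. For example, the first set may contain multiple index entries (an index entry may be a key-value pair, that is, a text identifier in the first set together with its text feature), each of which indicates the correspondence between one key and one text feature. After obtaining the key of the first text, the electronic device obtains the text feature corresponding to that key from the first set and determines it as the first text feature. For example, the first set contains the index entries key A - text feature A and key B - text feature B. If the key of the first text obtained by the electronic device is key A, the first text feature is text feature A; if it is key B, the first text feature is text feature B.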
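The process of obtaining the first text feature is described below with reference to FIG. 4.
FIG. 4 is a schematic diagram of a process of obtaining the first text feature provided by an embodiment of the present disclosure. Referring to FIG. 4, a first set is shown. The first set contains n keys and the text feature corresponding to each key. For example, the first set may contain key 1 - text feature 1, key 2 - text feature 2, ..., key n - text feature n, where each text feature is a feature associated with the various subsequent texts of the text corresponding to the key.
Referring to FIG. 4, if the key of the first text obtained by the electronic device is key 2, the electronic device may obtain text feature 2 corresponding to key 2 from the first set and determine text feature 2 as the first text feature associated with the various subsequent texts of the first text. In this way, the electronic device can accurately obtain the context information of the first text from the first set, which improves the prediction accuracy of the next text segment of the first text.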
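The lookup in FIG. 4 amounts to a dictionary access by key. The sketch below assumes the first set is held as a Python `dict` from integer keys to lists of vectors; the function name and the `None` result for a missing key are illustrative choices, not part of the disclosure.

```python
def lookup_first_text_feature(first_set, first_id):
    """Return the text feature whose key equals first_id, or None if absent."""
    return first_set.get(first_id)


# Toy first set: each value holds the rows of one text feature.
first_set = {
    1: [[0.1, 0.2]],
    2: [[0.3, 0.4], [0.5, 0.6]],
}
print(lookup_first_text_feature(first_set, 2))  # [[0.3, 0.4], [0.5, 0.6]]
```

Because the lookup is by exact key match, its cost does not depend on how many subsequent texts were folded into the stored feature.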
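Optionally, the electronic device determines the second text based on the first text and the first text feature as follows. First, determine the context feature associated with the first text based on the first text. For example, the electronic device may obtain the context feature associated with the first text using a feature extraction model. For example, if the sentence containing the first text includes character A, character B, and character C, and the first text is character C, the electronic device may use the feature extraction model to obtain the context feature associated with character C, which contains information about character A, character B, and character C.
Then, determine the second text based on the context feature and the first text feature. Optionally, the electronic device may determine the second text in the following feasible way: fuse the first text feature and the context feature to obtain a fused feature, and determine the second text based on the fused feature. For example, the electronic device may fuse the first text feature and the context feature through an attention mechanism to obtain the fused feature, and then process the fused feature with a text prediction module (which may be a convolutional layer of an existing text prediction model, for example the Output layer of the text prediction model) to obtain the second text. The fused feature contains not only the context feature of the first text but also the context information stored in the first set, and low-frequency words account for a relatively high proportion of the context information in the first set. The electronic device can therefore obtain an accurate second text, improving the accuracy of text prediction.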
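The process of determining the second text is described below with reference to FIG. 5.
FIG. 5 is a schematic diagram of a process of determining the second text provided by an embodiment of the present disclosure. Referring to FIG. 5, character A, character B, and character C are shown; they are 3 characters of one sentence. Character A, character B, and character C are sampled to obtain sampling information, and the context feature corresponding to character C is obtained from the sampling information of character A, character B, and character C.
Referring to FIG. 5, the context feature corresponding to character C and the first text feature corresponding to character C (which can be obtained from the first set) are processed through an attention mechanism to obtain the fused feature corresponding to character C. The fused feature is processed by the text prediction module to obtain the character that follows character C. In this way, when performing text prediction, the electronic device obtains not only the context feature containing the context information of character C, but also information about multiple subsequent characters of character C from the first set, which improves the accuracy of text prediction.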
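The text does not specify which attention variant is used for the fusion, so the following is only a sketch under stated assumptions: the context feature acts as the query, the rows of the first text feature act as keys and values, scaled dot-product attention produces a weighted sum of the rows, and that sum is added to the context feature. The function name `attention_fuse` and the residual addition are hypothetical.

```python
import math


def attention_fuse(context, memory_rows):
    """Fuse a context vector with the rows of a first text feature.

    Returns (fused_vector, attention_weights). Scaled dot-product
    attention with a residual add is an assumption of this sketch.
    """
    d = len(context)
    # Similarity between the context (query) and each stored row (key).
    scores = [sum(c * r for c, r in zip(context, row)) / math.sqrt(d)
              for row in memory_rows]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the rows (values), then residual add to the context.
    attended = [sum(w * row[j] for w, row in zip(weights, memory_rows))
                for j in range(d)]
    fused = [c + a for c, a in zip(context, attended)]
    return fused, weights


fused, weights = attention_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights)  # the row aligned with the context receives the larger weight
print(fused)
```

A text prediction module would then map `fused` to a distribution over candidate next characters; that module is outside the scope of this sketch.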
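Optionally, the electronic device determines the text content associated with the first speech based on the second text and the first speech as follows. First, perform text recognition on the first speech to obtain a third text. For example, the electronic device may recognize the first speech using ASR to obtain the third text.
Then, determine the text content associated with the first speech based on the second text and the third text. For example, if the second text and the third text are the same, the electronic device may determine the second text or the third text as the text content associated with the first speech. If the second text and the third text differ, the electronic device may determine the text content according to the distribution probability of the second text and the distribution probability of the third text. For example, if the distribution probability of the text recognized by ASR is 70% and the distribution probability of the text predicted by the method of the embodiments of the present disclosure is 30%, the electronic device may randomly select the second text or the third text, according to these probabilities, as the text content associated with the first speech.
It should be noted that the embodiments of the present disclosure may also determine the text content of the first speech from the second text and the third text in other ways (for example, by correcting the third text based on the second text), which is not limited in the embodiments of the present disclosure. Because the second text is predicted from richer long-tail word information, it can assist the speech recognition model in recognizing the first speech and thereby improve the accuracy of speech recognition.
It should be noted that the second text in the embodiments of the present disclosure may also include multiple pieces of text information, and the text content of the first speech is determined from these pieces of text information and the third text. For example, the second text may include multiple subsequent texts of the first text, and the text content of the first speech is determined from these subsequent texts and the third text. For example, the first text of the speech segment preceding the first speech is "我们", the third text of the first speech is "去" ("go"), and the second text following the first text includes the texts "去", "吃" ("eat"), and "走" ("walk"). Based on the multiple pieces of text information in the second text, the electronic device may determine that the text content of the first speech is "去". In this way, when the text content of the first speech is a long-tail word, the text recognized by ASR can still be corrected using the second text even if ASR cannot recognize the first speech accurately, which improves the accuracy of recognition.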
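One possible reading of the two cases above can be sketched as follows. The function `decide_text_content`, its argument layout, and the rule "accept the ASR text when it is among the predicted candidates" are assumptions for illustration; the text itself leaves the exact combination open.

```python
import random


def decide_text_content(second_candidates, third_text, p_third, rng=None):
    """Pick the text content of the first speech.

    second_candidates: dict mapping each predicted next text to its
    probability (hypothetical representation of the second text).
    third_text: the ASR result for the first speech.
    p_third: the probability assigned to the ASR result.
    """
    rng = rng or random.Random()
    # Agreement: the ASR result is one of the predicted next texts.
    if third_text in second_candidates:
        return third_text
    # Disagreement: sample between the ASR text and the best prediction,
    # weighted by their probabilities (e.g. 70% vs 30%).
    best_second = max(second_candidates, key=second_candidates.get)
    return rng.choices(
        [third_text, best_second],
        weights=[p_third, second_candidates[best_second]],
        k=1,
    )[0]


candidates = {"去": 0.6, "吃": 0.3, "走": 0.1}
print(decide_text_content(candidates, "去", 0.7))  # 去
```

Sampling makes the disagreement case non-deterministic; a production system might instead rescore hypotheses, which the text mentions only as "other ways".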
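An embodiment of the present disclosure provides a speech recognition method. The electronic device acquires a first speech and a first text corresponding to the speech segment preceding the first speech. The electronic device may acquire a first set and determine, based on the first text and the first set, a second text that follows the first text. The electronic device then determines the text content associated with the first speech based on the second text and the first speech. Because the text features in the first set are associated with the frequencies of the various subsequent texts in the text collection, the first set can carry more long-tail word information. The electronic device can obtain richer context information for the first text through the first text feature, which improves the prediction accuracy of the next text segment; the first speech can then be recognized accurately with the help of that text segment, improving the accuracy of speech recognition.
On the basis of the embodiment shown in FIG. 2, the above speech recognition method further includes a method of obtaining the first set, which is described below with reference to FIG. 6.
FIG. 6 is a schematic diagram of a method of obtaining the first set provided by an embodiment of the present disclosure. Referring to FIG. 6, the method includes: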
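S601. Obtain the sample identifiers of multiple sample texts in the text collection and the sample text features corresponding to the subsequent texts of the sample texts.
Optionally, before constructing the first set, the electronic device needs to determine the size of the first set. For example, the electronic device may construct the first set as a matrix of size U × M × d_emb. U may be the size of the set; for example, if U is 10000, the first set can store the text features of 10000 texts. M may be a preset hyperparameter; for example, M may indicate the number of vectors allocated to each entry of the set. An entry stores the text features of subsequent characters, so the larger M is, the more subsequent-text features can be stored. d_emb may be the dimension of a text feature.
Optionally, the text collection may contain multiple sample texts, and the electronic device may obtain the sample identifiers of the sample texts and the sample text features corresponding to their subsequent texts. Optionally, the sample identifier may be the key of the sample text. For example, if the sample text is character A, the sample identifier may be the character ordinal of character A modulo the number of texts in the text collection. The method of obtaining the key has been described in detail in step S203 and is not repeated here.
Optionally, the sample text feature may be the feature corresponding to the subsequent text of the sample text. For example, for the sentence "我们", if the sample text is "我", the subsequent text of the sample text is "们", and the sample text feature is the feature (embedding) of "们".
Optionally, for any sample text, the electronic device may obtain the features of all subsequent texts of that sample text in the text collection to obtain the sample text feature. For example, for the text "我", the electronic device may obtain its subsequent texts from the text collection. If the text collection contains the texts "我们", "我吃", "我去", and "我来", the electronic device determines that the subsequent texts of "我" include "们", "吃", "去", and "来", and may obtain the feature of each of these subsequent texts.
Optionally, in practice, the multiple sample texts in the text collection are used to train the text prediction model, so the feature of each subsequent text can be obtained from the convolutional layer of the text prediction model. For example, when the text prediction model is trained on the sample texts in the text collection, the parameters of the feature extraction network include the embedding of each text, and the electronic device may obtain the embedding of each subsequent text from those parameters. This improves the efficiency of feature acquisition.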
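S602. Determine an initial set based on the sample identifiers and the sample text features.
Optionally, the initial set includes multiple sample identifiers and the sample text feature corresponding to each sample identifier. For example, after obtaining the sample identifier of each sample text and the sample text features of its various subsequent texts, the electronic device may build the index of the set; the initial set is obtained once the index is built. For example, the electronic device may obtain the key of each sample text and associate that key with the features of the sample text's various subsequent texts to obtain key-value pairs. Each key is thus associated with an M × d_emb matrix that stores the subsequent-text information of the text corresponding to the key.
The process of determining the initial set is described below with reference to FIG. 7.
FIG. 7 is a schematic diagram of a process of determining the initial set provided by an embodiment of the present disclosure. The embodiment shown in FIG. 7 takes character 1 as an example to describe how the content associated with character 1 in the initial set is obtained. Referring to FIG. 7, a text collection is shown. The text collection contains character 1, character 2, character 3, ..., character n, where character 2, character 3, ..., character n are subsequent texts of character 1.
Referring to FIG. 7, if the key of character 1 is key A, the electronic device may construct n − 1 key-value pairs for character 1: key A - feature of character 2, key A - feature of character 3, ..., key A - feature of character n. The initial set is thereby obtained, in which the matrix corresponding to key A contains the feature of character 2, the feature of character 3, ..., and the feature of character n.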
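The construction in FIG. 7 can be sketched as below. Several details are assumptions made only for the example: keys are derived as `ord(character) % u`, unused rows are zero vectors, at most `m` distinct subsequent characters are stored per key, and `embedding` is a hypothetical lookup table from character to vector standing in for the embeddings taken from the trained model.

```python
def build_initial_set(sentences, embedding, u, m, d_emb):
    """Build {key: M x d_emb rows} holding successor embeddings per key."""
    initial_set = {}
    stored = {}  # key -> successors already placed in that key's rows
    for sentence in sentences:
        for current, following in zip(sentence, sentence[1:]):
            key = ord(current) % u
            rows = initial_set.setdefault(
                key, [[0.0] * d_emb for _ in range(m)])
            successors = stored.setdefault(key, [])
            # Store each distinct successor once, while rows remain.
            if following not in successors and len(successors) < m:
                rows[len(successors)] = list(embedding[following])
                successors.append(following)
    return initial_set


embedding = {"们": [1.0, 0.0], "吃": [0.0, 1.0]}
initial = build_initial_set(["我们", "我吃"], embedding, u=10000, m=2, d_emb=2)
print(initial[ord("我") % 10000])  # [[1.0, 0.0], [0.0, 1.0]]
```

With `m` fixed in advance, the whole set occupies at most U × M × d_emb numbers, matching the matrix size discussed in step S601.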
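S603. Update the multiple sample text features in the initial set based on the multiple sample texts to obtain the first set.
Optionally, after obtaining the initial set, the electronic device may update the sample text features in the initial set so that they contain more low-frequency word information, which improves the accuracy of text prediction.
Optionally, for any first sample text among the multiple sample texts, the sample text features in the initial set are updated based on the first sample text as follows. First, obtain the first sample identifier of the first sample text and a sample subsequent text of the first sample text. For example, in practice, the sentences in the text collection may contain multiple sample subsequent texts of the first sample text, and every sentence in the text collection is used to update the initial set. The electronic device may therefore take any sample subsequent text of the first sample text, or take the text that follows the first sample text in the sentence currently used for the update, as the sample subsequent text, which is not limited in the embodiments of the present disclosure.
Optionally, determine the first sample text feature in the initial set based on the first sample identifier. For example, referring to FIG. 7, if the first sample text is character 1 and its first sample identifier is key A, the first sample text feature is the matrix corresponding to key A; that is, the first sample text feature contains the feature of character 2, the feature of character 3, ..., and the feature of character n.
Optionally, obtain the first frequency with which the combination of the first sample text and the sample subsequent text appears in the text collection, and the subsequent text feature corresponding to the sample subsequent text. For example, referring to FIG. 7, if the first sample text is character 1 and the sample subsequent text is character 2, the electronic device may determine the feature of character 2 as the subsequent text feature and count how many times the combination character 1 - character 2 appears in the text collection. If it appears 100 times, the electronic device determines that the first frequency is 100.
Optionally, update the first sample text feature based on the first frequency and the subsequent text feature. Optionally, the electronic device may update the first sample text feature in the following feasible way: determine the update ratio of the first sample text feature based on the first frequency, and randomly update the vectors in the first sample text feature based on the update ratio and the subsequent text feature.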
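Optionally, the electronic device may randomly update the vectors in the first sample text feature based on the following formula (the formula itself did not survive extraction; the form below is reconstructed from the symbol definitions and from the worked example that follows, and the symbol $D$ is a placeholder name):
$$D_{i,m}^{k+1} = X_{k+1}\left(\alpha\, D_{i,m}^{k} + (1-\alpha)\, E_{k+1}\right) + \left(1 - X_{k+1}\right) D_{i,m}^{k}$$
where $X_{k+1}$ follows a Bernoulli distribution with parameter $P_{k+1}$; $D_{i,m}^{k}$ and $D_{i,m}^{k+1}$ are both $1 \times d_{emb}$ vectors, with $D_{i,m}^{k}$ denoting the $m$-th row of the initial dictionary indexed by the current key $i$ and $D_{i,m}^{k+1}$ denoting the vector after the update; $E_{k+1}$ denotes the subsequent text feature; and $\alpha$ denotes the smoothness of the update.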
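For example, if the first frequency is 100, then log 100 is 2 (base 10) and P_{k+1} is 0.5, so the electronic device may randomly update 50% of the vectors in the first sample text feature based on the subsequent text feature. For example, if the first sample text feature contains vector A and vector B and P_{k+1} is 0.5, the electronic device may update either vector A or vector B based on the subsequent text feature. When vector A is updated, α lies between 0 and 1: if α is 1, vector A is not changed; if α is 0, vector A is replaced by the subsequent text feature. In practice α may be 0.5, so that updating vector A superimposes part of the information in vector A with part of the information in the subsequent text feature. This improves the effect of feature fusion and, in turn, the accuracy of text prediction.
Optionally, the first set is obtained after the initial set has been updated with every sample text in the text collection. For example, if the text collection contains 10000 texts, the first set corresponding to the initial set is obtained after the initial set has been updated 10000 times using the 10000 texts.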
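The update step can be sketched as follows. Only two facts are taken from the text: a first frequency of 100 gives an update ratio of 0.5, and α = 1 leaves a vector unchanged while α = 0 replaces it with the subsequent text feature. Everything else is an assumption of this sketch: the ratio is computed as 1 / log10(frequency), frequencies of 10 or less are clamped to a ratio of 1.0, and each row is updated by an independent Bernoulli draw. The names `update_ratio` and `update_feature` are hypothetical.

```python
import math
import random


def update_ratio(first_frequency: int) -> float:
    """Rarer pairs get a larger ratio; 100 occurrences -> 0.5."""
    if first_frequency <= 10:
        return 1.0  # assumption: clamp so the ratio never exceeds 1
    return 1.0 / math.log10(first_frequency)


def update_feature(rows, successor_feature, first_frequency,
                   alpha=0.5, rng=None):
    """Randomly blend the successor feature into the stored rows."""
    rng = rng or random.Random()
    p = update_ratio(first_frequency)
    updated = []
    for row in rows:
        if rng.random() < p:  # Bernoulli(p) draw for this row
            updated.append([alpha * v + (1.0 - alpha) * e
                            for v, e in zip(row, successor_feature)])
        else:
            updated.append(list(row))
    return updated


print(update_ratio(100))  # 0.5
rows = [[1.0, 2.0], [3.0, 4.0]]
print(update_feature(rows, [0.0, 0.0], 100, alpha=0.5, rng=random.Random(0)))
```

Because the ratio shrinks as the frequency grows, a frequent pair such as "我们" overwrites few rows per occurrence, while a rare pair overwrites more, which is how low-frequency subsequent texts end up better represented in the stored feature.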
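The process of updating the first sample text feature is described below with reference to FIG. 8.
FIG. 8 is a schematic diagram of a process of updating the first sample text feature provided by an embodiment of the present disclosure. Referring to FIG. 8, the first sample text feature associated with character 1 and a sentence from the text collection are shown. The first sample text feature contains the feature of character 2 and the feature of character 3 (meaning that, in the text collection, the only subsequent characters of character 1 are character 2 and character 3). The sentence contains character 1 (the first sample text) and character 2 (the sample subsequent text).
Referring to FIG. 8, the electronic device determines that character 2 appears after character 1 100 times, so the first frequency of character 2 is 100 and the update ratio of the first sample text feature is 50%. The electronic device uses the feature of character 2 (the subsequent text feature) to update either the feature of character 2 or the feature of character 3 within the first sample text feature. In this way, when a sample subsequent text has a low frequency, the electronic device updates a larger proportion of the first sample text feature, so that the first sample text feature contains more information about that sample subsequent text, which improves the accuracy of text prediction.
An embodiment of the present disclosure provides a method of obtaining the first set: obtain the sample identifiers of multiple sample texts in the text collection and the sample text features corresponding to the subsequent texts of the sample texts; determine an initial set based on the sample identifiers and the sample text features, where the initial set includes multiple sample identifiers and the sample text feature corresponding to each sample identifier; and update the sample text features in the initial set based on the multiple sample texts to obtain the first set. The text features in the first set can thus contain more low-frequency word information. When predicting the next text segment of a text, the electronic device can obtain context information not only from the text itself but also, in greater quantity, from the first set, which improves the accuracy of text prediction and of speech recognition.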
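The process of the above speech recognition method is described below with reference to FIG. 9.
FIG. 9 is a schematic diagram of a process of a speech recognition method provided by an embodiment of the present disclosure. Referring to FIG. 9, an electronic device, a user, and a first set are shown. The first set contains the key-value pairs key 1 - text feature 1, key 2 - text feature 2, ..., key n - text feature n. When the user inputs speech A "你好" and speech B "美" ("beautiful") to the electronic device, where speech A is the speech segment preceding speech B, the electronic device may convert speech A into the text "你好" using ASR and determine the last character of that text as the first text. The electronic device thus determines that the first text is "好".
Referring to FIG. 9, the electronic device obtains key 1 as the key of the text "好". Based on key 1, the electronic device may obtain text feature 1 corresponding to key 1 from the first set and determine text feature 1 as the first text feature of the text "好" (this text feature may contain the features of the various subsequent texts of "好").
Referring to FIG. 9, the electronic device samples the text "你" and the text "好", and obtains the context feature associated with the text "好" from the sampling information of "你" and the sampling information of "好". The electronic device processes the context feature and text feature 1 through an attention mechanism to obtain the fused feature.
Referring to FIG. 9, the electronic device may perform text prediction based on the fused feature to obtain "美" as the character that follows the text "好". If the text recognized from speech B, the speech segment following speech A "你好", is also "美", the electronic device may determine that the text content of speech B is "美". In this way, when performing text prediction, the electronic device can obtain the context information of the first text not only from the sentence but also, in greater quantity, from the first set. It can therefore accurately predict the next text segment of the first text, improving the accuracy of text prediction, and then accurately recognize the next speech segment with the help of that text segment, improving the accuracy of speech recognition.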
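FIG. 10 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present disclosure. Referring to FIG. 10, the speech recognition apparatus 100 includes a first acquisition module 101, a second acquisition module 102, a third acquisition module 103, and a determination module 104, where:
the first acquisition module 101 is configured to acquire a first speech;
the second acquisition module 102 is configured to acquire a first text corresponding to the speech segment preceding the first speech;
the third acquisition module 103 is configured to acquire a first set, where the first set includes multiple text identifiers and a text feature corresponding to each of the multiple text identifiers, the text feature is a feature associated with the various subsequent texts of the text corresponding to the text identifier, the text feature is associated with the frequencies of the various subsequent texts of the text in a text collection, and the first set is determined based on the text collection;
the determination module 104 is configured to determine the text content associated with the first speech based on the first text and the first set.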
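According to one or more embodiments of the present disclosure, the determination module 104 is specifically configured to:
determine, based on the first text and the first set, a second text that follows the first text;
determine, based on the second text and the first speech, the text content associated with the first speech.
According to one or more embodiments of the present disclosure, the determination module 104 is specifically configured to:
obtain a first identifier of the first text;
obtain from the first set, based on the first identifier, a first text feature associated with the various subsequent texts of the first text;
determine the second text based on the first text and the first text feature.
According to one or more embodiments of the present disclosure, the determination module 104 is specifically configured to:
determine, among the multiple text identifiers in the first set, a target identifier that is the same as the first identifier;
determine the text feature corresponding to the target identifier as the first text feature.
According to one or more embodiments of the present disclosure, the determination module 104 is specifically configured to:
determine, based on the first text, a context feature associated with the first text;
determine the second text based on the context feature and the first text feature.
According to one or more embodiments of the present disclosure, the determination module 104 is specifically configured to:
fuse the first text feature and the context feature to obtain a fused feature;
determine the second text based on the fused feature.
According to one or more embodiments of the present disclosure, the determination module 104 is specifically configured to:
perform text recognition on the first speech to obtain a third text;
determine, based on the second text and the third text, the text content associated with the first speech.
According to one or more embodiments of the present disclosure, the third acquisition module 103 is specifically configured to:
obtain the sample identifiers of multiple sample texts in the text collection and the sample text features corresponding to the subsequent texts of the sample texts;
determine an initial set based on the sample identifiers and the sample text features, where the initial set includes multiple sample identifiers and the sample text feature corresponding to each sample identifier;
update the multiple sample text features in the initial set based on the multiple sample texts to obtain the first set.
According to one or more embodiments of the present disclosure, for any first sample text among the multiple sample texts, the third acquisition module 103 is specifically configured to:
obtain a first sample identifier corresponding to the first sample text and a sample subsequent text of the first sample text;
determine a first sample text feature in the initial set based on the first sample identifier;
obtain a first frequency with which the combination of the first sample text and the sample subsequent text appears in the text collection, and a subsequent text feature corresponding to the sample subsequent text;
update the first sample text feature based on the first frequency and the subsequent text feature.
According to one or more embodiments of the present disclosure, the third acquisition module 103 is specifically configured to:
determine an update ratio of the first sample text feature based on the first frequency;
randomly update the vectors in the first sample text feature based on the update ratio and the subsequent text feature.
The speech recognition apparatus provided by the embodiments of the present disclosure can be used to carry out the technical solutions of the above method embodiments. Its implementation principle and technical effects are similar and are not repeated here.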
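FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. Referring to FIG. 11, it shows the structure of an electronic device 1100 suitable for implementing the embodiments of the present disclosure. The electronic device may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, personal digital assistants (PDA), tablet computers (Portable Android Device, PAD), portable media players (PMP), and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 11 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 11, the electronic device 1100 may include a processing device 1101 (for example, a central processing unit or a graphics processing unit), which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage device 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for the operation of the electronic device 1100. The processing device 1101, the ROM 1102, and the RAM 1103 are connected to each other through a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Generally, the following devices may be connected to the I/O interface 1105: an input device 1106 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 1107 including, for example, a liquid crystal display (LCD), speaker, or vibrator; a storage device 1108 including, for example, a magnetic tape or hard disk; and a communication device 1109. The communication device 1109 may allow the electronic device 1100 to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 11 shows the electronic device 1100 with various devices, it should be understood that it is not required to implement or provide all of the devices shown. More or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 1109, installed from the storage device 1108, or installed from the ROM 1102. When the computer program is executed by the processing device 1101, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: electric wire, optical cable, RF (radio frequency), or any suitable combination of the above.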
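Optionally, the embodiments of the present disclosure further include a speech system, which includes a speech-to-text apparatus and the first set (dictionary) described in any of the above embodiments, where the speech-to-text apparatus is configured to convert speech information into text information.
Optionally, the embodiments of the present disclosure further include a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in any of the above embodiments.
Optionally, the embodiments of the present disclosure further include a computer program product including a computer program which, when executed by a processor, implements the method described in any of the above embodiments.
Optionally, the electronic device in the embodiments of the present disclosure may include the above speech system, computer-readable storage medium, or computer program product.
The above computer-readable medium may be contained in the above electronic device, or may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.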
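Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or electronic device. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself; for example, the first acquisition unit may also be described as "a unit that acquires at least two Internet Protocol addresses".
The functions described above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.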
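It can be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from the user, prompt information is sent to the user to explicitly indicate that the operation requested will require obtaining and using the user's personal information. The user can then decide, according to the prompt information, whether to provide personal information to the software or hardware that performs the operations of the technical solution of the present disclosure, such as a terminal device, application, electronic device, or storage medium.
As an optional but non-limiting implementation, the prompt information may be sent to the user, in response to receiving the user's active request, in the form of a pop-up window in which the prompt information is presented as text. The pop-up window may also carry a selection control for the user to choose "agree" or "disagree" to providing personal information to the terminal device.
It can be understood that the above process of notifying the user and obtaining the user's authorization is only illustrative and does not limit the implementations of the present disclosure; other methods that comply with relevant laws and regulations may also be applied to the implementations of the present disclosure.
It can be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) should comply with the requirements of applicable laws, regulations, and related provisions. The data may include information, parameters, messages, and the like, such as traffic-switching indication information.
The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring that the operations be performed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments, individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims.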

Claims (13)

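  1. A speech recognition method, characterized by comprising:
    acquiring a first speech;
    acquiring a first text corresponding to the speech segment preceding the first speech;
    acquiring a first set, wherein the first set includes multiple text identifiers and a text feature corresponding to each of the multiple text identifiers, the text feature is a feature associated with the various subsequent texts of the text corresponding to the text identifier, the text feature is associated with the frequencies of the various subsequent texts of the text in a text collection, and the first set is determined based on the text collection;
    determining, based on the first text and the first set, the text content associated with the first speech.
  2. The method according to claim 1, characterized in that determining, based on the first text and the first set, the text content associated with the first speech comprises:
    determining, based on the first text and the first set, a second text that follows the first text;
    determining, based on the second text and the first speech, the text content associated with the first speech.
  3. The method according to claim 2, characterized in that determining, based on the first text and the first set, the second text that follows the first text comprises:
    obtaining a first identifier of the first text;
    obtaining from the first set, based on the first identifier, a first text feature associated with the various subsequent texts of the first text;
    determining the second text based on the first text and the first text feature.
  4. The method according to claim 3, characterized in that obtaining from the first set, based on the first identifier, the first text feature associated with the various subsequent texts of the first text comprises:
    determining, among the multiple text identifiers in the first set, a target identifier that is the same as the first identifier;
    determining the text feature corresponding to the target identifier as the first text feature.
  5. The method according to claim 3, characterized in that determining the second text based on the first text and the first text feature comprises:
    determining, based on the first text, a context feature associated with the first text;
    determining the second text based on the context feature and the first text feature.
  6. The method according to claim 5, characterized in that determining the second text based on the context feature and the first text feature comprises:
    fusing the first text feature and the context feature to obtain a fused feature;
    determining the second text based on the fused feature.
  7. The method according to any one of claims 2-6, characterized in that determining, based on the second text and the first speech, the text content associated with the first speech comprises:
    performing text recognition on the first speech to obtain a third text;
    determining, based on the second text and the third text, the text content associated with the first speech.
  8. The method according to any one of claims 1-6, characterized in that acquiring the first set comprises:
    obtaining the sample identifiers of multiple sample texts in the text collection and the sample text features corresponding to the subsequent texts of the sample texts;
    determining an initial set based on the sample identifiers and the sample text features, wherein the initial set includes multiple sample identifiers and the sample text feature corresponding to each sample identifier;
    updating the multiple sample text features in the initial set based on the multiple sample texts to obtain the first set.
  9. The method according to claim 8, characterized in that, for any first sample text among the multiple sample texts, updating the multiple sample text features in the initial set based on the first sample text comprises:
    obtaining a first sample identifier corresponding to the first sample text and a sample subsequent text of the first sample text;
    determining a first sample text feature in the initial set based on the first sample identifier;
    obtaining a first frequency with which the combination of the first sample text and the sample subsequent text appears in the text collection, and a subsequent text feature corresponding to the sample subsequent text;
    updating the first sample text feature based on the first frequency and the subsequent text feature.
  10. The method according to claim 9, characterized in that updating the first sample text feature based on the first frequency and the subsequent text feature comprises:
    determining an update ratio of the first sample text feature based on the first frequency;
    randomly updating the vectors in the first sample text feature based on the update ratio and the subsequent text feature.
  11. A speech recognition apparatus, characterized by comprising a first acquisition module, a second acquisition module, a third acquisition module, and a determination module, wherein:
    the first acquisition module is configured to acquire a first speech;
    the second acquisition module is configured to acquire a first text corresponding to the speech segment preceding the first speech;
    the third acquisition module is configured to acquire a first set, wherein the first set includes multiple text identifiers and a text feature corresponding to each of the multiple text identifiers, the text feature is a feature associated with the various subsequent texts of the text corresponding to the text identifier, the text feature is associated with the frequencies of the various subsequent texts of the text in a text collection, and the first set is determined based on the text collection;
    the determination module is configured to determine, based on the first text and the first set, the text content associated with the first speech.
  12. An electronic device, characterized by comprising a processor and a memory;
    the memory stores computer-executable instructions;
    the processor executes the computer-executable instructions stored in the memory, so that the processor performs the speech recognition method according to any one of claims 1-10.
  13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the speech recognition method according to any one of claims 1-10.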
PCT/CN2023/125743 2022-11-10 2023-10-20 语音识别方法、装置及电子设备 WO2024099055A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211407243.8 2022-11-10
CN202211407243.8A CN118016058A (zh) 2022-11-10 2022-11-10 语音识别方法、装置及电子设备

Publications (1)

Publication Number Publication Date
WO2024099055A1 true WO2024099055A1 (zh) 2024-05-16

Family

ID=90957332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/125743 WO2024099055A1 (zh) 2022-11-10 2023-10-20 语音识别方法、装置及电子设备

Country Status (2)

Country Link
CN (1) CN118016058A (zh)
WO (1) WO2024099055A1 (zh)

Also Published As

Publication number Publication date
CN118016058A (zh) 2024-05-10
