WO2019218818A1 - Speech recognition method and apparatus, computer-readable storage medium and computer device - Google Patents

Speech recognition method and apparatus, computer-readable storage medium and computer device

Info

Publication number
WO2019218818A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate word
score
word sequence
word
neural network
Prior art date
Application number
PCT/CN2019/082300
Other languages
English (en)
Chinese (zh)
Inventor
刘毅
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2019218818A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, computer readable storage medium, and computer device.
  • the technical solution of speech recognition is generally divided into two parts: front-end decoding and back-end processing.
  • the front end is mainly responsible for receiving input speech data and decoding it to obtain multiple possible sentences, and the back end then determines one of the multiple possible sentences obtained by the front end as the final speech recognition result.
  • the back end can input multiple possible sentences into the neural network to determine the final speech recognition result.
  • however, a large amount of text is needed, and it takes a long time to train a neural network that can eventually be put into use, so this speech recognition scheme is inefficient.
  • a speech recognition method is applied to a computer device, and the method includes:
  • acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence; extracting a preset number of word sequences with the highest first scores from the plurality of word sequences as candidate word sequences; identifying the domain in which the extracted candidate word sequences are located; inputting each candidate word sequence into the neural network of the corresponding domain according to the identified domain; re-scoring each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence; obtaining a final score of each candidate word sequence according to the first score and the second score; and using the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • a speech recognition device comprising:
  • a word sequence obtaining module configured to acquire a plurality of word sequences obtained by decoding the speech data, and a first score corresponding to each word sequence;
  • an extraction module configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
  • a domain identification module configured to identify the domain in which the extracted candidate word sequences are located;
  • a re-scoring module configured to input each candidate word sequence into the neural network of the corresponding domain according to the identified domain, and to re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence;
  • a speech recognition result determining module configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and to use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • a computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the operations of the speech recognition method described above, including using the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • a computer readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the operations of the speech recognition method described above, including using the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • In the above speech recognition method, apparatus, computer readable storage medium, and computer device, a plurality of word sequences obtained by decoding the speech data is acquired together with their corresponding first scores, and a number of candidate word sequences is selected from the plurality of word sequences.
  • Each candidate word sequence is then input into the neural network corresponding to its domain, the neural network re-scores each candidate word sequence to obtain a second score, the final score of each candidate word sequence is determined from the first score and the second score, and the candidate word sequence with the highest final score is selected as the speech recognition result of the speech data.
  • This speech recognition method identifies the domain of the candidate word sequences before re-scoring them, so the candidate word sequences can be re-scored using the neural network corresponding to the domain to which they belong.
  • Re-scoring with the domain-specific neural network not only yields a more accurate second score; since each domain has its own corresponding neural network, each domain uses only its own text to train its network, and there is no need to train on all the text. The training time for a domain-specific neural network is therefore greatly shortened, further improving the efficiency of speech recognition.
  • FIG. 1 is an application environment diagram of a speech recognition method in an embodiment;
  • FIG. 2 is a schematic flowchart of a speech recognition method in an embodiment;
  • FIG. 3 is a schematic flowchart of step 206 in an embodiment;
  • FIG. 4 is a schematic flowchart of a training step of a neural network in an embodiment;
  • FIG. 5 is a schematic diagram of input data and output data of a neural network in an embodiment;
  • FIG. 6 is a schematic flowchart of the steps after step 214 in an embodiment;
  • FIG. 7 is a schematic flowchart of a speech recognition method in another embodiment;
  • FIG. 8 is a schematic diagram of the training process of the neural network corresponding to each domain in an embodiment;
  • FIG. 9 is a schematic flowchart of decoding speech data by a front end in an embodiment;
  • FIG. 10 is a schematic illustration of a word sequence in an embodiment;
  • FIG. 11 is a schematic flowchart of re-scoring candidate word sequences with a semantic classification model in an embodiment;
  • FIG. 12 is a schematic flowchart of re-scoring candidate word sequences with a semantic classification model in another embodiment;
  • FIG. 13 is a structural block diagram of a speech recognition apparatus in an embodiment;
  • FIG. 14 is a structural block diagram of a domain identification module in another embodiment;
  • FIG. 15 is a structural block diagram of a computer device in an embodiment.
  • FIG. 1 is an application environment diagram of a voice recognition method in an embodiment.
  • the speech recognition method is applied to a speech recognition system.
  • the speech recognition system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically include at least one of a mobile phone, a tablet computer, a notebook computer, and the like.
  • the server 120 can be implemented by a stand-alone server or a server cluster composed of a plurality of servers.
  • a speech recognition method is provided. This embodiment is exemplified by applying the method to the server 120 in FIG. 1 described above. Referring to FIG. 2, the voice recognition method specifically includes the following steps:
  • Step 202: Acquire a plurality of word sequences obtained by decoding the speech data, and a first score corresponding to each word sequence.
  • the server can obtain a plurality of word sequences obtained by decoding the voice data, and the decoding operation can be performed by the terminal.
  • the terminal can use the trained acoustic model, the language model, and the pronunciation dictionary to decode the acquired speech data, so that a plurality of word sequences can be obtained.
  • the decoding operation can also be performed by the server.
  • A word sequence refers to a plurality of words obtained by decoding the speech data together with the paths connecting those words; the sentence obtained by connecting the words of a path in order, from its start position to its end position, can be regarded as a recognition result. That is, each word sequence can be regarded as one recognition result.
  • each path of each word sequence has a corresponding score, that is, the first score.
  • the first score can be thought of as a score calculated from the probability of occurrence of each path.
  • Step 204: Extract a preset number of word sequences with the highest first scores from the plurality of word sequences as candidate word sequences.
  • Each word sequence has a corresponding first score, and the word sequences with the highest first scores can be extracted.
  • "Highest first scores" here means that the word sequences are sorted in descending order of the first score, so the word sequence with the highest first score is ranked first.
  • Extracting the top-ranked word sequences means extracting from the head of this ordering. The number of word sequences to extract can be preset, so a preset number of word sequences with the highest first scores can be extracted as candidate word sequences; the preset number can be adjusted at the technician's discretion.
  • Alternatively, if the word sequences are sorted in ascending order of the first score, the word sequence with the highest score is ranked last; extracting the preset number of highest-scoring word sequences then means starting from the last position and extracting the preset number of word sequences moving backwards through the ordering.
  • Step 206: Identify the domain in which the extracted candidate word sequences are located.
  • Step 208: Input each candidate word sequence into the neural network of the corresponding domain according to the identified domain.
  • After the preset number of candidate word sequences is extracted, the domain in which they are located can be identified.
  • The domains can be customized by the technician, for example dividing them into navigation, music, and so on.
  • There are multiple candidate word sequences, but the identified domain may be a single one; that is, the multiple candidate word sequences belong to the same domain. After the domain corresponding to the candidate word sequences is identified, the candidate word sequences may each be input into the neural network of that domain.
  • That is, each domain has its own corresponding neural network, and the neural network of each domain can re-score the candidate word sequences of its own domain in a targeted manner.
  • Step 210: Re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence.
  • Step 212: Obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence.
  • Step 214: Use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • For example, the neural network of the "navigation" domain can re-score each input candidate word sequence to obtain a second score for each candidate word sequence.
  • Re-scoring means that the neural network corresponding to the domain of the candidate word sequences performs a score calculation on each input candidate word sequence to obtain its second score. The first score and the second score of each candidate word sequence are then combined in a certain manner; for example, the first score and the second score may each be weighted, and the weighted scores added. The final score of each candidate word sequence is obtained from its first and second scores, and the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • Each candidate word sequence is input into the neural network of its corresponding domain; after the second score produced by the neural network's re-scoring is obtained, the final score of each candidate word sequence can be determined from its first score and second score, and the candidate word sequence with the highest final score can be selected as the speech recognition result of the speech data.
  • This speech recognition method identifies the domain of the candidate word sequences before re-scoring them, so the candidate word sequences can be re-scored using the neural network of the corresponding domain. The second score obtained in this way is more accurate; moreover, since each domain has its own corresponding neural network and each domain trains its network using only its own text, there is no need to train on all the text, so the training time for a domain-specific neural network is greatly shortened, further improving the efficiency of speech recognition.
  • step 206 includes:
  • Step 302: Input each candidate word sequence into a semantic classification model.
  • Step 304: Classify each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence.
  • Step 306: Take the domain corresponding to the classification label with the largest proportion among the classification labels as the domain in which the extracted candidate word sequences are located.
  • Each candidate word sequence can be input into the trained semantic classification model.
  • The trained semantic classification model is a semantic classification model trained in advance on a large number of training samples, and is used to identify the domain of the input candidate word sequences.
  • The trained semantic classification model classifies each candidate word sequence and outputs a classification label for each, and the domain corresponding to each candidate word sequence can be determined from its classification label.
  • The trained semantic classification model may produce different classification results for different candidate word sequences, which means the model judges the candidate word sequences to belong to different domains.
  • To settle on a single domain for the candidate word sequences, the classification labels of all candidate word sequences are collected, and the label with the largest proportion is taken as the classification label of all the candidate word sequences; that is, the domain corresponding to the most frequent classification label is the domain in which all the candidate word sequences are located.
  • For example, suppose the classification label of candidate word sequence A is 1, that of B is 1, that of C is 2, and that of D is 1, where 1 represents the navigation class and 2 represents the music class. Among the candidate word sequences A, B, C, and D, label 1 has the largest proportion, so label 1 is taken as the classification label of A, B, C, and D, and the domain of A, B, C, and D is considered to be the navigation class.
  • The domain of the extracted candidate word sequences is identified by the semantic classification model, and when the candidate word sequences disagree about the domain, the domain corresponding to the classification label with the largest proportion is used as the domain of all the candidate word sequences.
  • This ensures the accuracy of the identified domain, which in turn improves the accuracy of the subsequent re-scoring of the candidate word sequences and thus the accuracy of the speech recognition result.
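  • A minimal sketch of steps 302-306, with the semantic classification model stubbed out (the label mapping and the classifier are illustrative stand-ins):

```python
# Classify every candidate word sequence, then take the majority
# classification label as the shared domain (e.g. labels 1, 1, 2, 1 -> 1).
from collections import Counter

DOMAIN_NAMES = {1: "navigation", 2: "music"}  # illustrative label mapping

def classify_stub(sequence: str) -> int:
    """Placeholder for the trained semantic classification model."""
    return 1 if "go to" in sequence else 2

def identify_domain(candidates):
    labels = [classify_stub(seq) for seq in candidates]
    # The label with the largest proportion becomes the label of all
    # candidate word sequences.
    majority_label, _ = Counter(labels).most_common(1)[0]
    return DOMAIN_NAMES[majority_label]

print(identify_domain(["go to A", "go to B", "play C", "go to D"]))  # navigation
```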
  • The speech recognition method further includes: acquiring the text corresponding to each domain; converting each word in the text corresponding to each domain into a word vector; and, for each domain, taking the corresponding word vectors as input to train the neural network corresponding to that domain.
  • Before a neural network is actually used, it can be trained according to actual needs; the trained neural network can then be put into use to re-score the input candidate word sequences. Since each domain has its own corresponding neural network, the text corresponding to each domain must be obtained when training that domain's neural network.
  • The text can be sentences. After the sentences of each domain are obtained, each sentence of the domain can be processed by word segmentation to obtain the words of each sentence; the words contained in each sentence are converted into word vectors, which serve as the input data of the domain's neural network, and the neural network of each domain is trained in this way. After training, the trained neural network of each domain is obtained.
  • Pre-training the neural network of each domain ensures the accuracy of re-scoring candidate word sequences when the networks are actually used, thereby improving the accuracy and the efficiency of speech recognition.
  • Taking the word vectors as input and training the neural network corresponding to each domain includes: for each domain, taking the word vector corresponding to each word in the domain's text as input, in the order of the words in the text; taking the word vector corresponding to the next word after each input word as output; and adjusting the parameters of the domain's neural network to train it.
  • The words contained in the text are converted into word vectors, and the word vectors corresponding to each domain's text are input into the corresponding neural network for training.
  • The word vector of each word is input in the order of the words in the text, and the word vector corresponding to the next word after each input word is taken as the output.
  • The input data corresponding to each output word is not only the word input at the previous moment, but the words input at the previous moment and at all earlier moments.
  • By adjusting the parameters of the neural network, the network is trained; the trained neural network corresponding to each domain is thus obtained, and the scores produced when the trained neural network re-scores candidate word sequences are reliable.
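  • A minimal PyTorch sketch of this training setup (the patent names no framework; this is one possible realization). Word vectors go in, and the network is trained to predict the next word; vocabulary size, dimensions, and the sample sentence are illustrative:

```python
import torch
import torch.nn as nn

class DomainLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word -> word vector
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)       # next-word scores

    def forward(self, word_ids):
        vectors = self.embed(word_ids)    # convert word ids to word vectors
        states, _ = self.rnn(vectors)     # each state sees all prior words
        return self.out(states)

vocab_size = 1000
model = DomainLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One sentence of the domain as word ids; the inputs are the words in order
# and the targets are the same words shifted one position to the left.
sentence = torch.tensor([[3, 17, 52, 9, 24]])
inputs, targets = sentence[:, :-1], sentence[:, 1:]

optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()  # adjust the parameters of this domain's network
```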
  • the voice recognition method further includes a training step of the neural network, including:
  • Step 402: Acquire the text corresponding to each domain.
  • Step 404: Convert each word in the text corresponding to each domain into a word vector.
  • Step 406: For each domain, in the order of the words in the domain's text, take the word vector corresponding to each word in the text as input and the word vector corresponding to the next word after each input word as output, and train the neural network by adjusting the parameters of the domain's neural network.
  • the neural network can be trained in advance, and the trained neural network can be obtained and then applied.
  • Each domain has its own corresponding neural network, that is, it needs to train the corresponding neural network in each field separately.
  • A word vector can be an N-dimensional vector, where N is a positive integer.
  • The text corresponding to each domain can be regarded as the sentences of that domain. A word segmentation tool can be used to segment each sentence into words, so each sentence yields multiple words, and the words contained in each sentence can be converted into word vectors and input into the domain's neural network.
  • The word vectors of the words contained in each sentence serve as the input of the neural network, and for each input word, the next word in the sentence serves as the output for that word's vector.
  • The output data is the word vector corresponding to the next word after each input word; however, for each output, the input data is not only the word vector of the most recently input word, but the word vectors of the most recently input word and of all previously input words.
  • With the input data and output data of the neural network set in this way, the parameters of the neural network are continuously adjusted to train it.
  • For example, suppose sentence A is "I was late for school yesterday". A word segmentation tool can split the original (Chinese) sentence into: I / yesterday / going to school / late / le (where "le" is a Chinese sentence-final particle with no direct English equivalent).
  • The word vectors corresponding to the words of sentence A are input into the neural network in order, where x1, x2, x3, x4, x5, and x6 are the input word vectors.
  • A blank input is added before the word vectors; that is, when the word vectors of sentence A are input into the neural network, the word vector corresponding to "I" is not the first input but the second, and the first input is empty by default.
  • The word vector corresponding to the next word after each input word is the corresponding output; that is, the output corresponding to the first, default-empty input is the word vector of the word "I", and the output corresponding to the input word "I" is the word vector of the word "yesterday".
  • Here x1 to x6 correspond to the blank word, "I", "yesterday", "going to school", "late", and "le" respectively, and y1 to y6 are the outputs corresponding to the input word vectors, with corresponding words "I", "yesterday", "going to school", "late", and "le"; that is, the output for the word "going to school" is the word vector of its next word, "late".
  • For the output word "yesterday", the inputs so far consist only of the word "I" and the default blank word; but for the output word "le", the inputs are all the words entered before it, that is, the blank word together with "I", "yesterday", "going to school", and "late". In other words, each output word vector is related to the input at the previous moment and the inputs at all earlier moments; each output word is related to the word input at the previous moment and the words input at all earlier times.
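  • A minimal sketch of how the training pairs for sentence A are laid out. The English tokens stand in for the segmented Chinese words, and the end marker is an assumption added so every input has a target:

```python
# Prepend the default blank word; the target for each input word is the
# next word in the sentence.
BLANK = "<blank>"  # the default empty first input described above
tokens = ["I", "yesterday", "going to school", "late", "le"]

inputs = [BLANK] + tokens       # x1..x6
targets = tokens + ["<end>"]    # y1..y6: the word after each input (assumed end marker)
for x, y in zip(inputs, targets):
    print(f"{x} -> {y}")
# <blank> -> I, I -> yesterday, ..., le -> <end>
```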
  • The training process of the neural network is a process of constantly adjusting its parameters.
  • The parameters are adjusted until the technician determines that they meet the requirements, at which point the neural network can be considered trained.
  • A verification approach may be adopted: after the parameters of the neural network are set, the network is verified using a large amount of verification sample data.
  • The verification data is input into the neural network, and the prediction accuracy of the network under the current parameters is measured on the verification data.
  • If the prediction accuracy of the neural network reaches the standard value set by the technician, the parameters can be considered to meet the requirements; otherwise the parameters must be adjusted further until the network passes verification.
  • Training the neural network of each domain with a large amount of that domain's text ensures the reliability of the trained network and improves the efficiency of its data processing, thereby improving the efficiency and accuracy of speech recognition.
  • the final score of the candidate word sequence is obtained according to the first score and the second score corresponding to the candidate word sequence, including: performing weighted summation on the first score and the second score corresponding to the candidate word sequence to obtain a candidate The final score of the word sequence.
  • The terminal decodes the speech data to obtain a plurality of word sequences, each with a corresponding first score, and a preset number of word sequences is selected from them as candidate word sequences; each candidate word sequence therefore has a corresponding first score.
  • The trained neural network re-scores each candidate word sequence to obtain a second score, so each candidate word sequence also has its own corresponding second score. Once the first score and the second score of each candidate word sequence are available, its final score can be computed.
  • A weighted calculation may be performed; the weights of the first score and the second score may be equal or different, depending on the technician's setting. For example, if the technician considers the first score more accurate, or more influential on the speech recognition result, the weight of the first score may be increased. After weighting both scores, the weighted first score and weighted second score are summed to obtain the final score of each candidate word sequence.
  • The influence of the first score and the second score on the speech recognition result may be adjusted at any time by changing their weights; that is, the relative proportion of the acoustic side and the semantic side can be tuned at any time, which further ensures the reliability of the final score and makes the selected speech recognition result more accurate.
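  • A minimal sketch of this weighted combination and the selection in steps 212-214 (the function names and the default weights are illustrative; the actual weights are set by the technician):

```python
def final_scores(candidates, w1=0.5, w2=0.5):
    """candidates: (sequence, first_score, second_score) triples."""
    # final score = first score * first weight + second score * second weight
    return [(seq, w1 * s1 + w2 * s2) for seq, s1, s2 in candidates]

def recognize(candidates, w1=0.5, w2=0.5):
    # The candidate word sequence with the highest final score becomes the
    # speech recognition result.
    return max(final_scores(candidates, w1, w2), key=lambda t: t[1])[0]
```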
  • In an embodiment, the neural network is a recurrent neural network.
  • The recurrent neural network here is an RNNLM (Recurrent Neural Network based Language Model), which may also be called a recurrent-network language model.
  • The recurrent-network language model can take into account multiple previously input words and compute the probability of the next word based on the long text formed by them, so it has a better "memory effect". For example, after "my" and "mood", either "good" or "bad" may appear; the appearance of these words depends on the earlier appearance of "my" and "mood", and this is the "memory effect".
  • The recurrent-network language model maps each input word into a compact continuous vector space, uses a relatively small set of parameters for that space, and uses recurrent connections to build the model, so it can capture long-distance context dependencies.
  • A language model commonly used in large-vocabulary continuous speech recognition, the NGRAM model (a statistical language model), relies only on the first N-1 preceding words of the input, where N is a positive integer, whereas the recurrent-network language model can capture the history of all previously input words. Compared with a traditional language model that depends only on the preceding N-1 words, the recurrent-network language model therefore scores the input candidate word sequences more accurately.
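  • The contrast can be stated explicitly. The following formulas are the standard formulations of the two models, not quoted from the patent; here x_t denotes the word vector of w_t and h_t the hidden state:

```latex
% N-gram assumption: each word depends only on the previous N-1 words
P(w_1,\dots,w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1},\dots,w_{t-1})

% Recurrent-network language model: the hidden state summarizes the whole history
h_t = f(h_{t-1}, x_t), \qquad
P(w_{t+1} \mid w_1,\dots,w_t) = \mathrm{softmax}(W h_t + b)
```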
  • The method further includes: performing entity extraction on the speech recognition result to obtain entity words; retrieving the entity words; and, when the retrieval result is inconsistent with the entity word, repairing the entity word.
  • The final score of each candidate word sequence is calculated, and the candidate word sequence with the highest final score is selected as the speech recognition result of the speech data.
  • The speech recognition result may then undergo entity repair. First, entity extraction is performed on the speech recognition result.
  • An entity is a word that carries the key information in the sentence of the speech recognition result. For example, in "I want to go to the Window of the World", the "Window of the World" is the focus of the sentence, while "I want to go" merely expresses the speaker's intention; the place finally acted upon is the "Window of the World".
  • When entity extraction is performed on the speech recognition result, several entity words may be obtained, or only one. After the entity words are extracted, each entity word is retrieved, and when a retrieved result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the retrieval result; the entity words are repaired in this way.
  • Typos in recognition results are generally concentrated in domain-specific entities. Extracting the entities in the speech recognition result and repairing the entity words accordingly therefore reduces the probability of typos appearing in the result and increases the accuracy of the final speech recognition result.
  • Retrieving the entity words comprises: determining the domain of the speech recognition result from the domain in which the candidate word sequences are located; and retrieving each entity word in a database corresponding to the domain of the speech recognition result.
  • When the speech data is decoded, a plurality of word sequences is obtained, several word sequences are extracted from them as candidate word sequences, and one candidate word sequence is selected as the speech recognition result of the speech data; the domain of the speech recognition result can therefore be determined from the domain in which the extracted candidate word sequences are located.
  • Each entity word can then be retrieved in the database corresponding to that domain.
  • In an embodiment, after step 214 the following steps are further included:
  • Step 602: Perform entity extraction on the speech recognition result to obtain entity words.
  • Step 604: Determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located.
  • Step 606: Retrieve each entity word in a database corresponding to the domain of the speech recognition result.
  • Step 608: When a retrieval result is inconsistent with the entity word, repair the entity word.
  • The entities of the speech recognition result can be extracted; an entity is a word that carries the key information in the sentence of the speech recognition result. In the earlier example "I want to go to the Window of the World", the key location is the "Window of the World".
  • There may be multiple candidate word sequences, but they generally correspond to a single domain; that is, the multiple candidate word sequences belong to the same domain. Since the speech recognition result is the candidate word sequence with the highest final score, the domain of the speech recognition result can be determined from the domain in which the extracted candidate word sequences are located.
  • Each entity word can be retrieved in the database corresponding to the domain of the speech recognition result; that is, each domain has its own corresponding database.
  • For example, suppose the speech recognition result is "I want to go to the Emperor Building". Entity extraction on this sentence yields the entity word "Emperor Building", and the domain of the speech recognition result is "navigation", so the entity word "Emperor Building" is retrieved in the database corresponding to the "navigation" domain, that is, among place names.
  • Suppose the retrieval result is "Diwang Building" (in the original Chinese the two names are homophones).
  • The retrieval result is inconsistent with the entity word, so the entity word "Emperor Building" is replaced with the retrieval result "Diwang Building", repairing the entity word in the speech recognition result and yielding the corrected result "I want to go to the Diwang Building". In this way, typos and similar errors in the speech recognition result can be corrected through the entity repair operation, producing a more accurate speech recognition result.
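  • A minimal sketch of steps 602-608 follows; the domain database and the retrieval function are illustrative stand-ins (a real system would query, for example, a place-name database for the navigation domain):

```python
DOMAIN_DB = {
    "navigation": ["Diwang Building", "Window of the World"],
}

def retrieve(entity: str, domain: str) -> str:
    """Return the closest database entry for the entity word (stubbed here
    as a simple best-overlap match over the domain's entries)."""
    entries = DOMAIN_DB[domain]
    return max(entries, key=lambda e: len(set(e.split()) & set(entity.split())))

def repair(result: str, entity: str, domain: str) -> str:
    match = retrieve(entity, domain)
    # When the retrieval result is inconsistent with the entity word,
    # replace the entity word in the recognition result with it.
    return result.replace(entity, match) if match != entity else result

print(repair("I want to go to the Emperor Building", "Emperor Building",
             "navigation"))  # I want to go to the Diwang Building
```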
  • a speech recognition method is provided. This embodiment is exemplified by applying the method to the server 120 in FIG. 1 described above. Referring to FIG. 7, the voice recognition method specifically includes the following steps:
  • Step 702: Train the neural networks to obtain trained neural networks.
  • Each domain has its own corresponding neural network; therefore, the neural network of each domain is trained separately using that domain's text.
  • The trained neural networks correspond one-to-one with the domains, as shown in FIG. 8.
  • Each neural network has its own training text.
  • The semantic classification model identifies the domain in which the extracted candidate word sequences are located. After the domain of the extracted candidate word sequences is determined by the semantic classification model, each candidate word sequence is input into the neural network of the corresponding domain for re-scoring, so the domains of the neural networks correspond to the classification domains of the semantic classification model.
  • Therefore, during training, as shown in FIG. 8, the semantic classification model can be trained together with the neural networks so that the domains of the neural networks correspond to the classification domains of the semantic classification model.
  • The neural networks are used to recognize the speech, while the semantic classification model is used to analyze the semantics of the speech data; the neural networks therefore belong to the recognition side, and the semantic classification model belongs to the semantic side.
  • During training, the text corresponding to each domain is obtained.
  • After each word in the text is converted into a word vector, the word vectors are used as input data and, following the order of the words in the sentence, the word vector of the next word after each input word is used as the output data for the current input word; that is, each output word vector is related to the word vector input at the previous moment and to the word vectors input at all earlier moments.
  • As the neural network is trained with the word vectors of the input words, its parameters are continuously adjusted until suitable parameters are found, at which point the network has been trained and the trained neural network is obtained.
  • The neural network can be a recurrent neural network, also called a recurrent-network language model.
  • The advantage of the recurrent-network language model is its long memory effect; that is, the word vector of each output word is influenced by the word vectors of all previously input words.
  • The output data corresponding to the word vector of the currently input word is the word vector corresponding to the next word.
  • The output data is the word vector of the next word after each input word, but for each output, the input data is not only the word vector of the most recently input word; it is the word vectors of the most recently input word and of all previously input words. That is, each output word is related to the word input at the previous moment and to the words input at all earlier times.
  • With the output data corresponding to each input word vector set in this way, the recurrent-network language model is trained, and the trained neural network is obtained for practical use.
  • Step 704: Acquire a plurality of word sequences obtained by decoding the speech data, and a first score corresponding to each word sequence.
  • The speech decoding process can be performed by the front end, which can be the terminal; the speech data is decoded to obtain a plurality of word sequences and a first score corresponding to each word sequence.
  • As shown in FIG. 9, after receiving the speech data, the front end extracts feature data and uses the trained acoustic model, the language model, and the pronunciation dictionary to search and decode the extracted features, obtaining multiple word sequences.
  • the acoustic model can be trained through the speech training set, the language model is trained by the text training set, and the trained acoustic model and language model can be put into practical use.
  • A word sequence can be thought of as multiple words together with multiple paths; it can also be called a lattice.
  • A lattice is essentially a directed acyclic graph: each node in the graph represents the end time of a word, and each edge represents a possible word together with the acoustic score and language model score of that word.
  • Each node stores the result of the speech recognition at the current position, including the acoustic probability, the language probability, and the like, as shown in FIG. 10.
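  • A minimal sketch of a lattice as just described: a directed acyclic graph whose nodes mark word end times and whose edges carry a word plus its acoustic and language model scores. Field names are illustrative, not from the patent:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Edge:
    word: str
    acoustic_score: float
    lm_score: float
    to_node: int

@dataclass
class Lattice:
    # node id -> outgoing edges; node ids are ordered by time,
    # so the graph is acyclic
    edges: Dict[int, List[Edge]] = field(default_factory=dict)

    def add_edge(self, from_node: int, edge: Edge) -> None:
        self.edges.setdefault(from_node, []).append(edge)

lat = Lattice()
lat.add_edge(0, Edge("play", -2.1, -1.3, 1))
lat.add_edge(1, Edge("Jay", -1.7, -0.9, 2))
```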
  • Step 706: Extract a preset number of word sequences with the highest first scores from the plurality of word sequences as candidate word sequences.
  • A preset number of word sequences with the highest first scores may be extracted from the word sequences as candidate word sequences. The optimal path computed from the lattice, that is, the path with the highest probability, does not necessarily match the actually spoken word sequence, so the preset number of highest-scoring word sequences is extracted as candidates instead; this is also called n-best extraction. The preset number can be set by the technician according to actual needs.
  • Step 708: Input each candidate word sequence into a semantic classification model.
  • Step 710: Classify each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence.
  • Step 712: Take the domain corresponding to the classification label with the largest proportion among the classification labels as the domain in which the extracted candidate word sequences are located.
  • Step 714: Input the candidate word sequences into the trained neural network of the corresponding domain according to the domain in which the extracted candidate word sequences are located.
  • Step 716: Re-score each candidate word sequence with the trained neural network to obtain a second score corresponding to each candidate word sequence.
  • Step 718: Weight and sum the first score and the second score corresponding to each candidate word sequence to obtain the final score of each candidate word sequence.
  • Step 720: Use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • After the preset number of word sequences is extracted as candidate word sequences, each candidate word sequence can be input into the trained semantic classification model, which classifies each candidate word sequence and outputs its classification label; the domain corresponding to each candidate word sequence can then be determined from its classification label.
  • The trained semantic classification model may produce different domain classifications for different candidate word sequences.
  • In that case, the classification labels of all candidate word sequences are collected, and the label with the largest proportion is taken as the classification label of all candidate word sequences; that is, the domain corresponding to the most frequent label is the domain of all the candidate word sequences. The domains of the multiple candidate word sequences are therefore ultimately the same; in general, the multiple candidate word sequences of the same speech data do not end up belonging to several different domains.
  • After the n-best extraction step produces the candidate word sequences, domain identification is performed on them; a semantic classification model can be used to identify the domain in which the candidate word sequences are located, as shown in FIG. 11. After domain identification, the candidate word sequences are input into the neural network of the corresponding domain.
  • Each domain has its own corresponding neural network; that is, the neural networks are divided by domain.
  • the neural network can be RNNLM. Compared to the NGRAM language model, RNNLM has better ability to preserve long-term memory.
  • the NGRAM language model is a commonly used language model in large vocabulary continuous speech recognition.
  • the model is based on the assumption that the appearance of the Nth word is only related to the previous N-1 words, and is not related to any other words.
  • the probability of a whole sentence is the product of the probability of occurrence of each word. These probabilities can be obtained by counting the number of simultaneous occurrences of N words from the corpus.
  • RNNLM is different: for example, after "my" and "mood", either "good" or "bad" may appear, and the appearance of these words depends on the earlier appearance of "my" and "mood"; this is the "memory effect".
  • Traditional language models such as the NGRAM language model rely only on the first N-1 preceding words when calculating the probability of a word; N is usually at most 5, and earlier words are ignored.
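  • A minimal sketch of the NGRAM counting just described, for N = 2 (bigrams) on a toy corpus; the corpus and counts are illustrative:

```python
# Bigram probabilities estimated by counting co-occurrences in a corpus:
# P(w | w_prev) = count(w_prev, w) / count(w_prev).
from collections import Counter

corpus = ["my mood is good", "my mood is bad", "my mood is good"]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(w_prev: str, w: str) -> float:
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("is", "good"))  # 2/3: "good" follows "is" in 2 of 3 sentences
```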
  • After the candidate word sequences are input into the neural network of the corresponding domain, they can be re-scored by the neural network to obtain their second scores, and the speech recognition result of the speech data is finally obtained.
  • Re-scoring can also be called rescoring: because the language model used to generate the lattice cannot be accurate enough, the n-best scores need to be re-adjusted with a larger language model, and this process is called rescoring.
  • For example, suppose the candidate word sequences and their first scores are: "playing Jay Chou's nunchaku" 94.7456, "playing Jay Chou's nunchaku" 95.3976, "playing Zhou Jielun's nunchaku" 94.7795, and so on (the candidates differ in the original Chinese as homophonic written variants, a distinction lost in translation).
  • These candidate word sequences are input into the semantic classification model for classification.
  • The semantic classification model recognizes that the candidate word sequences belong to the music domain, so each candidate word sequence is input into the neural network of the music domain for re-scoring, and the second score of each candidate word sequence is obtained: "playing Jay Chou's nunchaku" 93.3381, "playing Jay Chou's nunchaku" 95.0925, "playing Zhou Jielun's nunchaku" 94.1557, and so on.
  • After the second score of each candidate word sequence is obtained, the first score and the second score are weighted and summed to obtain the final score of each candidate word sequence.
  • That is: final score of each candidate word sequence = first score × first weight + second score × second weight.
  • The first weight and the second weight may be equal or unequal; they represent the respective proportions of the first score and the second score. When the first weight is greater than the second weight, the technician has increased the proportion of the first score, judging that the first score has a greater influence on the speech recognition result than the second score.
  • The first weight and the second weight may be determined from a large amount of test data, for example by using a large number of candidate word sequences and continuously adjusting the two weights until the candidate word sequence with the highest final score computed from them is indeed the most accurate speech recognition result; the resulting values of the first weight and the second weight are then used in the actual application as the parameters for computing the final score.
  • After the final score of each candidate word sequence is calculated, a further calculation step can be added, such as taking the logarithm of the final score as the selection score. For example, if the final score of the candidate sequence "playing Jay Chou's nunchaku" is 95.21454, its logarithm log10(95.21454) is taken; after the logarithm of each candidate word sequence's final score is computed, the logarithm serves as the selection score, and the candidate word sequence with the highest selection score is selected as the speech recognition result.
  • The calculation method can vary, and other calculation methods can be adopted as set by the technician.
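  • The example can be checked numerically. Weights of 0.4 and 0.6 reproduce the 95.21454 quoted above for the candidate with first score 95.3976 and second score 95.0925; these weights are inferred from the numbers, not stated in the text:

```python
import math

first_score, second_score = 95.3976, 95.0925   # scores from the example above
w1, w2 = 0.4, 0.6                              # inferred weights (assumption)
final = w1 * first_score + w2 * second_score
print(round(final, 5))     # 95.21454, matching the quoted final score
print(math.log10(final))   # ~1.9787, the selection score after taking log10
```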
  • Step 722: Perform entity extraction on the speech recognition result to obtain entity words.
  • Step 724: Determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located.
  • Step 726: Retrieve each entity word in a database corresponding to the domain of the speech recognition result.
  • Step 728: When a retrieval result is inconsistent with the entity word, repair the entity word.
  • After the speech recognition result of the speech data is obtained, entity repair can be performed on it.
  • First, entity extraction is performed on the speech recognition result.
  • An entity is a word that carries the key information in the sentence of the speech recognition result. For example, in "I want to go to the Emperor Building", the "Emperor Building" is the focus of the sentence, while "I want to go" merely expresses the speaker's intention; the place finally acted upon is the "Emperor Building".
  • When entity extraction is performed on the speech recognition result, several entity words may be obtained, or only one. After the entity words are extracted, each entity word can be retrieved. Since decoding the speech data yields multiple word sequences, several of which are selected as candidate word sequences and one of which becomes the speech recognition result, the domain of the speech recognition result can be determined from the domain in which the candidate word sequences are located.
  • Each entity word is then retrieved in the database corresponding to that domain, and when a retrieved result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the retrieval result; the entity words are repaired in this way.
  • For example, if the speech recognition result is "I want to go to the Emperor Building", the "Emperor Building" is retrieved according to the navigation domain to which the speech recognition result belongs, that is, among geographic location information. The retrieval reveals that the actual result should be "Diwang Building", so the whole sentence is corrected to "I want to go to the Diwang Building". The entity repair function can thus quickly and accurately repair this specific kind of recognition error.
  • In this embodiment, re-scoring with domain-specific models can greatly improve recognition accuracy. Experiments show that the speech recognition accuracy of individual domains increases by 4% to 16% relative, and the overall recognition accuracy increases by 6%. In addition, since each domain has its own corresponding neural network, the cost of training the neural networks can be greatly reduced.
  • A traditional speech recognition scheme needs a large amount of text to train its language model; to guarantee the re-scoring effect, the language model becomes very large, the training period very long, and evaluating the overall effect of the model difficult. Training an RNNLM on 8.2 GB of text takes about 100 hours, whereas after dividing by domain, the training time of a single-domain model can be shortened to 24 hours, greatly shortening the update time for a domain-specific model.
  • Moreover, the language model in the traditional technique cannot cover all the data comprehensively: once there is a problem in the language model, the entire model must be updated, and each update incurs enormous overhead. In this embodiment, only the neural network of the domain that needs processing or updating has to be retrained, which reduces the training cost and shortens the time required for the neural network update.
  • FIGS. 2 to 12 are schematic flowcharts of a voice recognition method in various embodiments. It should be understood that although the various steps in the flowcharts of the various figures are sequentially displayed as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Except as explicitly stated herein, the execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the various figures may comprise a plurality of sub-steps or stages, which are not necessarily performed at the same time, but may be performed at different times, the execution of these sub-steps or stages The order is also not necessarily sequential, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of the other steps.
  • In an embodiment, a speech recognition apparatus is provided, including:
  • the word sequence obtaining module 1302 is configured to acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence.
  • The extracting module 1304 is configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences.
  • the domain identification module 1306 is configured to identify the domain in which the extracted candidate word sequence is located.
  • The re-scoring module 1308 is configured to input each candidate word sequence into the neural network of the corresponding domain according to the identified domain, and to re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence.
  • the speech recognition result determining module 1310 is configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • the domain identification module 1306 includes:
  • the input module 1306A is configured to input each candidate word sequence into the trained semantic classification model.
  • The classification module 1306B is configured to classify each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence, and to take the domain corresponding to the classification label with the largest proportion among the classification labels as the domain of the extracted candidate word sequences.
  • In an embodiment, the apparatus further includes a training module (not shown) configured to acquire the text corresponding to each domain, convert each word in the text corresponding to each domain into a word vector, and take the word vectors corresponding to each domain as input to train the neural network of that domain.
  • The training module is further configured to, for each domain, input the word vector corresponding to each word in the text in the order of the words in the domain's text, take the word vector corresponding to the next word after each input word as output, and train the neural network by adjusting the parameters of the domain's neural network.
  • the re-scoring module 1308 is further configured to perform weighted summation on the first score and the second score corresponding to the candidate word sequence to obtain a final score of the candidate word sequence.
  • In an embodiment, the neural network is a recurrent neural network.
  • In an embodiment, the apparatus further includes an entity repair module (not shown) configured to perform entity extraction on the speech recognition result to obtain entity words, retrieve the entity words, and repair an entity word when the retrieval result is inconsistent with it.
  • The entity repair module is further configured to determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located, and to retrieve each entity word in a database corresponding to the domain of the speech recognition result.
  • Figure 15 is a diagram showing the internal structure of a computer device in one embodiment.
  • The computer device may specifically be the server 120 of FIG. 1.
  • As shown in FIG. 15, the computer device includes a processor, a memory, and a network interface connected by a system bus.
  • the memory comprises a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement a speech recognition method.
  • the internal memory can also store a computer program that, when executed by the processor, causes the processor to perform the speech recognition method described above.
  • It will be understood that FIG. 15 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • the speech recognition apparatus can be implemented in the form of a computer program that can be run on a computer device as shown in FIG. 15.
  • the program modules constituting the speech recognition apparatus, such as the word sequence acquisition module, the extraction module, the domain identification module, the re-scoring module, and the speech recognition result determination module shown in FIG. 13, may be stored in the memory of the computer device.
  • the computer program constituted by these program modules causes the processor to perform the steps of the speech recognition method of the various embodiments of the present application described in this specification.
  • the computer device shown in FIG. 15 can, by means of the word sequence acquisition module in the speech recognition apparatus shown in FIG. 13, acquire a plurality of word sequences obtained by decoding the speech data, together with a first score corresponding to each word sequence.
  • the computer device may, by means of the extraction module, extract from the plurality of word sequences a preset number of word sequences whose first scores rank highest as candidate word sequences.
  • the computer device can, by means of the domain identification module, identify the domain in which the extracted candidate word sequences are located.
  • the computer device may, by means of the re-scoring module, input each candidate word sequence into the neural network of the corresponding domain according to the domain in which the extracted candidate word sequences are located, and re-score each candidate word sequence through the neural network to obtain the second score corresponding to each candidate word sequence.
  • the computer device may, by means of the speech recognition result determining module, obtain a final score for each candidate word sequence according to the first score and the second score corresponding to that candidate word sequence, and use the candidate word sequence with the highest final score as the speech recognition result of the speech data; the sketch below ties these steps together.
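Tying the modules together, a hypothetical end-to-end sketch might read as follows; every callable passed in (`decoder`, `classify`, the per-domain models) is a stand-in for the corresponding module rather than an interface defined by this application:

```python
# Hypothetical end-to-end sketch of the recognition flow described above.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Candidate:
    words: str
    first_score: float
    second_score: float = 0.0

def recognize(voice_data, decoder, classify, domain_models,
              top_n=10, w1=0.5, w2=0.5):
    # Decode the voice data into word sequences with their first scores.
    candidates = [Candidate(w, s) for w, s in decoder(voice_data)]
    # Keep the preset number of sequences whose first scores rank highest.
    candidates = sorted(candidates, key=lambda c: c.first_score,
                        reverse=True)[:top_n]
    # Identify the domain by majority vote over the candidates' labels.
    domain = Counter(classify(c.words) for c in candidates).most_common(1)[0][0]
    # Re-score every candidate with the neural network of that domain.
    for c in candidates:
        c.second_score = domain_models[domain](c.words)
    # Weighted sum of both scores; the highest final score wins.
    best = max(candidates,
               key=lambda c: w1 * c.first_score + w2 * c.second_score)
    return best.words
```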
  • a computer apparatus comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, implements the steps of the speech recognition method provided in any one of the embodiments of the present application.
  • a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the speech recognition method provided in any one of the embodiments of the present application.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a speech recognition method and apparatus, a computer-readable storage medium, and a computer device. The method comprises: acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence; extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences; identifying, for the extracted candidate word sequences, the domain in which they are located; inputting each candidate word sequence into the neural network of the corresponding domain according to that domain; re-scoring the candidate word sequences by means of the neural network to obtain second scores; obtaining final scores according to the first scores and the second scores corresponding to the candidate word sequences; and using, among the extracted candidate word sequences, the candidate word sequence with the highest final score as the speech recognition result of the speech data. In this way, the time required to train the neural network for a specific domain is considerably shortened, and the efficiency of speech recognition is further improved.
PCT/CN2019/082300 2018-05-14 2019-04-11 Speech recognition method and apparatus, computer-readable storage medium and computer device WO2019218818A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810457129.3 2018-05-14
CN201810457129.3A CN108711422B (zh) 2018-05-14 2018-05-14 Speech recognition method and apparatus, computer-readable storage medium and computer device

Publications (1)

Publication Number Publication Date
WO2019218818A1 true WO2019218818A1 (fr) 2019-11-21

Family

ID=63869029

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/082300 WO2019218818A1 (fr) Speech recognition method and apparatus, computer-readable storage medium and computer device

Country Status (2)

Country Link
CN (1) CN108711422B (fr)
WO (1) WO2019218818A1 (fr)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN108711422B (zh) 2018-05-14 2023-04-07 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method and apparatus, computer-readable storage medium and computer device
  • CN110176230B (zh) 2018-12-11 2021-10-08 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method, apparatus, device and storage medium
  • CN111475129A (zh) 2019-01-24 2020-07-31 Beijing Jingdong Shangke Information Technology Co., Ltd. Method and device for displaying candidate homophones in speech recognition
  • CN110164020A (zh) 2019-05-24 2019-08-23 Beijing Dajia Internet Information Technology Co., Ltd. Vote creation method, apparatus, computer device and computer-readable storage medium
  • CN110534100A (zh) 2019-08-27 2019-12-03 Beijing Haitian Ruisheng Science Technology Co., Ltd. Chinese speech proofreading method and apparatus based on speech recognition
  • CN110797026A (zh) 2019-09-17 2020-02-14 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method, apparatus and storage medium
  • CN112992127B (zh) 2019-12-12 2024-05-07 Hangzhou Hikvision Digital Technology Co., Ltd. Speech recognition method and apparatus
  • CN110942775B (zh) 2019-12-20 2022-07-01 Beijing Oppo Telecommunications Co., Ltd. Data processing method and apparatus, electronic device and storage medium
  • CN111179916B (zh) 2019-12-31 2023-10-13 Guangzhou Baiguoyuan Information Technology Co., Ltd. Re-scoring model training method, speech recognition method and related apparatus
  • CN111508478B (zh) 2020-04-08 2023-04-11 Beijing ByteDance Network Technology Co., Ltd. Speech recognition method and apparatus
  • CN111651599B (zh) 2020-05-29 2023-05-26 Beijing Sogou Technology Development Co., Ltd. Method and apparatus for ranking speech recognition candidate results
  • CN112259084A (zh) 2020-06-28 2021-01-22 Beijing Wodong Tianjun Information Technology Co., Ltd. Speech recognition method, apparatus and storage medium
  • CN112669845B (zh) 2020-12-25 2024-04-12 Zhujian Intelligent Technology (Shanghai) Co., Ltd. Method and apparatus for correcting speech recognition results, electronic device and storage medium
  • CN112802476B (zh) 2020-12-30 2023-10-24 Shenzhen Zhuiyi Technology Co., Ltd. Speech recognition method and apparatus, server, and computer-readable storage medium
  • CN112802461B (zh) 2020-12-30 2023-10-24 Shenzhen Zhuiyi Technology Co., Ltd. Speech recognition method and apparatus, server, and computer-readable storage medium
  • CN113539272A (zh) 2021-09-13 2021-10-22 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method and apparatus, storage medium and electronic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182628A1 (en) * 2004-02-18 2005-08-18 Samsung Electronics Co., Ltd. Domain-based dialog speech recognition method and apparatus
CN102376305A (zh) * 2011-11-29 2012-03-14 Anhui USTC iFlytek Co., Ltd. Speech recognition method and system
CN106803422A (zh) * 2015-11-26 2017-06-06 Institute of Acoustics, Chinese Academy of Sciences Language model re-estimation method based on long short-term memory networks
CN106328147A (zh) * 2016-08-31 2017-01-11 University of Science and Technology of China Speech recognition method and apparatus
CN108711422A (zh) * 2018-05-14 2018-10-26 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method and apparatus, computer-readable storage medium and computer device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111554276A (zh) 2020-05-15 2020-08-18 Shenzhen Qianhai WeBank Co., Ltd. Speech recognition method, apparatus, device and computer-readable storage medium
CN111554275A (zh) 2020-05-15 2020-08-18 Shenzhen Qianhai WeBank Co., Ltd. Speech recognition method, apparatus, device and computer-readable storage medium
CN111554275B (zh) 2020-05-15 2023-11-03 Shenzhen Qianhai WeBank Co., Ltd. Speech recognition method, apparatus, device and computer-readable storage medium
CN111554276B (zh) 2020-05-15 2023-11-03 Shenzhen Qianhai WeBank Co., Ltd. Speech recognition method, apparatus, device and computer-readable storage medium
CN112885336A (zh) 2021-01-29 2021-06-01 Shenzhen Qianhai WeBank Co., Ltd. Training and recognition methods and apparatus for a speech recognition system, and electronic device
CN112885336B (zh) 2021-01-29 2024-02-02 Shenzhen Qianhai WeBank Co., Ltd. Training and recognition methods and apparatus for a speech recognition system, and electronic device
CN113689888A (zh) 2021-07-30 2021-11-23 Zhejiang Dahua Technology Co., Ltd. Abnormal sound classification method, system, apparatus and storage medium
CN113793597A (zh) 2021-09-15 2021-12-14 Unisound Intelligent Technology Co., Ltd. Speech recognition method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN108711422B (zh) 2023-04-07
CN108711422A (zh) 2018-10-26

Similar Documents

Publication Publication Date Title
WO2019218818A1 (fr) Speech recognition method and apparatus, computer-readable storage medium and computer device
US10216725B2 (en) Integration of domain information into state transitions of a finite state transducer for natural language processing
US10176804B2 (en) Analyzing textual data
US8959014B2 (en) Training acoustic models using distributed computing techniques
CN107480143B (zh) 基于上下文相关性的对话话题分割方法和系统
US7562014B1 (en) Active learning process for spoken dialog systems
US8731926B2 (en) Spoken term detection apparatus, method, program, and storage medium
WO2020186712A1 (fr) Procédé et appareil de reconnaissance vocale, et terminal
JP7266683B2 (ja) Information verification method, apparatus, device, computer storage medium and computer program based on voice dialogue
JP2019159654A (ja) Learning system and method for time-series information, and neural network model
Bluche et al. Predicting detection filters for small footprint open-vocabulary keyword spotting
CN104750677A (zh) Speech interpretation apparatus, speech interpretation method and speech interpretation program
CN115688779A (zh) Address recognition method based on self-supervised deep learning
Yusuf et al. Low resource keyword search with synthesized crosslingual exemplars
JP5975938B2 (ja) Speech recognition apparatus, speech recognition method and program
George et al. Unsupervised query-by-example spoken term detection using segment-based bag of acoustic words
CN113468311B (zh) Knowledge graph-based complex question answering method, apparatus and storage medium
CN115600595A (zh) Entity relation extraction method, system, device and readable storage medium
CN113342953A (zh) Government affairs question answering method based on multi-model integration
Zhang et al. Open-domain document-based automatic QA models based on CNN and attention mechanism
CN111090720A (zh) Method and apparatus for adding hot words
US20240144911A1 (en) Abbreviation classification for speech synthesis and recognition
Zhou et al. Research on matching method in humming retrieval
WO2022074760A1 (fr) Data processing device, data processing method and data processing program
Manghat et al. Few-shot meta multilabel classifier for low resource accented code-switched speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19803470

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19803470

Country of ref document: EP

Kind code of ref document: A1