WO2023273610A1 - Speech recognition method, apparatus, medium and electronic device - Google Patents

Speech recognition method, apparatus, medium and electronic device

Info

Publication number
WO2023273610A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
probability
audio frame
acoustic vector
text
Prior art date
Application number
PCT/CN2022/091477
Other languages
English (en)
French (fr)
Inventor
董林昊
马泽君
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023273610A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a speech recognition method, device, medium and electronic equipment.
  • the speech recognition can be performed by performing sequence alignment mapping through an alignment algorithm.
  • the model is usually trained by multi-task learning.
  • the knowledge accumulated through multi-task learning during training cannot be utilized. As a result, it is difficult for the model to perform speech recognition with the expected accuracy.
  • the present disclosure provides a speech recognition method, the method comprising:
  • the information amount sequence and the first probability sequence corresponding to the speech data are obtained, wherein the information amount sequence includes the information amount of each audio frame, and the first probability sequence includes a first text probability distribution of each predicted character corresponding to the speech data;
  • the target probability sequence includes a target text probability distribution for each of the predicted characters
  • the target text corresponding to the speech data is determined.
  • the obtaining an information sequence and a first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model includes:
  • the obtaining a second probability sequence according to the acoustic vector sequence and the second prediction model includes:
  • the probability corresponding to the preset character in the predicted probability distribution of the audio frame is deleted, and the predicted probability distribution obtained after deletion is normalized to obtain the text probability distribution of the audio frame.
  • the determining a target probability sequence according to the first probability sequence and the second probability sequence includes:
  • the target probability sequence is determined according to the first probability sequence and the third probability sequence.
  • the merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence includes:
  • the weighted sum of the text probability distributions of the audio frames in each audio frame combination is determined as the second text probability distribution of the predicted character corresponding to that audio frame combination, wherein the weight corresponding to each audio frame is determined based on the information amount of that audio frame that belongs to the audio frame combination.
  • the determining the target probability sequence according to the first probability sequence and the third probability sequence includes:
  • the weighted sum of the first text probability distribution of the predicted character in the first probability sequence and the second text probability distribution of the predicted character in the third probability sequence is determined as the target text probability distribution of the predicted character.
  • the first prediction model is a CIF model
  • the second prediction model is a CTC model
  • a speech recognition device comprising:
  • An encoding module configured to encode the received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence includes an acoustic vector of each audio frame of the speech data;
  • a first processing module configured to obtain an information amount sequence and a first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model, wherein the information amount sequence includes the information amount of each of the audio frames, and the first probability sequence includes a first text probability distribution of each predicted character corresponding to the speech data;
  • a second processing module configured to obtain a second probability sequence according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence includes a text probability distribution of each of the audio frames;
  • a first determining module configured to determine a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence includes a target text probability distribution of each of the predicted characters;
  • the second determination module is configured to determine the target text corresponding to the voice data according to the target probability sequence.
  • a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in the first aspect are implemented.
  • an electronic device including:
  • a processing device configured to execute the computer program in the storage device to implement the steps of any one of the methods in the first aspect.
  • the received speech data is encoded to obtain the acoustic vector sequence corresponding to the speech data. The first probability sequence and the second probability sequence can then be obtained from the acoustic vector sequence using the first prediction model and the second prediction model, respectively, and the two sequences can be combined into a comprehensively considered target probability sequence, so that the target text corresponding to the speech data is determined according to the target probability sequence.
  • the target probability sequence for speech recognition can thus be determined from the probability sequences output by the multiple prediction models that correspond to the multi-task learning used in training, so that speech recognition decoding can draw on the knowledge accumulated through multi-task learning during training. This can significantly improve the accuracy and efficiency of speech recognition and improve user experience.
  • FIG. 1 is a flowchart of a speech recognition method provided according to an embodiment of the present disclosure
  • Fig. 2 is a flow chart of an exemplary implementation of obtaining an information amount sequence and a first probability sequence corresponding to speech data according to an acoustic vector sequence and a first prediction model;
  • Fig. 3 is a block diagram of a speech recognition device provided according to an embodiment of the present disclosure.
  • FIG. 4 shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.
  • the term “comprise” and its variations are open-ended, i.e. “including but not limited to”.
  • the term “based on” means “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of a speech recognition method provided according to an embodiment of the present disclosure. As shown in Figure 1, the method may include:
  • In step 11, the received speech data is encoded to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence includes an acoustic vector of each audio frame of the speech data.
  • the received voice data can be encoded by a shared encoder obtained through pre-training, so that the voice data can be converted into an acoustic vector representation, that is, an acoustic vector sequence can be obtained.
  • the voice data per second can be divided into multiple audio frames, so as to perform data processing based on the audio frames.
  • the voice data per second can be divided into 100 audio frames for processing.
  • the audio frame of the speech data is encoded by the shared encoder, and the obtained acoustic vector sequence H can be expressed as:
  • $H = \{H_1, H_2, \ldots, H_U\}$, where $U$ is the number of audio frames from the beginning to the end of the speech in the speech data, that is, the length of the acoustic vector sequence.
  • In step 12, the information amount sequence and the first probability sequence corresponding to the speech data are obtained according to the acoustic vector sequence and the first prediction model, wherein the information amount sequence includes the information amount of each audio frame, and the first probability sequence includes a first text probability distribution of each predicted character corresponding to the speech data.
  • as described above, each second of speech data can be divided into 100 audio frames for processing, and the information amount of an audio frame represents how much information that audio frame contains.
  • the amount of information contained in each predicted character is the same by default, so the audio frames corresponding to each predicted character can be determined by accumulating the information amounts of the audio frames from left to right.
  • the first probability sequence can be obtained based on the acoustic vector of each predicted character.
  • the first prediction model may be a CIF (Continuous Integrate-and-Fire) model
  • the determined information amount sequence $W$ may be expressed as follows:
  • the first probability sequence $P^*$ can be expressed as follows:
  • In step 13, a second probability sequence is obtained according to the acoustic vector sequence and the second prediction model, wherein the second probability sequence includes the text probability distribution of each of the audio frames.
  • the second prediction model may be a CTC (Connectionist Temporal Classification) model, which may be understood as neural-network-based temporal classification.
  • the shared encoder, the first prediction model and the second prediction model can be trained separately, so that, after training, these models can produce the acoustic vector sequence, the information amount sequence, the first probability sequence and the second probability sequence.
  • the shared encoder, the first prediction model and the second prediction model can also be jointly trained end-to-end: the training data is input into the shared encoder, the vectors output by the shared encoder are input into both the first prediction model and the second prediction model, the output of the overall model is obtained by decoding the output of the first prediction model, and the end-to-end model is trained by multi-task learning based on the losses of the first prediction model and the second prediction model.
  • the model can be obtained through the end-to-end training described above, which ensures that the parameters of the shared encoder, the first prediction model and the second prediction model match one another.
  • In step 14, a target probability sequence is determined according to the first probability sequence and the second probability sequence, wherein the target probability sequence includes a target text probability distribution of each of the predicted characters.
  • the first probability sequence contains the probability distribution corresponding to the speech data determined based on the first prediction model
  • the second probability sequence contains the probability distribution corresponding to the speech data determined based on the second prediction model
  • In step 15, the target text corresponding to the speech data is determined according to the target probability sequence.
  • the target probability sequence includes a target text probability distribution corresponding to each predicted character.
  • for the target text probability distribution of the first predicted character, the word with the highest probability can be determined as the recognized character of that predicted character; the recognized characters of the second and subsequent predicted characters are determined from their target text probability distributions in the same way, and the target text is generated from the recognized characters.
  • the target text can also be determined by beam search (Beam Search): the N words with the highest probabilities in the target text probability distribution of the first predicted character are used as the candidate recognized characters of that predicted character; the target text probability distribution of the second predicted character is then combined with the probabilities of the previous candidate recognized characters to determine the N candidates corresponding to the second predicted character, and so on for subsequent predicted characters, so as to determine the target text with the highest probability for the entire speech data.
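The two decoding strategies described above can be sketched as follows. This is a minimal illustration and not the implementation of the disclosure: the function names `greedy_decode` and `beam_search`, the dictionary representation of a target text probability distribution, and the use of summed log-probabilities as the score of a candidate are all assumptions made for the example.

```python
import math


def greedy_decode(target_probs):
    """Pick the highest-probability word for each predicted character."""
    return [max(dist, key=dist.get) for dist in target_probs]


def beam_search(target_probs, beam_size):
    """Keep the beam_size best partial texts after each predicted character."""
    beams = [([], 0.0)]  # (words so far, accumulated log-probability)
    for dist in target_probs:
        candidates = [
            (words + [word], score + math.log(p))
            for words, score in beams
            for word, p in dist.items()
            if p > 0.0
        ]
        # Rank all extensions and keep only the best beam_size of them.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Hypothetical target probability sequence for two predicted characters.
target_probs = [
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.2, "b": 0.5, "c": 0.3},
]
best_words, best_score = beam_search(target_probs, beam_size=2)[0]
```

When the score is only the product of the per-character probabilities, as in this sketch, the best beam coincides with the greedy result; the beam becomes useful once an additional score that depends on earlier candidates is combined with it.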
  • an exemplary implementation of step 12, in which the information amount sequence and the first probability sequence corresponding to the speech data are obtained according to the acoustic vector sequence and the first prediction model, is as follows. As shown in FIG. 2, this step can include:
  • In step 21, the acoustic vector sequence is input into the first prediction model to obtain the information amount sequence.
  • the acoustic vector sequence may be input into a first prediction model, and then the first prediction model performs information amount prediction on each acoustic vector in the acoustic vector sequence.
  • a window centered on the acoustic vector $H_u$ of an audio frame can be input to a one-dimensional convolutional layer and then to a fully connected layer with a sigmoid activation and one output unit, which yields the information amount $W_u$ of that audio frame and thus the information amount sequence.
  • In step 22, the acoustic vectors of the audio frames in the acoustic vector sequence are combined according to the information amount sequence to obtain a character acoustic vector sequence, wherein the character acoustic vector sequence includes an acoustic vector corresponding to each predicted character.
  • the information amount corresponding to each predicted character is the same by default, so in the embodiments of the present disclosure the information amounts in the information amount sequence can be accumulated from left to right over the audio frames, and the audio frames whose accumulated information amount reaches a preset threshold are taken to correspond to one predicted character.
  • the preset threshold may be set according to actual application scenarios and experience, for example, the preset threshold may be set to 1, which is not limited in the present disclosure.
  • the acoustic vectors of the audio frames in the acoustic vector sequence may be combined according to the information amount sequence in the following manner:
  • in the order of the information amount sequence, sequentially acquire the information amount $W_i$ of audio frame $i$;
  • the information amount of the second audio frame can be divided into two parts: one part belongs to the current predicted character, and the remaining part belongs to the next predicted character.
  • the traversal of the information amounts then continues, with accumulation restarting from the remaining part of the second audio frame; that is, the information amount $W_{22}$ of the second audio frame and the information amount $W_3$ of the third audio frame are accumulated until the preset threshold is reached, which yields the audio frames corresponding to the next predicted character.
  • the information amounts of the subsequent audio frames are processed by analogy and combined in the above manner, so as to obtain the audio frames corresponding to each predicted character.
  • the weighted sum of the acoustic vectors of the audio frames corresponding to a predicted character can be determined as the acoustic vector corresponding to that predicted character.
  • the weight of the acoustic vector of each audio frame corresponding to the predicted character is the information amount of that audio frame within the predicted character: if the whole audio frame belongs to the predicted character, the weight is the information amount of the audio frame; if only part of the audio frame belongs to the predicted character, the weight is the information amount of that part.
  • the acoustic vector C1 corresponding to the predicted character can be expressed as:
  • the acoustic vector C2 corresponding to the predicted character can be expressed as:
  • the acoustic vectors of the audio frames in the acoustic vector sequence are combined according to the information amount sequence to obtain a character acoustic vector sequence, so as to process each predicted character.
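The accumulate-and-split procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the disclosure: the function name `integrate_and_fire`, the tolerance `eps`, and the choice to discard a trailing remainder that never reaches the threshold are all hypothetical. In the worked example, the first share of the second frame (called $W_{21}$ here, a name not used in the text) is 0.4 and the remaining share $W_{22}$ is 0.3, so that $C_1 = W_1 H_1 + W_{21} H_2$ and $C_2 = W_{22} H_2 + W_3 H_3$.

```python
def integrate_and_fire(vectors, weights, threshold=1.0, eps=1e-9):
    """Merge frame-level acoustic vectors into character-level vectors.

    Returns (char_vectors, groups). Each group lists the
    (frame index, information share) pairs behind one predicted character.
    """
    dim = len(vectors[0])
    char_vectors, groups = [], []
    acc_vector, acc_weight, group = [0.0] * dim, 0.0, []
    for i, (vec, w) in enumerate(zip(vectors, weights)):
        remaining = w
        # A single frame may complete one or more predicted characters.
        while acc_weight + remaining >= threshold - eps:
            used = threshold - acc_weight  # share that completes the character
            acc_vector = [a + used * v for a, v in zip(acc_vector, vec)]
            group.append((i, used))
            char_vectors.append(acc_vector)
            groups.append(group)
            remaining -= used
            acc_vector, acc_weight, group = [0.0] * dim, 0.0, []
        # The rest of the frame starts (or continues) the next character.
        if remaining > eps:
            acc_vector = [a + remaining * v for a, v in zip(acc_vector, vec)]
            acc_weight += remaining
            group.append((i, remaining))
    # Assumption: a trailing remainder below the threshold is discarded.
    return char_vectors, groups


# Three audio frames with information amounts W_1 = 0.6, W_2 = 0.7, W_3 = 0.7.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [0.6, 0.7, 0.7]
C, groups = integrate_and_fire(H, W, threshold=1.0)
```

Returning the groups alongside the character vectors is a design choice of this sketch: the same frame-to-character assignment and shares can then be reused to merge other frame-level quantities.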
  • In step 23, the character acoustic vector sequence is decoded to obtain the first probability sequence.
  • the character acoustic vector corresponding to each predicted character can be obtained in the manner described above and then decoded by a decoder, so as to obtain the first text probability distribution corresponding to each predicted character, that is, the probability that the predicted character is recognized as each candidate character.
  • the acoustic vectors of the audio frames can be combined based on the information amount of each audio frame to obtain the character acoustic vector corresponding to each predicted character, so that the frame-level speech data is mapped to a character-level representation. The method can therefore be applied to speech data of any length, which expands the range of use of the speech recognition method.
  • the processing efficiency of the speech recognition algorithm can be improved while the speech recognition method is simplified, providing effective data support for subsequent character determination.
  • an exemplary implementation of step 13, in which the second probability sequence is obtained according to the acoustic vector sequence and the second prediction model, is as follows, and this step may include:
  • the second prediction model can be a CTC model, which can determine a text sequence of any length for an acoustic vector sequence of a given length. In this prediction model, the input acoustic vector sequence corresponds to an alignment sequence of the same length, and the alignment sequence is then mapped to a text sequence.
  • before the alignment sequence is determined, the acoustic vector of each audio frame may be mapped to a probability distribution over the character dimensions, which is the predicted probability distribution of that audio frame.
  • a null character is introduced in the CTC model; it has no meaning and is removed when the alignment sequence is mapped to the output text sequence. When repeated characters are merged in the CTC model, consecutive repeated characters between null characters are merged, while repeated characters separated by a null character are not merged, thereby ensuring the accuracy of the recognized text obtained by speech recognition.
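The mapping from an alignment sequence to a text sequence can be sketched as follows. This is a minimal illustration of the standard CTC collapsing rule; the function name `ctc_collapse` and the placeholder string `"<null>"` for the null character are assumptions made for the example.

```python
from itertools import groupby

NULL = "<null>"  # hypothetical placeholder for the null character


def ctc_collapse(alignment, null=NULL):
    """Map a frame-level alignment sequence to an output text sequence."""
    # Merge consecutive repeats first, then drop the null characters, so that
    # repeats separated by a null character survive as distinct characters.
    merged = [symbol for symbol, _ in groupby(alignment)]
    return [symbol for symbol in merged if symbol != null]


alignment = ["h", "h", NULL, "e", "l", "l", NULL, "l", "o", "o"]
text = ctc_collapse(alignment)  # ['h', 'e', 'l', 'l', 'o']
```

The two "l" characters survive because a null character separates them, whereas the repeated "h" and "o" frames are merged.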
  • the predicted probability distribution of the second prediction model can be processed in the following manner to ensure that the prediction results of the first prediction model and the second prediction model are consistent.
  • the probability corresponding to the null character in the predicted probability distribution of the audio frame may be deleted, so as to retain the probability distribution of the real characters corresponding to the audio frame.
  • the probability mass remaining after the deletion is not necessarily the same for every audio frame. Therefore, the predicted probability distribution obtained after the probability of the preset character is deleted can be normalized.
  • the predicted probability distribution of an audio frame $K$ is $\{\varnothing: p_1;\ s_1: p_2;\ s_2: p_3;\ \ldots;\ s_{n-1}: p_n\}$, where the null character is written $\varnothing$ here
  • the sum of $p_1, p_2, \ldots, p_n$ is 1, and each audio frame corresponds to $n$ character dimensions, which contain one null character $\varnothing$ and $n-1$ real characters; the entry $\varnothing: p_1$ is deleted from the distribution, and the remaining probabilities corresponding to real characters are normalized:
  • the second probability sequence $P'$ can be obtained as follows: $P' = \{P'_1, P'_2, \ldots, P'_n\}$.
  • the predicted probability distribution of each audio frame can be obtained through the second prediction model, and the probability of the invalid character can then be deleted from it to obtain the probability distribution of each audio frame over real characters. This keeps the characters consistent with those in the first probability sequence obtained by the first prediction model and provides a unified standard for subsequent speech recognition based on the first probability sequence and the second probability sequence, so that both share the same baseline, which can improve the accuracy of speech recognition to a certain extent.
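The deletion and normalization described above can be sketched as follows. This is a minimal illustration that assumes the usual form of normalization, in which each remaining probability is divided by the sum of the remaining probabilities, $p'_i = p_i / \sum_{j=2}^{n} p_j$; the function name `drop_null_and_normalize` and the placeholder `"<null>"` are hypothetical.

```python
def drop_null_and_normalize(frame_dist, null="<null>"):
    """Delete the null-character probability and renormalize the rest."""
    real = {ch: p for ch, p in frame_dist.items() if ch != null}
    total = sum(real.values())
    if total <= 0.0:
        # Assumption: a frame with all of its mass on the null character
        # cannot be normalized and is reported as an error in this sketch.
        raise ValueError("frame has no probability mass on real characters")
    return {ch: p / total for ch, p in real.items()}


# Hypothetical predicted probability distribution of one audio frame.
frame_dist = {"<null>": 0.5, "a": 0.3, "b": 0.2}
text_dist = drop_null_and_normalize(frame_dist)  # a: 0.6, b: 0.4
```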
  • an exemplary implementation manner of determining the target probability sequence according to the first probability sequence and the second probability sequence is as follows, and this step may include:
  • the first probability sequence contains the first text probability distribution of each of the predicted characters
  • the second probability sequence contains the text probability distribution of each audio frame. To bring the two sequences to the same level of representation, the text probability distributions of the audio frames in the second probability sequence can be merged based on the information amount sequence, which converts the second probability sequence into probability distributions at the level of the predicted characters, that is, the third probability sequence.
  • an exemplary implementation manner of merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence is as follows, and this step may include:
  • each audio frame belonging to the same audio frame combination can be determined according to the above-mentioned manner of merging the acoustic vectors of the audio frames in the acoustic vector sequence according to the information amount sequence, which will not be repeated here.
  • the weighted sum of the text probability distributions of the audio frames in each audio frame combination is determined as the second text probability distribution of the predicted character corresponding to that audio frame combination, wherein the weight corresponding to each audio frame is determined based on the information amount of that audio frame that belongs to the audio frame combination.
  • the second text probability distribution of the predicted characters corresponding to the audio frame combination may be determined according to the weight corresponding to each audio frame in the audio frame combination.
  • the weight corresponding to the audio frame may be the corresponding information amount of the audio frame in the audio frame combination to which it belongs, that is, the corresponding information amount in the predicted character mentioned above.
  • the weight determination method when all or part of an audio frame belongs to an audio frame combination has been described in detail above, and will not be repeated here.
  • the second text probability distribution $P^{\#}_1$ of the predicted character corresponding to the audio frame combination can be determined in the following manner:
  • the second text probability distribution $P^{\#}_2$ of the predicted character corresponding to the audio frame combination can be expressed as:
  • $P^{\#}_2 = W_{22} \cdot P'_2 + W_3 \cdot P'_3$.
  • the text probability distributions of the audio frames in the second probability sequence can be merged based on the information amount sequence, so that the frame-level probability distributions are converted into character-level probability distributions through the information amount of each audio frame. This realizes the conversion from audio frames to predicted characters, is applicable to speech data of any length, and helps ensure the accuracy and reliability of converting the audio frame sequence into a character sequence and the accuracy of the third probability sequence, thereby providing reliable data support for the subsequent determination of the target probability sequence and for speech recognition.
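The merging of frame-level text probability distributions into character-level ones can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `merge_frame_distributions` and the representation of an audio frame combination as a list of (frame index, information share) pairs are assumptions. The example uses hypothetical information amounts 0.6, 0.7 and 0.7 with a threshold of 1, so the second frame contributes a share of 0.4 to the first predicted character and $W_{22} = 0.3$ to the second.

```python
def merge_frame_distributions(frame_dists, groups):
    """Weighted sum of frame-level distributions per audio frame combination."""
    merged = []
    for group in groups:
        char_dist = {}
        for frame_index, weight in group:
            for ch, p in frame_dists[frame_index].items():
                char_dist[ch] = char_dist.get(ch, 0.0) + weight * p
        merged.append(char_dist)
    return merged


# Hypothetical second probability sequence P'_1, P'_2, P'_3 (null removed).
frame_dists = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.5, "b": 0.5},
    {"a": 0.2, "b": 0.8},
]
# Audio frame combinations as (frame index, information share) pairs.
groups = [[(0, 0.6), (1, 0.4)], [(1, 0.3), (2, 0.7)]]
third_sequence = merge_frame_distributions(frame_dists, groups)
```

Because the shares within each combination sum to the threshold of 1 in this example, each merged result is again a probability distribution without further normalization.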
  • the target probability sequence is determined according to the first probability sequence and the third probability sequence.
  • the first probability sequence and the third probability sequence both contain text prediction distributions for each predicted character, so a combined distribution can be determined from two probability distributions at the same level. The combined distribution includes both the information-amount-related features of each audio frame determined by the first prediction model and the text probability distribution of each audio frame determined by the second prediction model, which ensures the comprehensiveness of the features in the target probability sequence.
  • an exemplary implementation manner of determining the target probability sequence according to the first probability sequence and the third probability sequence is as follows, and this step may include:
  • the weighted sum of the first text probability distribution of the predicted character in the first probability sequence and the second text probability distribution of the predicted character in the third probability sequence is determined as the target text probability distribution of the predicted character.
  • an interpolation is computed between the first text probability distribution and the second text probability distribution of each predicted character, and for each predicted character $i$, there are:
  • the text prediction probability determined for the character level and the text prediction probability determined for the audio frame level can be combined to perform speech recognition and decoding.
  • Introducing the knowledge accumulated in multi-task learning during training can, on the one hand, significantly improve the accuracy of speech recognition at a low computational cost and, on the other hand, keep the knowledge used in speech recognition consistent with that used in training. This ensures that the accuracy of speech recognition based on the trained model matches the accuracy achieved in training, and further improves the efficiency of speech recognition and the user experience.
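The interpolation between the first and third probability sequences can be sketched as follows. The text only states that a weighted sum is taken; the single interpolation weight `lam`, with weights `lam` and `1 - lam`, is an assumption of this sketch, as are the function name `interpolate` and the dictionary representation of a distribution.

```python
def interpolate(first_seq, third_seq, lam=0.5):
    """Combine two character-level probability sequences by weighted sum."""
    if len(first_seq) != len(third_seq):
        raise ValueError("both sequences must cover the same predicted characters")
    target = []
    for p_first, p_second in zip(first_seq, third_seq):
        chars = set(p_first) | set(p_second)
        target.append({
            ch: lam * p_first.get(ch, 0.0) + (1.0 - lam) * p_second.get(ch, 0.0)
            for ch in chars
        })
    return target


# Hypothetical distributions for a single predicted character.
first_seq = [{"a": 0.8, "b": 0.2}]   # from the first prediction model
third_seq = [{"a": 0.4, "b": 0.6}]   # merged output of the second model
target_seq = interpolate(first_seq, third_seq, lam=0.75)  # a: 0.7, b: 0.3
```

With weights that sum to 1, the result remains a probability distribution whenever both inputs are.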
  • the present disclosure also provides a speech recognition device, as shown in FIG. 3 , the device 10 includes:
  • the encoding module 100 is configured to encode the received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence includes an acoustic vector of each audio frame of the speech data;
  • the first processing module 200 is configured to obtain an information amount sequence and a first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model, wherein the information amount sequence includes the information amount of each of the audio frames, and the first probability sequence includes the first text probability distribution of each predicted character corresponding to the speech data;
  • the second processing module 300 is configured to obtain a second probability sequence according to the acoustic vector sequence and the second prediction model, wherein the second probability sequence includes a text probability distribution of each of the audio frames;
  • the first determining module 400 is configured to determine a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence includes a target text probability distribution of each of the predicted characters;
  • the second determination module 500 is configured to determine the target text corresponding to the voice data according to the target probability sequence.
  • the first processing module includes:
  • a first input submodule configured to input the acoustic vector sequence into the first prediction model to obtain the information amount sequence;
  • a first merging submodule configured to merge the acoustic vectors of the audio frames in the acoustic vector sequence according to the information amount sequence to obtain a character acoustic vector sequence, wherein the character acoustic vector sequence includes the acoustic vector corresponding to each of the predicted characters;
  • the decoding submodule is configured to decode the character acoustic vector sequence to obtain the first probability sequence.
  • the second processing module includes:
  • a second input submodule configured to input the acoustic vector sequence into the second prediction model to obtain a predicted probability distribution of each audio frame;
  • a processing submodule configured to, for each audio frame, delete the probability corresponding to the preset character from the predicted probability distribution of the audio frame, and normalize the predicted probability distribution obtained after the deletion to obtain the text probability distribution of the audio frame.
  • the first determining module includes:
  • a second merging submodule configured to merge the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence, wherein the third probability sequence includes the second text probability distribution of each of the predicted characters;
  • a first determining submodule configured to determine the target probability sequence according to the first probability sequence and the third probability sequence.
  • the second merging submodule includes:
  • a grouping submodule configured to traverse the information amounts in the information amount sequence in sequence order and group the audio frames according to the cumulative sum of the information amounts to obtain a plurality of audio frame combinations, wherein the cumulative sums of the information amounts corresponding to the audio frame combinations other than the last one are the same, and each audio frame combination corresponds to one predicted character;
  • a second determining submodule configured to determine, for each audio frame combination, the weighted sum of the text probability distributions of the audio frames in the combination as the second text probability distribution of the predicted character corresponding to the combination, wherein the weight corresponding to each audio frame is determined based on the information amount of the audio frame that belongs to the combination.
  • the first determining submodule includes:
  • for each predicted character, the weighted sum of the first text probability distribution of the predicted character in the first probability sequence and the second text probability distribution of the predicted character in the third probability sequence is determined as the target probability distribution of the predicted character.
  • the first prediction model is a CIF model
  • the second prediction model is a CTC model
  • FIG. 4 shows a schematic structural diagram of an electronic device 600 (such as the terminal device or server in FIG. 1) suitable for implementing the embodiments of the present disclosure.
  • the terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (such as car navigation terminals), and fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 4 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 600 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600.
  • the processing device 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604 .
  • the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speaker and vibrator; storage devices 608 including, for example, a magnetic tape and hard disk; and a communication device 609.
  • the communication means 609 may allow the electronic device 600 to perform wireless or wired communication with other devices to exchange data. While FIG. 4 shows electronic device 600 having various means, it should be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602.
  • when the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable Programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and can send, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: encode the received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence includes the acoustic vector of each audio frame of the speech data; obtain, according to the acoustic vector sequence and the first prediction model, the information amount sequence and the first probability sequence corresponding to the speech data, wherein the information amount sequence includes the information amount of each of the audio frames, and the first probability sequence includes the first text probability distribution of each predicted character corresponding to the speech data; obtain a second probability sequence according to the acoustic vector sequence and the second prediction model, wherein the second probability sequence includes the text probability distribution of each of the audio frames; determine a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence includes a target text probability distribution of each of the predicted characters; and determine, according to the target probability sequence, the target text corresponding to the speech data.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Among them, the name of the module does not constitute a limitation of the module itself in some cases.
  • for example, the encoding module can also be described as "a module that encodes the received speech data to obtain the acoustic vector sequence corresponding to the speech data".
  • exemplary types of hardware logic components that may be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), and Complex Programmable Logic Devices (CPLDs).
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
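  • Example 1 provides a speech recognition method, the method comprising:
  • encoding the received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence includes the acoustic vector of each audio frame of the speech data;
  • obtaining, according to the acoustic vector sequence and the first prediction model, the information amount sequence and the first probability sequence corresponding to the speech data, wherein the information amount sequence includes the information amount of each audio frame, and the first probability sequence includes a first text probability distribution of each predicted character corresponding to the speech data;
  • obtaining a second probability sequence according to the acoustic vector sequence and the second prediction model, wherein the second probability sequence includes a text probability distribution of each audio frame;
  • determining a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence includes a target text probability distribution for each of the predicted characters;
  • determining the target text corresponding to the speech data according to the target probability sequence.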
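  • Example 2 provides the method of Example 1, wherein obtaining the information amount sequence and the first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model includes:
  • inputting the acoustic vector sequence into the first prediction model to obtain the information amount sequence;
  • merging the acoustic vectors of the audio frames in the acoustic vector sequence according to the information amount sequence to obtain a character acoustic vector sequence, wherein the character acoustic vector sequence includes the acoustic vector corresponding to each of the predicted characters;
  • decoding the character acoustic vector sequence to obtain the first probability sequence.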
  • Example 3 provides the method of Example 1, wherein obtaining the second probability sequence according to the acoustic vector sequence and the second prediction model includes:
  • inputting the acoustic vector sequence into the second prediction model to obtain a predicted probability distribution of each audio frame;
  • for each audio frame, deleting the probability corresponding to the preset character from the predicted probability distribution of the audio frame, and normalizing the predicted probability distribution obtained after the deletion to obtain the text probability distribution of the audio frame.
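  • Example 4 provides the method of Example 1, wherein determining the target probability sequence according to the first probability sequence and the second probability sequence includes:
  • merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence, wherein the third probability sequence includes the second text probability distribution of each of the predicted characters;
  • determining the target probability sequence according to the first probability sequence and the third probability sequence.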
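  • Example 5 provides the method of Example 4, wherein merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain the third probability sequence includes:
  • traversing the information amounts in the information amount sequence in sequence order and grouping the audio frames according to the cumulative sum of the information amounts to obtain a plurality of audio frame combinations, wherein the cumulative sums of the information amounts corresponding to the audio frame combinations other than the last one are the same, and each audio frame combination corresponds to one predicted character;
  • for each audio frame combination, determining the weighted sum of the text probability distributions of the audio frames in the combination as the second text probability distribution of the predicted character corresponding to the combination, wherein the weight corresponding to each audio frame is determined based on the information amount of the audio frame that belongs to the combination.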
  • Example 6 provides the method of Example 4, wherein determining the target probability sequence according to the first probability sequence and the third probability sequence includes:
  • for each predicted character, determining the weighted sum of the first text probability distribution of the predicted character in the first probability sequence and the second text probability distribution of the predicted character in the third probability sequence as the target probability distribution of the predicted character.
  • Example 7 provides the method described in any one of Examples 1-6, wherein the first prediction model is a CIF model, and the second prediction model is a CTC model.
  • Example 8 provides a speech recognition device, the device comprising:
  • An encoding module configured to encode the received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence includes an acoustic vector of each audio frame of the speech data;
  • a first processing module configured to obtain an information amount sequence and a first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model, wherein the information amount sequence includes the information amount of each of the audio frames, and the first probability sequence includes a first text probability distribution of each predicted character corresponding to the speech data;
  • a second processing module configured to obtain a second probability sequence according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence includes a text probability distribution of each of the audio frames;
  • a first determining module configured to determine a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence includes a target text probability distribution of each of the predicted characters;
  • a second determining module configured to determine the target text corresponding to the speech data according to the target probability sequence.
  • Example 9 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method described in any one of Examples 1-7 are implemented.
  • Example 10 provides an electronic device, comprising:
  • a storage device on which a computer program is stored;
  • a processing device configured to execute the computer program in the storage device, so as to implement the steps of the method in any one of Examples 1-7.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer Vision & Pattern Recognition (AREA)

Abstract

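The present disclosure provides a speech recognition method, device, medium and electronic device. The method includes: encoding received speech data to obtain an acoustic vector sequence corresponding to the speech data; obtaining, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data; obtaining a second probability sequence according to the acoustic vector sequence and a second prediction model; determining a target probability sequence according to the first probability sequence and the second probability sequence; and determining, according to the target probability sequence, a target text corresponding to the speech data.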

Description

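Speech recognition method, device, medium and electronic device
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202110738271.7, filed on June 30, 2021 and entitled "Speech recognition method, device, medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a speech recognition method, device, medium and electronic device.
Background
With the rise of deep learning, methods that rely entirely on neural networks for end-to-end modeling have become increasingly common. In speech recognition, the input speech data and the output text data differ in length, so recognition can be performed by using an alignment algorithm to map one sequence onto the other. In the related art, to improve the recognition accuracy of a model, the model is usually trained by multi-task learning. However, when the model is then used for speech recognition, the knowledge accumulated through multi-task learning during training cannot be exploited, and recognition based on the model struggles to reach the expected accuracy.
Technical Solution
This summary is provided to introduce, in a simplified form, concepts that are described in detail in the Detailed Description below. It is not intended to identify key or essential features of the claimed technical solution, nor to limit the scope of the claimed technical solution.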
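In a first aspect, the present disclosure provides a speech recognition method, the method including:
encoding received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence contains an acoustic vector of each audio frame of the speech data;
obtaining, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, wherein the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution of each predicted character corresponding to the speech data;
obtaining a second probability sequence according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence contains a text probability distribution of each audio frame;
determining a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence contains a target text probability distribution of each predicted character; and
determining, according to the target probability sequence, a target text corresponding to the speech data.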
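Optionally, obtaining the information amount sequence and the first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model includes:
inputting the acoustic vector sequence into the first prediction model to obtain the information amount sequence;
merging the acoustic vectors of the audio frames in the acoustic vector sequence according to the information amount sequence to obtain a character acoustic vector sequence, wherein the character acoustic vector sequence contains the acoustic vector corresponding to each predicted character; and
decoding the character acoustic vector sequence to obtain the first probability sequence.
Optionally, obtaining the second probability sequence according to the acoustic vector sequence and the second prediction model includes:
inputting the acoustic vector sequence into the second prediction model to obtain a predicted probability distribution of each audio frame; and
for each audio frame, deleting the probability corresponding to a preset character from the predicted probability distribution of the audio frame, and normalizing the predicted probability distribution obtained after the deletion to obtain the text probability distribution of the audio frame.
Optionally, determining the target probability sequence according to the first probability sequence and the second probability sequence includes:
merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence, wherein the third probability sequence contains a second text probability distribution of each predicted character; and
determining the target probability sequence according to the first probability sequence and the third probability sequence.
Optionally, merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain the third probability sequence includes:
traversing the information amounts in the information amount sequence in sequence order, and grouping the audio frames according to the cumulative sum of the information amounts to obtain a plurality of audio frame combinations, wherein the cumulative sums of the information amounts corresponding to the audio frame combinations other than the last one are the same, and each audio frame combination corresponds to one predicted character; and
for each audio frame combination, determining the weighted sum of the text probability distributions of the audio frames in the combination as the second text probability distribution of the predicted character corresponding to the combination, wherein the weight corresponding to each audio frame is determined based on the information amount of the audio frame that belongs to the combination.
Optionally, determining the target probability sequence according to the first probability sequence and the third probability sequence includes:
for each predicted character, determining the weighted sum of the first text probability distribution of the predicted character in the first probability sequence and the second text probability distribution of the predicted character in the third probability sequence as the target probability distribution of the predicted character.
Optionally, the first prediction model is a CIF model and the second prediction model is a CTC model.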
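In a second aspect, a speech recognition device is provided, the device including:
an encoding module configured to encode received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence contains an acoustic vector of each audio frame of the speech data;
a first processing module configured to obtain, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, wherein the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution of each predicted character corresponding to the speech data;
a second processing module configured to obtain a second probability sequence according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence contains a text probability distribution of each audio frame;
a first determining module configured to determine a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence contains a target text probability distribution of each predicted character; and
a second determining module configured to determine, according to the target probability sequence, a target text corresponding to the speech data.
In a third aspect, a computer-readable medium is provided, on which a computer program is stored; when the program is executed by a processing device, the steps of any method of the first aspect are implemented.
In a fourth aspect, an electronic device is provided, including:
a storage device on which a computer program is stored; and
a processing device configured to execute the computer program in the storage device to implement the steps of any method of the first aspect.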
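In the above technical solution, the received speech data is encoded to obtain the corresponding acoustic vector sequence; a first probability sequence and a second probability sequence are then obtained from the acoustic vector sequence using the first prediction model and the second prediction model respectively; and the two sequences are combined into a single target probability sequence, from which the target text corresponding to the speech data is determined. The target probability sequence used for recognition is thus determined from the probability sequences output by the multiple prediction models that correspond to the multi-task learning performed during training, so that recognition and decoding draw on the knowledge accumulated through multi-task learning, which significantly improves the accuracy and efficiency of speech recognition and the user experience.
Other features and advantages of the present disclosure are described in detail in the Detailed Description below.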
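Brief Description of the Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, identical or similar reference numerals denote identical or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a speech recognition method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of an exemplary implementation of obtaining the information amount sequence and the first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model;
FIG. 3 is a block diagram of a speech recognition device according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.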
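Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the drawings. Although certain embodiments are shown in the drawings, it should be understood that the disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. The drawings and embodiments are for illustrative purposes only and are not intended to limit the scope of protection of the disclosure.
It should be understood that the steps described in the method embodiments may be performed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; "another embodiment" means "at least one further embodiment"; "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
Note that terms such as "first" and "second" in this disclosure are used only to distinguish different devices, modules or units, and not to limit the order of, or the interdependence between, the functions they perform.
Note that the modifiers "a/one" and "a plurality of" in this disclosure are illustrative rather than restrictive; those skilled in the art should understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between multiple devices in the embodiments of the disclosure are for illustrative purposes only and are not intended to limit the scope of those messages or information.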
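FIG. 1 is a flowchart of a speech recognition method according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include the following steps.
In step 11, received speech data is encoded to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence contains an acoustic vector of each audio frame of the speech data.
The received speech data may be encoded by a pre-trained shared encoder, which converts the speech data into an acoustic vector representation, i.e., the acoustic vector sequence. Typically, each second of speech data is split into multiple audio frames and processing is performed per frame; for example, each second of speech data may be split into 100 audio frames. Accordingly, the acoustic vector sequence $H$ obtained by encoding the audio frames of the speech data with the shared encoder can be expressed as:
$H = \{H_1, H_2, \dots, H_U\}$, where $U$ is the number of audio frames from the start to the end of the speech in the speech data, i.e., the length of the acoustic vector sequence.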
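In step 12, an information amount sequence and a first probability sequence corresponding to the speech data are obtained according to the acoustic vector sequence and a first prediction model, wherein the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution of each predicted character corresponding to the speech data.
As noted above, each second of speech data may be split into 100 audio frames; the information amount of an audio frame indicates how much information that frame contains. In the embodiments of the present disclosure, every predicted character is assumed by default to contain the same amount of information, so the information amounts of the audio frames can be accumulated from left to right along the information amount sequence to determine which audio frames correspond to one predicted character, and the first probability sequence can then be obtained from the acoustic vector of each predicted character.
For example, the first prediction model may be a CIF (Continuous Integrate-and-Fire) model, and the information amount sequence $W$ it determines can be expressed as:
$W = \{W_1, W_2, \dots, W_U\}$.
The first probability sequence $P^*$ can be expressed as:
$P^* = \{P^*_1, P^*_2, \dots, P^*_M\}$, where $M$ is the total number of predicted characters determined.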
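In step 13, a second probability sequence is obtained according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence contains a text probability distribution of each audio frame.
For example, the second prediction model may be a CTC (Connectionist Temporal Classification) model, which can be understood as neural-network-based temporal classification.
As one example, the shared encoder, the first prediction model and the second prediction model may be trained separately, and the trained models are then used to obtain the acoustic vector sequence, the information amount sequence, the first probability sequence and the second probability sequence respectively.
As another example, the shared encoder, the first prediction model and the second prediction model may be trained jointly end to end: training data is input into the shared encoder, the vectors output by the shared encoder are input into the first prediction model and the second prediction model respectively, the output of the first prediction model is decoded to obtain the model output, and the end-to-end model is trained by multi-task learning based on the losses of the first and second prediction models. Obtaining the shared encoder and the two prediction models through such end-to-end training ensures that their parameters are well matched to one another.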
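In step 14, a target probability sequence is determined according to the first probability sequence and the second probability sequence, wherein the target probability sequence contains a target text probability distribution of each predicted character.
For example, the first probability sequence contains the probability distributions determined for the speech data by the first prediction model, and the second probability sequence contains those determined by the second prediction model. In this step the distributions determined by the two prediction models are considered together, which improves the accuracy of the target probability sequence and provides data support for the subsequent speech recognition.
In step 15, the target text corresponding to the speech data is determined according to the target probability sequence.
The target probability sequence contains the target text probability distribution of each predicted character. As one example, a greedy search algorithm may be used: for the target text probability distribution of the first predicted character, the token with the highest probability is taken as the recognized character of that predicted character; the same is then done for the second and each subsequent predicted character, and the target text is generated from the recognized characters.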
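The greedy search described above can be sketched in a few lines. This is only an illustrative sketch: the function name `greedy_decode`, the representation of each distribution as a dictionary from token to probability, and the example values are assumptions, not part of the disclosure.

```python
def greedy_decode(target_probs):
    """Return the text formed by the most probable token at each position."""
    return "".join(max(dist, key=dist.get) for dist in target_probs)


# Hypothetical target probability sequence for two predicted characters.
target_probs = [
    {"h": 0.7, "i": 0.2, "o": 0.1},
    {"h": 0.1, "i": 0.8, "o": 0.1},
]
print(greedy_decode(target_probs))
```

Each position is decided independently, so the cost is linear in the number of predicted characters.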
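As another example, a beam search algorithm may be used: for the target text probability distribution of the first predicted character, the top N tokens in descending order of probability are taken as candidate recognized characters; then, for the target text probability distribution of the second predicted character, N candidates are determined in combination with the probabilities of the preceding candidates, and so on for subsequent predicted characters, so as to determine the target text with the highest probability for the entire speech data.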
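A minimal sketch of the beam search just described, under the assumption that each hypothesis is scored by the sum of the log-probabilities of its tokens. The name `beam_search`, the `beam_size` parameter (the N of the text) and the example values are hypothetical.

```python
import math


def beam_search(target_probs, beam_size=2):
    """Keep the beam_size best partial texts after each predicted character."""
    beams = [("", 0.0)]  # (text so far, cumulative log-probability)
    for dist in target_probs:
        candidates = [
            (text + token, score + math.log(p))
            for text, score in beams
            for token, p in dist.items()
            if p > 0.0  # skip zero-probability tokens to avoid log(0)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Hypothetical target probability sequence for two predicted characters.
example = [{"a": 0.6, "b": 0.4}, {"c": 0.9, "d": 0.1}]
for text, score in beam_search(example, beam_size=2):
    print(text, round(math.exp(score), 2))
```

When the score uses only these per-position distributions, the best hypothesis coincides with the greedy result; the remaining beams provide ranked alternatives.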
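In the above technical solution, the received speech data is encoded to obtain the corresponding acoustic vector sequence; a first probability sequence and a second probability sequence are then obtained from the acoustic vector sequence using the first prediction model and the second prediction model respectively; and the two sequences are combined into a single target probability sequence, from which the target text corresponding to the speech data is determined. The target probability sequence used for recognition is thus determined from the probability sequences output by the multiple prediction models that correspond to the multi-task learning performed during training, so that recognition and decoding draw on the knowledge accumulated through multi-task learning, which significantly improves the accuracy and efficiency of speech recognition and the user experience.
In a possible embodiment, an exemplary implementation of step 12, obtaining the information amount sequence and the first probability sequence corresponding to the speech data according to the acoustic vector sequence and the first prediction model, is as follows. As shown in FIG. 2, this step may include:
In step 21, the acoustic vector sequence is input into the first prediction model to obtain the information amount sequence.
For example, the acoustic vector sequence may be input into the first prediction model, which predicts an information amount for each acoustic vector in the sequence. To compute the information amount for the acoustic vector of each audio frame, a window centered on the acoustic vector $H_u$ of the audio frame may be fed to a one-dimensional convolutional layer and then to a sigmoid-activated fully connected layer and on to the output unit, yielding the information amount $W_u$ of that audio frame and hence the information amount sequence.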
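To make the data flow of this predictor concrete, the following is a deliberately simplified sketch: it assumes a single-channel convolution with zero padding followed by a scalar fully connected layer, whereas a real model would use learned multi-channel layers. The names `information_amounts`, `kernel`, `fc_weight` and `fc_bias` and all numeric values are hypothetical.

```python
import math


def information_amounts(H, kernel, fc_weight, fc_bias):
    """Map each frame's acoustic vector H[u] to an information amount in (0, 1).

    H: list of U acoustic vectors, each of dimension D.
    kernel: K x D weights of a one-channel 1-D convolution over a window
            centered on frame u (zero padding outside the sequence).
    """
    U, K, D = len(H), len(kernel), len(H[0])
    half = K // 2
    amounts = []
    for u in range(U):
        conv = 0.0
        for k in range(K):
            t = u + k - half
            if 0 <= t < U:
                conv += sum(kernel[k][d] * H[t][d] for d in range(D))
        logit = fc_weight * conv + fc_bias
        amounts.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return amounts


# Hypothetical 3-frame sequence of 2-dimensional acoustic vectors.
H = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8]]
kernel = [[0.5, -0.2], [1.0, 0.3], [0.1, 0.4]]
print(information_amounts(H, kernel, fc_weight=1.5, fc_bias=-0.5))
```

The sigmoid keeps every information amount strictly between 0 and 1, which is what allows several frames to accumulate toward one character.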
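In step 22, the acoustic vectors of the audio frames in the acoustic vector sequence are merged according to the information amount sequence to obtain a character acoustic vector sequence, wherein the character acoustic vector sequence contains the acoustic vector corresponding to each predicted character.
As noted above, every predicted character is assumed to correspond to the same information amount. The information amounts in the sequence can therefore be accumulated from left to right; when the accumulated amount reaches a preset threshold, the audio frames contributing to that accumulated amount are considered to form one predicted character. One predicted character corresponds to one or more audio frames. The preset threshold may be set according to the application scenario and experience; for example, it may be set to 1, which the present disclosure does not limit.
In a possible embodiment, the acoustic vectors of the audio frames in the acoustic vector sequence may be merged according to the information amount sequence as follows:
Following the order of the information amount sequence, the information amount $W_i$ of audio frame $i$ is obtained in turn.
If $W_i$ is less than the preset threshold $\beta$, the next audio frame is taken as the current audio frame, i.e., $i = i + 1$, and the information amounts of the traversed audio frames are accumulated. If the cumulative sum exceeds the preset threshold, a character boundary is considered to have occurred: part of the currently traversed audio frame belongs to the current predicted character and the rest belongs to the next predicted character.
For example, if $W_1 + W_2$ is greater than $\beta$, a character boundary has occurred: the first audio frame and part of the second audio frame correspond to one predicted character, and the boundary of that character lies within the second audio frame. The information amount of the second audio frame is then split into two parts, one belonging to the current predicted character and the remainder belonging to the next predicted character.
Accordingly, the part $W_{21}$ of the information amount $W_2$ of the second audio frame that belongs to the current predicted character can be expressed as $W_{21} = \beta - W_1$, and the part $W_{22}$ that belongs to the next predicted character as $W_{22} = W_2 - W_{21}$.
The traversal then continues, with accumulation resuming from the remaining part of the second audio frame: $W_{22}$ and the information amount $W_3$ of the third audio frame are accumulated, and so on until the threshold $\beta$ is reached, which yields the audio frames of the next predicted character. Subsequent audio frames are handled in the same way, giving the predicted characters corresponding to the audio frames.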
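On this basis, once the correspondence between predicted characters and audio frames has been determined, the acoustic vector of each predicted character can be determined as the weighted sum of the acoustic vectors of its audio frames, where the weight of each audio frame's acoustic vector is the information amount that the frame contributes to the predicted character. If the audio frame belongs entirely to the predicted character, the weight is the frame's information amount; if only part of the frame belongs to the predicted character, the weight is the information amount of that part.
Continuing the example above, the first predicted character covers the first audio frame and part of the second audio frame, so its acoustic vector $C_1$ can be expressed as:
$C_1 = W_1 H_1 + W_{21} H_2$
Likewise, the second predicted character covers part of the second audio frame and the third audio frame, so its acoustic vector $C_2$ can be expressed as:
$C_2 = W_{22} H_2 + W_3 H_3$
In this way, the acoustic vectors of the audio frames in the acoustic vector sequence are merged according to the information amount sequence to obtain the character acoustic vector sequence, which facilitates processing of each predicted character.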
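The accumulate-and-split procedure and the weighted sums $C_1$, $C_2$ above can be sketched as follows. This is a sketch under stated assumptions: a boundary is placed as soon as the cumulative sum reaches the threshold (including exact equality), a trailing group whose sum stays below the threshold is kept as the last group, and the names `cif_segments`, `merge_vectors` and the tolerance `eps` are hypothetical.

```python
def cif_segments(weights, beta=1.0, eps=1e-9):
    """Group frames into characters by accumulating information amounts.

    Returns one list per predicted character; each entry is
    (frame_index, share), the part of that frame's information amount
    assigned to the character.
    """
    segments, current, acc = [], [], 0.0
    for u, w in enumerate(weights):
        remaining = w
        while acc + remaining >= beta - eps:  # boundary falls inside frame u
            share = beta - acc                # e.g. W21 = beta - W1
            current.append((u, share))
            segments.append(current)
            current, acc = [], 0.0
            remaining -= share                # e.g. W22 = W2 - W21
        if remaining > eps:
            current.append((u, remaining))
            acc += remaining
    if current:
        segments.append(current)              # last group may sum to < beta
    return segments


def merge_vectors(H, segments):
    """Character acoustic vector = share-weighted sum of frame vectors."""
    dim = len(H[0])
    return [
        [sum(share * H[u][d] for u, share in seg) for d in range(dim)]
        for seg in segments
    ]


# Hypothetical example: W1 + W2 = 1.3 > beta, so frame 2 is split 0.4 / 0.3.
W = [0.6, 0.7, 0.7]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segments = cif_segments(W, beta=1.0)
print(segments)
print(merge_vectors(H, segments))
```

With these values the first character is built from frame 1 and 0.4 of frame 2, and the second from the remaining 0.3 of frame 2 plus frame 3, mirroring the $C_1$ and $C_2$ formulas.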
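In step 23, the character acoustic vector sequence is decoded to obtain the first probability sequence.
For example, the character acoustic vector of each predicted character obtained as described above can be decoded by a decoder to obtain the first text probability distribution of each predicted character, i.e., the probabilities of recognizing the predicted character as each candidate character.
Thus, with the above technical solution, the acoustic vectors of the audio frames are merged based on the information amount of each frame to obtain a character acoustic vector for each predicted character, mapping speech data represented at the audio-frame level to a character-level representation. The method is therefore applicable to speech data of any length, which broadens its range of use. Moreover, the merging is performed by a weighted sum and requires no complex computation, which simplifies the method, improves processing efficiency, and provides effective data support for the subsequent character determination.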
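In a possible embodiment, an exemplary implementation of step 13, obtaining the second probability sequence according to the acoustic vector sequence and the second prediction model, is as follows. This step may include:
inputting the acoustic vector sequence into the second prediction model to obtain the predicted probability distribution of each audio frame.
The second prediction model may be a CTC model, which can determine a text sequence of arbitrary length for an acoustic vector sequence of a given length. In this model, an input acoustic vector sequence has a corresponding alignment sequence of the same length, which is then mapped to the text sequence. Accordingly, in the embodiments of the present disclosure, the probability distribution at each position, taken before the acoustic vector sequence is mapped to an alignment sequence, may be determined as the predicted probability distribution of the audio frame at that position.
For each audio frame, the probability corresponding to the preset character is deleted from the predicted probability distribution of the audio frame, and the predicted probability distribution obtained after the deletion is normalized to obtain the text probability distribution of the audio frame.
To ensure that consecutive identical characters are merged correctly when the alignment sequence is mapped to the output text sequence, the CTC model introduces a blank character. The blank character has no meaning and is removed when mapping to the output text sequence. When repeated characters are merged in the CTC model, consecutive repeated characters between blank characters are merged, whereas repeated characters separated by a blank character are not, which ensures the accuracy of the recognized text.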
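The role of the blank character can be illustrated with the standard CTC collapsing rule: merge consecutive repeats, then drop blanks. In this sketch the blank is written as `_` and the function name `ctc_collapse` is an assumption chosen for illustration.

```python
def ctc_collapse(alignment, blank="_"):
    """Map a frame-level alignment to text: merge repeats, then drop blanks."""
    out, prev = [], None
    for sym in alignment:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)


print(ctc_collapse("hh_e_ll_lo"))  # the blank between the two l-runs keeps both
print(ctc_collapse("aa_a"))
```

Without the blank between the two runs of `l`, the repeated letter would be merged into a single character.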
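In the embodiments of the present disclosure, the first prediction model has no predicted probability for the blank character, so the predicted probability distribution of the second prediction model may be processed as follows to keep the prediction results of the two models consistent. For each audio frame, the probability corresponding to the blank character is deleted from the frame's probability distribution, leaving the distribution over real characters. After this deletion, the remaining probabilities of different frames do not necessarily sum to the same value, so the predicted probability distribution obtained after deleting the preset character's probability is normalized.
For example, the predicted probability distribution of audio frame $K$ is $\{\epsilon : p_1;\ s_1 : p_2;\ s_2 : p_3;\ \dots;\ s_{n-1} : p_n\}$,
where $p_1, p_2, \dots, p_n$ sum to 1 and each audio frame corresponds to $n$ character dimensions, namely one blank character $\epsilon$ and $n-1$ real characters. The entry $\epsilon : p_1$ is deleted from the predicted probability distribution, and the remaining probabilities for the real characters are normalized:
$p'_i = p_i / (1 - p_1), \quad i = 2, 3, \dots, n$.
The normalized probabilities $p'_2, \dots, p'_n$ form the text probability distribution $P'_K$ of audio frame $K$, and the second probability sequence $P'$ can then be expressed as $P' = \{P'_1, P'_2, \dots, P'_U\}$, with one distribution per audio frame.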
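The deletion and renormalization above amount to dividing every non-blank probability by $1 - p_1$. A minimal sketch follows; the name `drop_blank`, the `_` blank symbol and the example values are hypothetical, and raising an error when a frame puts all of its probability on the blank is an assumption, since the text does not say how that case is handled.

```python
def drop_blank(frame_dist, blank="_"):
    """Remove the blank entry and renormalize the remaining probabilities."""
    kept = {s: p for s, p in frame_dist.items() if s != blank}
    total = sum(kept.values())  # equals 1 - p_blank for a proper distribution
    if total <= 0.0:
        # Not covered by the text: a frame with all mass on the blank.
        raise ValueError("frame has no probability mass on real characters")
    return {s: p / total for s, p in kept.items()}


# Hypothetical frame: blank 0.5, real characters 0.3 and 0.2.
print(drop_blank({"_": 0.5, "a": 0.3, "b": 0.2}))
```

Here the real characters end up with probabilities 0.3 / 0.5 and 0.2 / 0.5, which again sum to 1.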
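Thus, with the above technical solution, the predicted probability distribution of each audio frame is obtained from the second prediction model and then processed to delete the invalid character, giving each frame's distribution over real characters. This keeps the characters consistent with those in the first probability distributions obtained from the first prediction model, provides a uniform standard for the subsequent recognition based on the first and second probability sequences, ensures that recognition uses the same basis, and thereby improves recognition accuracy to some extent.
In a possible embodiment, an exemplary implementation of determining the target probability sequence according to the first probability sequence and the second probability sequence is as follows. This step may include:
merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence, wherein the third probability sequence contains a second text probability distribution of each predicted character.
As described above, the first probability sequence contains the first text probability distribution of each predicted character, while the second probability sequence contains the text probability distribution of each audio frame. To bring the two to the same level of representation, the text probability distributions of the audio frames in the second probability sequence may be merged based on the information amount sequence, converting the second probability sequence into probability distributions at the predicted-character level, i.e., the third probability sequence.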
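For example, an exemplary implementation of merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain the third probability sequence is as follows. This step may include:
traversing the information amounts in the information amount sequence in sequence order, and grouping the audio frames according to the cumulative sum of the information amounts to obtain a plurality of audio frame combinations, wherein the cumulative sums of the information amounts corresponding to the audio frame combinations other than the last one are the same, and each audio frame combination corresponds to one predicted character.
In this step, the audio frames belonging to the same audio frame combination can be determined in the same way as described above for merging the acoustic vectors of the audio frames according to the information amount sequence, which is not repeated here.
For each audio frame combination, the weighted sum of the text probability distributions of the audio frames in the combination is determined as the second text probability distribution of the predicted character corresponding to the combination, wherein the weight corresponding to each audio frame is determined based on the information amount of the audio frame that belongs to the combination.
After the audio frame combinations are determined, the second text probability distribution of the predicted character corresponding to each combination can be determined from the weights of the audio frames in the combination. For example, the weight of an audio frame may be the information amount it contributes to the combination to which it belongs, i.e., the information amount it contributes to the predicted character as described above. How the weight is determined when an audio frame belongs wholly or partly to a combination has been detailed above and is not repeated here.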
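Continuing the example above, for the first audio frame combination, the second text probability distribution $P^{\#}_1$ of the corresponding predicted character can be determined as:
$P^{\#}_1 = W_1 P'_1 + W_{21} P'_2$
Likewise, the second audio frame combination covers part of the second audio frame and the third audio frame, so the second text probability distribution $P^{\#}_2$ of the corresponding predicted character can be expressed as:
$P^{\#}_2 = W_{22} P'_2 + W_3 P'_3$
Other audio frame combinations are processed in the same way, giving the second text probability distribution of the predicted character corresponding to each combination. The resulting third probability sequence $P^{\#}$ can be expressed as:
$P^{\#} = \{P^{\#}_1, P^{\#}_2, \dots, P^{\#}_M\}$.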
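The conversion from frame-level to character-level distributions can be sketched as a share-weighted sum per audio frame combination. In this sketch each combination is given as a list of `(frame_index, share)` pairs, where the share is the part of the frame's information amount assigned to that combination; this representation, the name `merge_frame_distributions` and the numeric values are assumptions for illustration.

```python
def merge_frame_distributions(frame_dists, segments):
    """Character-level distribution = share-weighted sum of frame distributions."""
    merged = []
    for seg in segments:
        dist = {}
        for u, share in seg:
            for sym, p in frame_dists[u].items():
                dist[sym] = dist.get(sym, 0.0) + share * p
        merged.append(dist)
    return merged


# Hypothetical blank-free frame distributions P'_1, P'_2, P'_3.
frame_dists = [
    {"a": 1.0, "b": 0.0},
    {"a": 0.5, "b": 0.5},
    {"a": 0.0, "b": 1.0},
]
# Shares: W1 = 0.6, W21 = 0.4 for character 1; W22 = 0.3, W3 = 0.7 for character 2.
segments = [[(0, 0.6), (1, 0.4)], [(1, 0.3), (2, 0.7)]]
print(merge_frame_distributions(frame_dists, segments))
```

Because the shares of a full combination add up to the threshold (1 in this example) and each frame distribution sums to 1, each merged distribution also sums to 1.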
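Thus, with the above technical solution, the text probability distributions of the audio frames in the second probability sequence are merged based on the information amount sequence: the information amount of each frame is used to convert frame-level probability distributions into predicted-character-level distributions. This converts audio frames to predicted characters, is applicable to speech data of any length, ensures the accuracy and reliability of the conversion from the audio frame sequence to the character sequence and hence the accuracy of the third probability sequence, and provides reliable data support for subsequently determining the target probability sequence and performing speech recognition.
Then, the target probability sequence is determined according to the first probability sequence and the third probability sequence.
In this embodiment, both the first probability sequence and the third probability sequence consist of text prediction distributions for each predicted character, so a combined distribution can be determined from two distributions at the same level. The combined distribution reflects both the information-amount-related features of each audio frame determined by the first prediction model and the text probability distribution of each audio frame determined by the second prediction model, ensuring that the target probability sequence is comprehensive.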
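In a possible embodiment, an exemplary implementation of determining the target probability sequence according to the first probability sequence and the third probability sequence is as follows. This step may include:
for each predicted character, determining the weighted sum of the first text probability distribution of the predicted character in the first probability sequence and the second text probability distribution of the predicted character in the third probability sequence as the target probability distribution of the predicted character.
In this embodiment, the first and second text probability distributions of each output predicted character are interpolated. For each predicted character $i$, with $\lambda$ denoting the weight applied to the second text probability distribution:
$P_i = P^*_i + \lambda P^{\#}_i$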
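The interpolation can be sketched directly from the formula. The function name `combine`, the default value `lam=0.5` and the example distributions are hypothetical; the text does not state a value for $\lambda$.

```python
def combine(first_probs, third_probs, lam=0.5):
    """Per-character interpolation P_i = P*_i + lam * P#_i."""
    target = []
    for p_star, p_hash in zip(first_probs, third_probs):
        symbols = set(p_star) | set(p_hash)
        target.append(
            {s: p_star.get(s, 0.0) + lam * p_hash.get(s, 0.0) for s in symbols}
        )
    return target


# Hypothetical distributions for a single predicted character.
first_probs = [{"a": 0.6, "b": 0.4}]   # from the first prediction model
third_probs = [{"a": 0.2, "b": 0.8}]   # merged from the second prediction model
target = combine(first_probs, third_probs, lam=0.5)
print(max(target[0], key=target[0].get))
```

In this example the first model alone would pick `a`, while the combined scores (0.7 for `a`, 0.8 for `b`) pick `b`. The combined scores sum to $1 + \lambda$ rather than 1, which does not change the ranking of tokens within a position.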
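Thus, with the above technical solution, when input speech data is recognized, the text prediction probabilities determined at the character level are combined with those determined at the audio-frame level, which brings the knowledge accumulated through multi-task learning during training into the decoding process. On the one hand, this significantly improves recognition accuracy at a low computational cost; on the other hand, it keeps the knowledge used during recognition consistent with that used during training, so that the accuracy of recognition with the trained model matches the accuracy in training, further improving the efficiency of speech recognition and the user experience.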
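The present disclosure also provides a speech recognition device. As shown in FIG. 3, the device 10 includes:
an encoding module 100 configured to encode received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence contains an acoustic vector of each audio frame of the speech data;
a first processing module 200 configured to obtain, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, wherein the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution of each predicted character corresponding to the speech data;
a second processing module 300 configured to obtain a second probability sequence according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence contains a text probability distribution of each audio frame;
a first determining module 400 configured to determine a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence contains a target text probability distribution of each predicted character; and
a second determining module 500 configured to determine, according to the target probability sequence, a target text corresponding to the speech data.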
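Optionally, the first processing module includes:
a first input submodule configured to input the acoustic vector sequence into the first prediction model to obtain the information amount sequence;
a first merging submodule configured to merge the acoustic vectors of the audio frames in the acoustic vector sequence according to the information amount sequence to obtain a character acoustic vector sequence, wherein the character acoustic vector sequence contains the acoustic vector corresponding to each predicted character; and
a decoding submodule configured to decode the character acoustic vector sequence to obtain the first probability sequence.
Optionally, the second processing module includes:
a second input submodule configured to input the acoustic vector sequence into the second prediction model to obtain a predicted probability distribution of each audio frame; and
a processing submodule configured to, for each audio frame, delete the probability corresponding to a preset character from the predicted probability distribution of the audio frame, and normalize the predicted probability distribution obtained after the deletion to obtain the text probability distribution of the audio frame.
Optionally, the first determining module includes:
a second merging submodule configured to merge the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence, wherein the third probability sequence contains a second text probability distribution of each predicted character; and
a first determining submodule configured to determine the target probability sequence according to the first probability sequence and the third probability sequence.
Optionally, the second merging submodule includes:
a grouping submodule configured to traverse the information amounts in the information amount sequence in sequence order and group the audio frames according to the cumulative sum of the information amounts to obtain a plurality of audio frame combinations, wherein the cumulative sums of the information amounts corresponding to the audio frame combinations other than the last one are the same, and each audio frame combination corresponds to one predicted character; and
a second determining submodule configured to determine, for each audio frame combination, the weighted sum of the text probability distributions of the audio frames in the combination as the second text probability distribution of the predicted character corresponding to the combination, wherein the weight corresponding to each audio frame is determined based on the information amount of the audio frame that belongs to the combination.
Optionally, the first determining submodule is configured to:
for each predicted character, determine the weighted sum of the first text probability distribution of the predicted character in the first probability sequence and the second text probability distribution of the predicted character in the third probability sequence as the target probability distribution of the predicted character.
Optionally, the first prediction model is a CIF model and the second prediction model is a CTC model.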
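Reference is now made to FIG. 4, which shows a schematic structural diagram of an electronic device 600 (for example, the terminal device or server in FIG. 1) suitable for implementing the embodiments of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 4, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speaker and vibrator; storage devices 608 including, for example, a magnetic tape and hard disk; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows the electronic device 600 with various devices, it should be understood that not all of the devices shown need to be implemented or present; more or fewer devices may alternatively be implemented or present.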
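In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in conjunction with an instruction execution system, apparatus or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transport a program for use by or in conjunction with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), or any suitable combination thereof.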
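In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.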
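The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: encode received speech data to obtain an acoustic vector sequence corresponding to the speech data, where the acoustic vector sequence contains an acoustic vector for each audio frame of the speech data; obtain, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, where the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution for each predicted character corresponding to the speech data; obtain a second probability sequence according to the acoustic vector sequence and a second prediction model, where the second probability sequence contains a text probability distribution for each audio frame; determine a target probability sequence according to the first probability sequence and the second probability sequence, where the target probability sequence contains a target text probability distribution for each predicted character; and determine, according to the target probability sequence, the target text corresponding to the speech data.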
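Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).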
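The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.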
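The modules described in embodiments of the present disclosure may be implemented in software or in hardware. In some cases the name of a module does not limit the module itself; for example, the encoding module may also be described as "a module that encodes received speech data to obtain an acoustic vector sequence corresponding to the speech data".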
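The functions described above herein may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and so on.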
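In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.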
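According to one or more embodiments of the present disclosure, Example 1 provides a speech recognition method, the method including:
Encoding received speech data to obtain an acoustic vector sequence corresponding to the speech data, where the acoustic vector sequence contains an acoustic vector for each audio frame of the speech data;
Obtaining, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, where the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution for each predicted character corresponding to the speech data;
Obtaining a second probability sequence according to the acoustic vector sequence and a second prediction model, where the second probability sequence contains a text probability distribution for each audio frame;
Determining a target probability sequence according to the first probability sequence and the second probability sequence, where the target probability sequence contains a target text probability distribution for each predicted character;
Determining, according to the target probability sequence, the target text corresponding to the speech data.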
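As an illustration of the final step only, the following minimal sketch turns a target probability sequence into text by picking the most probable character at each position. Greedy selection is an assumption made here for simplicity; the text above does not fix the decoding strategy, and a beam search could be substituted. The function name `greedy_decode` and the three-character vocabulary are hypothetical.

```python
def greedy_decode(target_probs, vocab):
    """Pick the highest-probability character for each predicted position.

    target_probs: list of per-character distributions (one list of floats each).
    vocab: list of characters, aligned with the distribution indices.
    Greedy argmax is an assumption of this sketch, not a stated requirement.
    """
    chars = []
    for dist in target_probs:
        best = max(range(len(dist)), key=dist.__getitem__)
        chars.append(vocab[best])
    return "".join(chars)


# Hypothetical vocabulary and target probability sequence with two positions.
vocab = ["你", "好", "吗"]
target_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
text = greedy_decode(target_probs, vocab)
```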
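According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, where obtaining, according to the acoustic vector sequence and the first prediction model, the information amount sequence and the first probability sequence corresponding to the speech data includes:
Inputting the acoustic vector sequence into the first prediction model to obtain the information amount sequence;
Merging the acoustic vectors of the audio frames in the acoustic vector sequence according to the information amount sequence to obtain a character acoustic vector sequence, where the character acoustic vector sequence contains the acoustic vector corresponding to each predicted character;
Decoding the character acoustic vector sequence to obtain the first probability sequence.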
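According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, where obtaining the second probability sequence according to the acoustic vector sequence and the second prediction model includes:
Inputting the acoustic vector sequence into the second prediction model to obtain a predicted probability distribution for each audio frame;
For each audio frame, deleting the probability corresponding to a preset character from the predicted probability distribution of that audio frame, and normalizing the resulting predicted probability distribution to obtain the text probability distribution of that audio frame.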
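The delete-and-normalize step can be sketched in a few lines of Python. This is an illustrative sketch, not the disclosed implementation: the function name, the vocabulary layout, and the uniform fallback for the degenerate case are assumptions, and treating index 0 as the preset character (for a CTC model this would typically be the blank symbol) is likewise hypothetical.

```python
def drop_preset_and_renormalize(frame_probs, preset_index):
    """Delete the probability at preset_index and rescale the rest to sum to 1."""
    kept = [p for i, p in enumerate(frame_probs) if i != preset_index]
    total = sum(kept)
    if total <= 0.0:
        # Degenerate case (assumption): all mass was on the preset character,
        # so fall back to a uniform distribution over the remaining characters.
        return [1.0 / len(kept)] * len(kept)
    return [p / total for p in kept]


# Hypothetical vocabulary order: [<preset>, "a", "b", "c"]
frame = [0.6, 0.2, 0.1, 0.1]
text_probs = drop_preset_and_renormalize(frame, preset_index=0)
```

After this step the per-frame distribution covers only real characters, which is what allows it to be combined later with the per-character distributions of the first probability sequence.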
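According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 1, where determining the target probability sequence according to the first probability sequence and the second probability sequence includes:
Merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence, where the third probability sequence contains a second text probability distribution for each predicted character;
Determining the target probability sequence according to the first probability sequence and the third probability sequence.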
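According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, where merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain the third probability sequence includes:
Traversing the information amounts in the information amount sequence in sequence order and grouping the audio frames according to the cumulative sum of the information amounts to obtain a plurality of audio frame groups, where every audio frame group other than the last one has the same cumulative sum of information amounts, and each audio frame group corresponds to one predicted character;
For each audio frame group, determining the weighted sum of the text probability distributions of the audio frames in that group as the second text probability distribution of the predicted character corresponding to that group, where the weight of each audio frame is determined based on the portion of that frame's information amount that belongs to the group.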
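The grouping and weighted-sum steps can be sketched as follows. This is a minimal sketch under stated assumptions: the threshold value of 1.0, the function names `group_frames` and `merge_distributions`, and the choice to divide each weight by the group's total (so that the last, possibly incomplete, group still yields a distribution summing to 1) are all illustrative and not taken from the text. A frame whose information amount crosses the threshold is split, so that it contributes to two adjacent groups.

```python
def group_frames(info, threshold=1.0, eps=1e-9):
    """Split frames into groups whose information amounts sum to `threshold`.

    Returns a list of groups; each group is a list of (frame_index, weight),
    where weight is the part of that frame's information assigned to the group.
    """
    groups, current, acc = [], [], 0.0
    for i, amount in enumerate(info):
        remaining = amount
        # A frame may complete the current group and spill into the next one.
        while acc + remaining >= threshold - eps:
            used = threshold - acc
            current.append((i, used))
            groups.append(current)
            current, acc = [], 0.0
            remaining -= used
        if remaining > eps:
            current.append((i, remaining))
            acc += remaining
    if current:
        groups.append(current)  # the last group may sum to less than threshold
    return groups


def merge_distributions(frame_dists, groups):
    """Weighted sum of per-frame distributions inside each group."""
    merged = []
    for group in groups:
        total = sum(w for _, w in group)
        dist = [0.0] * len(frame_dists[0])
        for idx, w in group:
            for k, p in enumerate(frame_dists[idx]):
                dist[k] += (w / total) * p  # normalizing by total is an assumption
        merged.append(dist)
    return merged


# Hypothetical information amounts and two-character text distributions.
info = [0.3, 0.5, 0.4, 0.6, 0.5]
frame_dists = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8], [0.1, 0.9]]
groups = group_frames(info)
third_sequence = merge_distributions(frame_dists, groups)
```

With these inputs, frame 2 is split between the first and second groups and frame 4 between the second and third, which produces three groups and therefore three predicted characters.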
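According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 4, where determining the target probability sequence according to the first probability sequence and the third probability sequence includes:
For each predicted character, determining the weighted sum of the first text probability distribution of that predicted character in the first probability sequence and the second text probability distribution of that predicted character in the third probability sequence as the target probability distribution of that predicted character.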
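The per-character fusion is a simple interpolation of two distributions. The sketch below assumes that the two weights sum to 1 and that both distributions are defined over the same character set in the same order; the function name `fuse` and the weight value 0.6 are hypothetical, since the text does not specify how the weights are chosen.

```python
def fuse(first_dist, second_dist, weight_first=0.5):
    """Weighted sum of two text probability distributions for one character."""
    if len(first_dist) != len(second_dist):
        raise ValueError("distributions must cover the same character set")
    weight_second = 1.0 - weight_first  # assumption: the weights sum to 1
    return [weight_first * p + weight_second * q
            for p, q in zip(first_dist, second_dist)]


# Hypothetical distributions for a single predicted character.
first = [0.7, 0.2, 0.1]      # from the first probability sequence
second = [0.5, 0.25, 0.25]   # from the third probability sequence
target = fuse(first, second, weight_first=0.6)
```

Because both inputs sum to 1 and the weights sum to 1, the fused result is itself a valid probability distribution and needs no further normalization.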
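According to one or more embodiments of the present disclosure, Example 7 provides the method of any one of Examples 1-6, where the first prediction model is a CIF model and the second prediction model is a CTC model.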
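According to one or more embodiments of the present disclosure, Example 8 provides a speech recognition apparatus, the apparatus including:
An encoding module, configured to encode received speech data to obtain an acoustic vector sequence corresponding to the speech data, where the acoustic vector sequence contains an acoustic vector for each audio frame of the speech data;
A first processing module, configured to obtain, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, where the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution for each predicted character corresponding to the speech data;
A second processing module, configured to obtain a second probability sequence according to the acoustic vector sequence and a second prediction model, where the second probability sequence contains a text probability distribution for each audio frame;
A first determining module, configured to determine a target probability sequence according to the first probability sequence and the second probability sequence, where the target probability sequence contains a target text probability distribution for each predicted character;
A second determining module, configured to determine, according to the target probability sequence, the target text corresponding to the speech data.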
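According to one or more embodiments of the present disclosure, Example 9 provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processing apparatus, implements the steps of the method of any one of Examples 1-7.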
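According to one or more embodiments of the present disclosure, Example 10 provides an electronic device, including:
A storage apparatus on which a computer program is stored;
A processing apparatus, configured to execute the computer program in the storage apparatus to implement the steps of the method of any one of Examples 1-7.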
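The above description is merely of preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.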
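In addition, although the operations are depicted in a specific order, this should not be understood as requiring that the operations be performed in the specific order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.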
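Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.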

Claims (10)

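  1. A speech recognition method, the method comprising:
    encoding received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence contains an acoustic vector for each audio frame of the speech data;
    obtaining, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, wherein the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution for each predicted character corresponding to the speech data;
    obtaining a second probability sequence according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence contains a text probability distribution for each audio frame;
    determining a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence contains a target text probability distribution for each predicted character; and
    determining, according to the target probability sequence, the target text corresponding to the speech data.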
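  2. The method according to claim 1, wherein obtaining, according to the acoustic vector sequence and the first prediction model, the information amount sequence and the first probability sequence corresponding to the speech data comprises:
    inputting the acoustic vector sequence into the first prediction model to obtain the information amount sequence;
    merging the acoustic vectors of the audio frames in the acoustic vector sequence according to the information amount sequence to obtain a character acoustic vector sequence, wherein the character acoustic vector sequence contains the acoustic vector corresponding to each predicted character; and
    decoding the character acoustic vector sequence to obtain the first probability sequence.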
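  3. The method according to claim 1, wherein obtaining the second probability sequence according to the acoustic vector sequence and the second prediction model comprises:
    inputting the acoustic vector sequence into the second prediction model to obtain a predicted probability distribution for each audio frame; and
    for each audio frame, deleting the probability corresponding to a preset character from the predicted probability distribution of that audio frame, and normalizing the resulting predicted probability distribution to obtain the text probability distribution of that audio frame.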
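  4. The method according to claim 1, wherein determining the target probability sequence according to the first probability sequence and the second probability sequence comprises:
    merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain a third probability sequence, wherein the third probability sequence contains a second text probability distribution for each predicted character; and
    determining the target probability sequence according to the first probability sequence and the third probability sequence.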
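  5. The method according to claim 4, wherein merging the text probability distributions of the audio frames in the second probability sequence according to the information amount sequence to obtain the third probability sequence comprises:
    traversing the information amounts in the information amount sequence in sequence order and grouping the audio frames according to the cumulative sum of the information amounts to obtain a plurality of audio frame groups, wherein every audio frame group other than the last one has the same cumulative sum of information amounts, and each audio frame group corresponds to one predicted character; and
    for each audio frame group, determining the weighted sum of the text probability distributions of the audio frames in that group as the second text probability distribution of the predicted character corresponding to that group, wherein the weight of each audio frame is determined based on the portion of that frame's information amount that belongs to the group.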
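  6. The method according to claim 4, wherein determining the target probability sequence according to the first probability sequence and the third probability sequence comprises:
    for each predicted character, determining the weighted sum of the first text probability distribution of that predicted character in the first probability sequence and the second text probability distribution of that predicted character in the third probability sequence as the target probability distribution of that predicted character.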
  7. The method according to any one of claims 1-6, wherein the first prediction model is a CIF model and the second prediction model is a CTC model.
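  8. A speech recognition apparatus, the apparatus comprising:
    an encoding module, configured to encode received speech data to obtain an acoustic vector sequence corresponding to the speech data, wherein the acoustic vector sequence contains an acoustic vector for each audio frame of the speech data;
    a first processing module, configured to obtain, according to the acoustic vector sequence and a first prediction model, an information amount sequence and a first probability sequence corresponding to the speech data, wherein the information amount sequence contains the information amount of each audio frame, and the first probability sequence contains a first text probability distribution for each predicted character corresponding to the speech data;
    a second processing module, configured to obtain a second probability sequence according to the acoustic vector sequence and a second prediction model, wherein the second probability sequence contains a text probability distribution for each audio frame;
    a first determining module, configured to determine a target probability sequence according to the first probability sequence and the second probability sequence, wherein the target probability sequence contains a target text probability distribution for each predicted character; and
    a second determining module, configured to determine, according to the target probability sequence, the target text corresponding to the speech data.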
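  9. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1-7.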
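  10. An electronic device, comprising:
    a storage apparatus on which a computer program is stored; and
    a processing apparatus, configured to execute the computer program in the storage apparatus to implement the steps of the method according to any one of claims 1-7.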
PCT/CN2022/091477 2021-06-30 2022-05-07 Speech recognition method, apparatus, medium, and electronic device WO2023273610A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110738271.7 2021-06-30
CN202110738271.7A CN113327599B (zh) 2021-06-30 2021-06-30 Speech recognition method, apparatus, medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2023273610A1 true WO2023273610A1 (zh) 2023-01-05

Family

ID=77423552

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091477 WO2023273610A1 (zh) 2021-06-30 2022-05-07 Speech recognition method, apparatus, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113327599B (zh)
WO (1) WO2023273610A1 (zh)

Also Published As

Publication number Publication date
CN113327599A (zh) 2021-08-31
CN113327599B (zh) 2023-06-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831437

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18288531

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22831437

Country of ref document: EP

Kind code of ref document: A1