WO2021218843A1 - Streaming end-to-end speech recognition method, apparatus, and electronic device (流式端到端语音识别方法、装置及电子设备) - Google Patents

Streaming end-to-end speech recognition method, apparatus, and electronic device (流式端到端语音识别方法、装置及电子设备)

Info

Publication number
WO2021218843A1
WO2021218843A1 PCT/CN2021/089556 CN2021089556W WO2021218843A1 WO 2021218843 A1 WO2021218843 A1 WO 2021218843A1 CN 2021089556 W CN2021089556 W CN 2021089556W WO 2021218843 A1 WO2021218843 A1 WO 2021218843A1
Authority
WO
WIPO (PCT)
Prior art keywords
output
block
voice
location
activation point
Prior art date
Application number
PCT/CN2021/089556
Other languages
English (en)
French (fr)
Inventor
张仕良
高志付
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority to EP21796134.1A priority Critical patent/EP4145442A4/en
Publication of WO2021218843A1 publication Critical patent/WO2021218843A1/zh
Priority to US17/976,464 priority patent/US20230064756A1/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • This application relates to the technical field of streaming end-to-end speech recognition, in particular to a streaming end-to-end speech recognition method, device and electronic equipment.
  • Voice recognition technology is a technology that allows machines to convert voice signals into corresponding texts or commands through the process of recognition and understanding.
  • End-to-end speech recognition has received increasing attention from academia and industry.
  • End-to-end speech recognition jointly optimizes the acoustic model and the language model within a single model, which not only greatly reduces the complexity of system training but also yields a significant performance improvement.
  • However, most end-to-end speech recognition systems mainly perform offline speech recognition and cannot perform streaming, real-time recognition. That is, they can recognize speech and output the result only after the user finishes speaking a sentence, rather than producing results as the speech is heard.
  • The MoCHA model implements a streaming end-to-end speech recognition scheme based on an attention-mechanism (Attention-Encoder-Decoder) end-to-end system.
  • In MoCHA, the streaming speech is first converted into speech acoustic features and input to the Encoder; the Attention module then determines the activation points requiring decoding output, and the Decoder outputs the specific recognition result (also called a token; for example, one Chinese character can correspond to one token) at the position of each activation point.
  • When training the Attention model, it is usually necessary to take a complete spoken sentence as a sample and annotate the locations of the activation points in the speech.
  • During prediction, the Attention model computes an Attention coefficient for each frame of the received speech stream and determines activation points by comparing the coefficient with a preset threshold: if the Attention coefficient of a frame exceeds the threshold, that frame is taken as an activation point and the decoder is notified to output a token at its position.
  • the location of the activation point that needs to be decoded and output is determined according to the prediction result, so that the decoder can decode at the location of the activation point and output the recognition result.
  • the training sample set is input into the prediction model for model training.
  • after the cloud service system receives a call request from an application system, it receives the voice stream provided by the application system;
  • a method for obtaining speech recognition information includes:
  • a method for implementing self-service court case filing, including:
  • the prediction result determine the location of the activation point that needs to be decoded and output, so that the decoder can decode at the location of the activation point and determine the recognition result
  • the recognition result is entered into the associated case filing information database.
  • a method for upgrading terminal equipment includes:
  • a streaming end-to-end speech recognition device includes:
  • the encoding unit is used to extract and encode voice acoustic features of the received voice stream in units of frames;
  • the prediction unit is used to perform block processing on the frame that has been coded, and predict the number of activation points that need to be coded and output contained in the same block;
  • the activation point position determining unit is configured to determine the location of the activation point that needs to be decoded and output according to the prediction result, so that the decoder can decode at the location of the activation point and output the recognition result.
  • a device for establishing a predictive model including:
  • the training sample set obtaining unit is configured to obtain a training sample set, where the training sample set includes multiple pieces of block data and annotation information, each piece of block data includes encoding results obtained by separately encoding multiple frames of a voice stream, and
  • the annotation information includes the number of activation points requiring decoding output contained in each block;
  • the input unit is used to input the training sample set into the prediction model for model training.
  • a device for providing voice recognition services, applied to a cloud service system includes:
  • the voice stream receiving unit is configured to receive the voice stream provided by the application system after receiving the call request of the application system;
  • the encoding unit is used to extract and encode voice acoustic features of the received voice stream in units of frames;
  • the prediction unit is used to perform block processing on the frame that has been coded, and predict the number of activation points that need to be coded and output contained in the same block;
  • the activation point position determining unit is configured to determine the location of the activation point that needs to be decoded and output according to the prediction result, so that the decoder decodes the location where the activation point is located to obtain a speech recognition result;
  • the recognition result returning unit is used to return the voice recognition result to the application system.
  • a device for obtaining speech recognition information, applied to an application system includes:
  • the submission unit is used to submit a call request and the voice stream to be recognized to the cloud service system by calling the interface provided by the cloud service system; the cloud service system performs voice acoustic feature extraction and encoding on the received voice stream in units of frames, performs block processing on the encoded frames, and predicts the number of activation points requiring decoding output contained in the same block; after the locations of the activation points requiring decoding output are determined according to the prediction result, a decoder decodes at the locations of the activation points to obtain a speech recognition result;
  • the recognition result receiving unit is configured to receive the voice recognition result returned by the cloud service system.
  • the request receiving unit is used to receive voice input request information for filing a case
  • the encoding unit is used to extract and encode voice acoustic features of the received voice stream in units of frames;
  • the prediction unit is used to perform block processing on the frame that has been coded, and predict the number of activation points that need to be coded and output contained in the same block;
  • the activation point position determining unit is configured to determine the location of the activation point that needs to be decoded and output according to the prediction result, so that the decoder can decode at the location of the activation point and determine the recognition result;
  • the information entry unit is used to enter the recognition result into the associated case filing information database.
  • a terminal equipment upgrade device including:
  • the upgrade suggestion providing unit is used to provide upgrade suggestion information to the terminal device
  • the authority granting unit is configured to, after receiving an upgrade request submitted by the terminal device, grant the terminal device permission to perform streaming voice recognition in an upgraded manner, where performing streaming voice recognition in the upgraded manner includes: performing voice acoustic feature extraction and encoding on the received voice stream in units of frames, performing block processing on the encoded frames, and predicting the number of activation points requiring decoding output contained in the same block; after the locations of the activation points requiring decoding output are determined according to the prediction result, the decoder decodes at the locations of the activation points to obtain a speech recognition result.
  • In the process of recognizing the speech stream, the encoded frames can be divided into blocks, and the number of activation points requiring decoding output contained in each block can be predicted. The specific locations of the activation points within a block can then be determined according to the prediction result, guiding the decoder to decode and output the recognition result at the corresponding activation point positions. Because it is no longer necessary to compare Attention coefficients with a threshold to determine activation point positions, and the decision is not affected by future frames, accuracy can be improved.
  • Because predicting the number of activation points contained in a block can readily be done with high accuracy, the mismatch between training and prediction is relatively low, which improves the robustness of the streaming end-to-end speech recognition system to noise and keeps the impact on system performance small.
  • Figure 1 is a schematic diagram of a solution provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of the first method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a second method provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a third method provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a fourth method provided by an embodiment of the present application.
  • FIG. 7 is a flowchart of a fifth method provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a sixth method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a first device provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a second device provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a third device provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a fourth device provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a fifth device provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a sixth device provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of an electronic device provided by an embodiment of the present application.
  • In the embodiments of this application, a prediction module can be added on top of the attention-based end-to-end speech recognition system.
  • The function of the prediction module is to first divide the encoder output into blocks, for example one block every 5 frames, and so on.
  • The number of activation points (tokens) requiring decoding output contained in each block can then be predicted. The locations of the activation points can be determined from the predicted count for each block, and the decoder is instructed to decode and output at those locations.
  • The positions of the activation points can be determined by combining information such as the Attention coefficient of each frame. Specifically, assuming each block includes 5 frames and a block is predicted to contain two activation points requiring decoding output, the positions of the two frames with the largest Attention coefficients in that block can be determined as the activation point positions, and the decoder then decodes and outputs at those positions.
  • The judgment of activation point positions thus no longer depends on a manually set Attention coefficient threshold; instead, the predicted number of activation points in each block serves as guidance.
  • Within a block, the positions of the one or more frames with the largest Attention coefficients can be taken as the activation point positions.
  • There may also be a mismatch between the process of training the prediction module and the process of actually using it for testing.
  • However, the only mismatch is that training can use the true number of output tokens in each block (Cm), whereas at test time only the predictor's output can be used.
  • Because the accuracy of predicting how many activation points each block contains is very high (above 95% on the relevant tasks), the mismatch between training and testing is very low, so a significant performance improvement can be obtained relative to the existing MoCHA scheme.
  • Experiments show that the streaming speech recognition solution provided by the embodiments of this application performs essentially without loss compared with offline speech recognition based on whole-sentence attention.
  • A cloud service system may provide a cloud speech recognition service; if streaming end-to-end speech recognition is required in that service, it can be implemented with the streaming speech recognition module provided in the embodiments of this application.
  • The cloud service system can provide the specific prediction model and expose a cloud speech recognition interface to users; multiple users can call this interface from their respective application systems, and after receiving a call, the cloud service system can run the corresponding processing program to perform streaming speech recognition and return the recognition result.
  • the solutions provided in the embodiments of this application can also be used for voice recognition in a localized voice recognition system or device, for example, a navigation robot in a shopping mall, a self-service case filing machine in a court, etc.
  • the first embodiment provides a streaming end-to-end speech recognition method, referring to Figure 3, including:
  • S301 Perform voice acoustic feature extraction and encoding on the received voice stream in units of frames;
  • the speech acoustic feature can be extracted from the speech stream in units of frames, and encoded in units of frames, and the encoder will output the encoding results corresponding to each frame.
  • the operation of encoding the voice stream may also be performed continuously. For example, assuming that 60ms is a frame, as the voice stream is received, the voice stream of 60ms is used as a frame for feature extraction and encoding processing.
  • the purpose of the encoding process is to transform the received voice acoustic features into a new and more distinguishable high-level expression, which can usually exist in the form of a vector. Therefore, the encoder can be a multi-layer neural network, and there are many choices for neural networks, such as DFSMN, CNN, BLSTM, Transformer, and so on.
  • S302 Perform block processing on the frame that has been encoded, and predict the number of activation points that need to be encoded and output included in the same block;
  • block processing can be performed first, and the number of activation points can be predicted in units of blocks.
  • a block may include multiple frames
  • The encoding results can be buffered first; each time the number of buffered frame encoding results reaches the number of frames corresponding to one block, the currently buffered frame encoding results are taken as one block, and the prediction module predicts the number of activation points requiring decoding output contained in that block. For example, if every 5 frames correspond to one block, a prediction can be performed each time the encoder finishes encoding 5 frames of the speech stream.
  • the encoding results of each frame corresponding to the block may also be deleted from the cache.
  • In an attention-mechanism system, the encoding result of each frame is usually also used to compute the Attention coefficients, and can be weighted and summed with the Attention coefficients before being provided to the decoder as its input. Therefore, to avoid the prediction module and the Attention module affecting or conflicting with each other over the data, the encoder output can be provided to the prediction module and the Attention module separately; the two modules can use different buffer spaces, and each processes the encoder output data in its own buffer space to obtain, respectively, the prediction of the number of activation points and the computation of the Attention coefficients.
  • Because the encoder output needs to be divided into blocks, the speech recognition process may incur a certain delay, the size of which depends on the block size. For example, with one block every 5 frames, the delay is the duration of 5 frames, and so on.
  • the block size can be determined according to the tolerable delay time of the system. For example, in extreme cases, each frame can be used as a block, and so on.
  • the specific prediction module it can be realized through a pre-trained model.
  • a training sample set can be prepared, which includes the encoding result corresponding to the speech stream, divided into blocks according to a certain size, and the number of activation points that need to be output contained in each block can be marked.
  • the above-mentioned sample information and labeling information can be input into the initialized model, and the model parameters can be gradually optimized through multiple rounds of iteration, until the algorithm converges to end the training.
  • the process of adjusting the parameters may specifically be a process of adjusting the weights of each layer in the deep learning model.
  • the prediction model After completing the training of the prediction model, as long as the coding result contained in the same block is input to the prediction model, the prediction model can output information about the number of activation points in the block that need to be decoded and output.
  • In a voice stream, the average duration of one character is usually about 200 ms; that is, while a user is speaking, the pronunciation of each word may last about 200 ms (of course, the actual duration may differ between speakers because of different speaking rates).
  • The pronunciation of the same modeling unit (for example, one Chinese character, or one English word) may therefore be distributed over multiple consecutive frames.
  • The same modeling unit usually only needs to be decoded and output at one of those frames, with the features of the surrounding frames associated with that frame.
  • In the embodiments of this application, multiple frames are grouped into one block, so the frames of the same modeling unit may be split across different blocks; this situation can be covered by the training samples so that it is handled correctly at test time.
  • S303 Determine, according to the prediction result, the location of the activation point that needs to be decoded and output, so that the decoder can decode at the location of the activation point and output the recognition result.
  • the location of the activation point can be further determined, so that the decoder can decode at the location of the activation point and output the recognition result.
  • If a block includes only the encoding result of a single frame, the prediction result is either 0 or 1.
  • The problem of predicting the number of activation points contained in a block then becomes the problem of predicting whether each block contains an activation point requiring decoding output. That is, the specific prediction result may be whether the current block contains an activation point requiring decoding output.
  • In that case, the location of the block containing the activation point may be directly determined as the location of the activation point.
  • the specific prediction results can be as shown in Table 1:
  • the location of the specific activation point can be directly determined.
  • The Attention coefficient can also, to some extent, reflect whether a frame is an activation point, or the probability that a frame is an activation point. Therefore, the Attention coefficient of each frame's encoding result can additionally be determined, where the Attention coefficient describes the probability that the corresponding frame requires decoding output. The prediction result can then be verified against the Attention coefficients.
  • A threshold for the Attention coefficient can be set in advance.
  • If the block-level prediction indicates that a frame is an activation point but its Attention coefficient is low, the prediction module can re-predict by adjusting its strategy, for example by incorporating the features of more surrounding frames, and so on.
  • An Attention coefficient threshold is still used in this approach, but because it is only used to verify the block-level prediction results, it has little impact on the performance of the entire system.
  • the same block may include the encoding results corresponding to multiple frames of voice streams.
  • the prediction module can only predict how many activation points are included in the same block, but cannot directly determine which frame the activation point is in the block. Therefore, in specific implementation, the Attention coefficient of each frame can also be combined to determine the location of the activation point. Specifically, firstly, the attention coefficient of the coding result of each frame can be determined separately. Then, according to the number of activation points included in the block, the position of the corresponding number of frames with the highest Attention coefficient in the encoding result of each frame included in the block can be determined as the position of the activation point.
  • the positions of the two frames with the highest Attention coefficients in each frame included in the block can be determined as the positions of the two activation points.
  • the specific prediction result of the number of activation points, the Attention coefficient situation, and the position information of the determined activation points can be as shown in Table 2:
  • With every 5 frames forming one block, frames 0-4 are divided into one block, frames 5-9 into the next block, and so on.
  • The prediction module predicts that the first block contains 1 activation point, and the Attention coefficients of frames 0-4 are computed as 0.01, 0.22, 0.78, 0.95, and 0.75; the position of the frame with the highest Attention coefficient among frames 0-4, namely frame 3, is therefore determined as the activation point, and the other frames in this block are not activation points and require no decoding output.
  • Similarly, the prediction module predicts that the second block contains 2 activation points, and the Attention coefficients of frames 5-9 are computed as 0.63, 0.88, 0.72, 0.58, and 0.93; the positions of the two frames with the highest Attention coefficients, namely frames 6 and 9, are therefore determined as activation points, and the other frames in this block require no decoding output.
  • the size of the specific block may be preset, or the initial value may also be preset, and during the specific test process, it may also be dynamically adjusted according to the actual voice stream. Specifically, as mentioned above, due to different speech speeds of different users, the number of modeling units (the number of Chinese characters, etc.) and density input within the same length of time may be different. For this reason, in specific implementation, the size of the block can also be adaptively adjusted according to the predicted occurrence frequency of the activation point. For example, if it is found that the frequency of activation points is high during a certain prediction process, the block can be reduced to shorten the delay. On the contrary, the block can be expanded, so that the recognition delay of the system can change with the speech rate of the inputter.
  • Through the embodiments of the present application, in the process of recognizing the speech stream, the encoded frames can be divided into blocks and the number of activation points requiring decoding output contained in each block can be predicted. The specific locations of the activation points can then be determined within each block according to the prediction result, guiding the decoder to decode and output the recognition result at the corresponding positions. Because it is no longer necessary to compare the Attention coefficient with a threshold to determine activation point positions, and the decision is not affected by future frames, accuracy can be improved.
  • Because predicting the number of activation points contained in a block can readily be done with high accuracy, the mismatch between training and prediction is relatively low, which improves the robustness of the streaming end-to-end speech recognition system to noise and keeps the impact on system performance small.
  • the second embodiment provides a method for establishing a prediction model.
  • the method may specifically include:
  • S401 Obtain a training sample set, where the training sample set includes multiple pieces of block data and annotation information; each piece of block data includes encoding results obtained by separately encoding multiple frames of a voice stream, and the annotation information includes the number of activation points requiring decoding output contained in each block;
  • S402 Input the training sample set into the prediction model to train the model.
  • The training sample set may include cases in which multiple frames of a speech stream corresponding to the same modeling unit are divided into different blocks. In this way, the case in which the same character or other modeling unit is split across several blocks is covered in training, so that accurate prediction results can be obtained when the same situation is encountered during testing.
  • the third embodiment is an introduction to the scenario when the solution provided by the embodiment of this application is applied in a cloud service system. Specifically, the third embodiment first provides a method for providing voice recognition services from the perspective of the cloud server. Referring to Figure 5, the method may specifically include:
  • the cloud service system After receiving the call request of the application system, the cloud service system receives the voice stream provided by the application system;
  • S502 Perform voice acoustic feature extraction and encoding on the received voice stream in units of frames;
  • S503 Perform block processing on the frame that has been encoded, and predict the number of activation points that need to be encoded and output included in the same block;
  • S504 Determine, according to the prediction result, the location of the activation point that needs to be decoded and output, so that the decoder can decode at the location of the activation point to obtain a speech recognition result;
  • S602 Receive a voice recognition result returned by the cloud service system.
  • S702 Perform voice acoustic feature extraction and encoding on the received voice stream in units of frames;
  • S704 Determine, according to the prediction result, the location of the activation point that needs to be decoded and output, so that the decoder can decode at the location of the activation point and determine the recognition result;
  • The models needed in the speech recognition process only need to be stored on the server side, and the terminal device usually does not need hardware upgrades.
  • Because streaming speech recognition usually involves collecting user data and submitting it to the server, the server can first push an upgrade suggestion to the specific hardware device; if the user wants to upgrade the device, the user can express this need by voice input or other means, after which the specific upgrade request is submitted to the server, and the server processes it.
  • The server can also check the status of the specific hardware device, for example whether the associated user has paid the corresponding resources to obtain the upgraded service; if so, the device can be granted permission to perform streaming speech recognition in the upgraded way.
  • The hardware device can then perform streaming voice recognition in the manner provided in the embodiments of the present application in subsequent dialogues with the user.
  • The streaming speech recognition function can be completed by the server; or, if the hardware resources of the device itself can support it, the upgraded recognition model can be pushed directly to the device, which then completes streaming speech recognition locally, and so on.
  • a "switch" function can also be provided, so that users can use the above functions only when necessary, in order to save resources and other purposes.
  • the server can temporarily turn off the function for the user. If charging is involved, it can also trigger the stop of charging.
  • the hardware device can go back to the original way to perform streaming speech recognition, and it may be acceptable even to wait until the user has spoken a sentence before recognizing it. Later, if the user needs to use the hardware device in a work scenario, he can also re-enable the advanced functions provided in the embodiments of the present application, and so on.
  • Embodiment 6 of the present application provides a device upgrade method.
  • the method may include:
  • Performing streaming voice recognition in the upgraded manner includes: performing voice acoustic feature extraction and encoding on the received speech stream in units of frames, performing block processing on the encoded frames, and predicting the number of activation points requiring decoding output contained in the same block; after the locations of the activation points requiring decoding output are determined according to the prediction result, the decoder decodes at the locations of the activation points to obtain a voice recognition result.
  • the terminal device may specifically include a smart speaker device and the like.
  • the terminal device may also close the right to use the upgraded method for streaming voice recognition for the terminal device according to the downgrade request submitted by the terminal device.
  • The embodiments of this application may involve the use of user data. In actual applications, user-specific personal data may be used in the solutions described herein within the scope permitted by the applicable laws and regulations of the country concerned (for example, with the user's explicit consent and with the user duly informed).
  • this embodiment of the present application also provides a streaming end-to-end speech recognition device.
  • the device may specifically include:
  • the encoding unit 901 is configured to extract and encode voice acoustic features of the received voice stream in units of frames;
  • the prediction unit 902 is configured to perform block processing on a frame that has been coded, and predict the number of activation points that need to be coded and output contained in the same block;
  • the activation point position determining unit 903 is configured to determine, according to the prediction result, the location of the activation point that needs to be decoded and output, so that the decoder can decode at the location of the activation point and output the recognition result.
  • the block includes an encoding result corresponding to a frame of voice stream
  • the prediction result includes: whether the current block contains activation points that need to be encoded and output;
  • the activation point position determining unit may be specifically used for:
  • the location of the segment containing the activation point is determined as the location of the activation point.
  • the attention coefficient determining unit is used to determine the attention coefficient of each frame encoding result; the attention coefficient is used to describe the probability that the corresponding frame needs to be decoded and output;
  • the verification unit is configured to verify the prediction result according to the Attention coefficient.
  • the device may also include:
  • the activation point position determining unit may be specifically configured to:
  • the device may also include:
  • the prediction unit may specifically include:
  • the device may also include:
  • the embodiment of the present application also provides a device for establishing a prediction model.
  • the device includes:
  • the input unit 1002 is used to input the training sample set into the prediction model for model training.
  • the training sample set includes a situation where multiple frames of speech streams corresponding to the same modeling unit are divided into different blocks.
  • the embodiment of the present application also provides a device for providing voice recognition services.
  • the device is applied to a cloud service system and includes:
  • the voice stream receiving unit 1101 is configured to receive the voice stream provided by the application system after receiving the call request of the application system;
  • the prediction unit 1103 is configured to perform block processing on a frame that has been coded, and predict the number of activation points that need to be coded and output contained in the same block;
  • the recognition result returning unit 1105 is configured to return the speech recognition result to the application system.
  • the embodiment of the present application also provides a device for obtaining voice recognition information.
  • the device is applied to an application system and includes:
  • the submission unit 1201 is configured to submit, by calling an interface provided by the cloud service system, a call request and a voice stream to be recognized to the cloud service system; the cloud service system performs voice acoustic feature extraction and encoding on the received voice stream in units of frames, performs block processing on the encoded frames, and predicts the number of activation points requiring decoding output contained in the same block; after the locations of the activation points requiring decoding output are determined according to the prediction result, a decoder decodes at the locations of the activation points to obtain a voice recognition result;
  • the recognition result receiving unit 1202 is configured to receive the voice recognition result returned by the cloud service system.
  • the embodiment of the present application also provides a court self-service case filing implementation device.
  • the device is applied to the self-service case filing all-in-one device and includes:
  • the request receiving unit 1301 is configured to receive voice input request information for filing a case
  • the encoding unit 1302 is configured to extract and encode voice acoustic features of the received voice stream in units of frames;
  • the activation point position determining unit 1304 is configured to determine the location of the activation point that needs to be decoded and output according to the prediction result, so that the decoder can decode at the location of the activation point and determine the recognition result;
  • the embodiment of the present application also provides a terminal device upgrade device.
  • the device may include:
  • the upgrade suggestion providing unit 1401 is configured to provide the terminal device with upgrade suggestion information
  • the authority granting unit 1402 is configured to, after receiving an upgrade request submitted by the terminal device, grant the terminal device permission to perform streaming voice recognition in an upgraded manner, where performing streaming voice recognition in the upgraded manner includes: performing voice acoustic feature extraction and encoding on the received voice stream in units of frames, performing block processing on the encoded frames, and predicting the number of activation points requiring decoding output contained in the same block; after the locations of the activation points requiring decoding output are determined according to the prediction result, the decoder decodes at the locations of the activation points to obtain a speech recognition result.
  • an electronic device, including:
  • one or more processors and a memory associated with the one or more processors, where the memory is used to store program instructions that, when read and executed by the one or more processors, perform the steps of the method described in the foregoing method embodiments.
  • FIG. 15 exemplarily shows the architecture of an electronic device.
  • the device 1500 can be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, an aircraft, and so on.
  • the device 1500 may include one or more of the following components: a processing component 1502, a memory 1504, a power supply component 1506, a multimedia component 1508, an audio component 1510, an input and output (I/O) interface 1512, a sensor 1514, and a communication component 1516.
  • the processing component 1502 generally controls the overall operations of the device 1500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 1502 may include one or more processors 1520 to execute instructions to complete all or part of the steps of the method provided by the technical solution of the present disclosure.
  • the processing component 1502 may include one or more modules to facilitate the interaction between the processing component 1502 and other components.
  • the processing component 1502 may include a multimedia module to facilitate the interaction between the multimedia component 1508 and the processing component 1502.
  • the power supply component 1506 provides power for various components of the device 1500.
  • the power supply component 1506 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the device 1500.
  • the input and output interface 1512 provides an interface between the processing component 1502 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • the communication component 1516 is configured to facilitate wired or wireless communication between the device 1500 and other devices.
  • the device 1500 can access a wireless network based on a communication standard, such as WiFi, or a mobile communication network such as 2G, 3G, 4G/LTE, and 5G.
  • the communication component 1516 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 1516 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the device 1500 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components to implement the above methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A streaming end-to-end speech recognition method, apparatus, and electronic device. The method includes: performing speech acoustic feature extraction and encoding on a received speech stream in units of frames (S301); performing block processing on the frames whose encoding has been completed, and predicting the number of activation points requiring decoding output contained in the same block (S302); and determining, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points and outputs a recognition result (S303). The method improves the robustness of a streaming end-to-end speech recognition system to noise, thereby improving system performance and accuracy.

Description

Streaming end-to-end speech recognition method, apparatus, and electronic device
This application claims priority to Chinese Patent Application No. 202010366907.5, filed on April 30, 2020 and entitled "Streaming end-to-end speech recognition method, apparatus, and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of streaming end-to-end speech recognition, and in particular to a streaming end-to-end speech recognition method, apparatus, and electronic device.
Background
Speech recognition technology enables machines to convert speech signals into corresponding text or commands through processes of recognition and understanding. Among these technologies, end-to-end speech recognition has received increasingly broad attention from academia and industry. Compared with traditional hybrid systems, end-to-end speech recognition jointly optimizes the acoustic model and the language model within a single model, which not only greatly reduces the complexity of system training but also yields a significant performance improvement. However, most end-to-end speech recognition systems mainly perform offline speech recognition and cannot perform streaming, real-time recognition. That is, they can recognize speech and output the result only after the user has finished speaking a sentence, rather than producing the result as the speech is heard.
Some researchers have proposed streaming end-to-end speech recognition schemes, but the results are not satisfactory. For example, the MoCHA model implements a streaming end-to-end speech recognition scheme on top of an attention-based (Attention-Encoder-Decoder) end-to-end system. In MoCHA, the streaming speech is first converted into speech acoustic features and input to the Encoder; the Attention module then determines the activation points requiring decoding output, and the Decoder outputs the specific recognition result (also called a token; for example, one Chinese character can correspond to one token) at the position of each activation point.
When training the Attention model, it is usually necessary to take a complete spoken sentence as a sample and annotate the locations of the activation points in the speech. During prediction, however, because streaming speech recognition is being performed, the input to the model is a streaming speech signal rather than a complete sentence. The Attention model therefore computes an Attention coefficient for each received frame of the speech stream and determines activation points by comparing it with a preset threshold: if the Attention coefficient of a frame exceeds the threshold, that frame is taken as an activation point and the Decoder is notified to output a token at its position. As a result, there is a large mismatch between training and testing in MoCHA, which makes MoCHA less robust to noise, so that a MoCHA-based streaming end-to-end speech recognition system suffers a large performance loss on real tasks. In addition, because the input is a continuous streaming speech signal, the situation of future frames is unknown when the Attention coefficient of the current frame is computed; even if the current frame's coefficient exceeds the threshold, the next frame's coefficient may be larger, in which case taking the next frame as the activation point would be more accurate. MoCHA therefore also locates activation points with relatively low accuracy.
Therefore, how to improve the robustness of a streaming end-to-end speech recognition system to noise, and thereby improve system performance and accuracy, is a technical problem to be solved by those skilled in the art.
Summary of the Invention
This application provides a streaming end-to-end speech recognition method, apparatus, and electronic device, which can improve the robustness of a streaming end-to-end speech recognition system to noise and thereby improve system performance and accuracy.
This application provides the following solutions:
A streaming end-to-end speech recognition method, including:
performing speech acoustic feature extraction and encoding on a received speech stream in units of frames;
performing block processing on frames whose encoding has been completed, and predicting the number of activation points requiring decoding output contained in the same block;
determining, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points and outputs a recognition result.
A method for building a prediction model, including:
obtaining a training sample set, where the training sample set includes multiple pieces of block data and annotation information, each piece of block data includes encoding results obtained by separately encoding multiple frames of a speech stream, and the annotation information includes the number of activation points requiring decoding output contained in each block;
inputting the training sample set into the prediction model for model training.
A method for providing a speech recognition service, including:
receiving, by a cloud service system after receiving a call request from an application system, a speech stream provided by the application system;
performing speech acoustic feature extraction and encoding on the received speech stream in units of frames;
performing block processing on frames whose encoding has been completed, and predicting the number of activation points requiring decoding output contained in the same block;
determining, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points to obtain a speech recognition result;
returning the speech recognition result to the application system.
A method for obtaining speech recognition information, including:
submitting, by an application system through calling an interface provided by a cloud service system, a call request and a speech stream to be recognized to the cloud service system, where the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream in units of frames, performs block processing on frames whose encoding has been completed, and predicts the number of activation points requiring decoding output contained in the same block, and, after determining according to the prediction result the locations of the activation points requiring decoding output, a decoder decodes at the locations of the activation points to obtain a speech recognition result;
receiving the speech recognition result returned by the cloud service system.
A method for implementing self-service court case filing, including:
receiving, by a self-service case-filing all-in-one machine, case-filing request information input by voice;
performing speech acoustic feature extraction and encoding on the received speech stream in units of frames;
performing block processing on frames whose encoding has been completed, and predicting the number of activation points requiring decoding output contained in the same block;
determining, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points and determines a recognition result;
entering the recognition result into an associated case-filing information database.
A terminal device upgrade method, including:
providing upgrade suggestion information to a terminal device;
after receiving an upgrade request submitted by the terminal device, granting the terminal device permission to perform streaming speech recognition in an upgraded manner, where performing streaming speech recognition in the upgraded manner includes: performing speech acoustic feature extraction and encoding on a received speech stream in units of frames, performing block processing on frames whose encoding has been completed, and predicting the number of activation points requiring decoding output contained in the same block; and, after determining according to the prediction result the locations of the activation points requiring decoding output, decoding at the locations of the activation points by a decoder to obtain a speech recognition result.
A streaming end-to-end speech recognition apparatus, including:
an encoding unit, configured to perform speech acoustic feature extraction and encoding on a received speech stream in units of frames;
a prediction unit, configured to perform block processing on frames whose encoding has been completed, and predict the number of activation points requiring decoding output contained in the same block;
an activation point position determining unit, configured to determine, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points and outputs a recognition result.
An apparatus for building a prediction model, including:
a training sample set obtaining unit, configured to obtain a training sample set, where the training sample set includes multiple pieces of block data and annotation information, each piece of block data includes encoding results obtained by separately encoding multiple frames of a speech stream, and the annotation information includes the number of activation points requiring decoding output contained in each block;
an input unit, configured to input the training sample set into the prediction model for model training.
An apparatus for providing a speech recognition service, applied to a cloud service system and including:
a speech stream receiving unit, configured to receive, after a call request from an application system is received, a speech stream provided by the application system;
an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream in units of frames;
a prediction unit, configured to perform block processing on frames whose encoding has been completed, and predict the number of activation points requiring decoding output contained in the same block;
an activation point position determining unit, configured to determine, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points to obtain a speech recognition result;
a recognition result returning unit, configured to return the speech recognition result to the application system.
An apparatus for obtaining speech recognition information, applied to an application system and including:
a submission unit, configured to submit, by calling an interface provided by a cloud service system, a call request and a speech stream to be recognized to the cloud service system, where the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream in units of frames, performs block processing on frames whose encoding has been completed, and predicts the number of activation points requiring decoding output contained in the same block, and, after the locations of the activation points requiring decoding output are determined according to the prediction result, a decoder decodes at the locations of the activation points to obtain a speech recognition result;
a recognition result receiving unit, configured to receive the speech recognition result returned by the cloud service system.
An apparatus for implementing self-service court case filing, applied to a self-service case-filing all-in-one machine and including:
a request receiving unit, configured to receive case-filing request information input by voice;
an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream in units of frames;
a prediction unit, configured to perform block processing on frames whose encoding has been completed, and predict the number of activation points requiring decoding output contained in the same block;
an activation point position determining unit, configured to determine, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points and determines a recognition result;
an information entry unit, configured to enter the recognition result into an associated case-filing information database.
A terminal device upgrade apparatus, including:
an upgrade suggestion providing unit, configured to provide upgrade suggestion information to a terminal device;
a permission granting unit, configured to, after an upgrade request submitted by the terminal device is received, grant the terminal device permission to perform streaming speech recognition in an upgraded manner, where performing streaming speech recognition in the upgraded manner includes: performing speech acoustic feature extraction and encoding on a received speech stream in units of frames, performing block processing on frames whose encoding has been completed, and predicting the number of activation points requiring decoding output contained in the same block; and, after determining according to the prediction result the locations of the activation points requiring decoding output, decoding at the locations of the activation points by a decoder to obtain a speech recognition result.
According to the specific embodiments provided by this application, this application discloses the following technical effects:
Through the embodiments of this application, in the process of recognizing a speech stream, the frames whose encoding has been completed can be divided into blocks, and the number of activation points requiring decoding output contained in each block can be predicted. The specific locations of the activation points within a block can then be determined according to the prediction result, and the decoder is guided to decode and output the recognition result at the corresponding activation point positions. Because it is no longer necessary to compare Attention coefficients with a threshold to determine activation point positions, and the decision is not affected by future frames, accuracy can be improved. In addition, because predicting the number of activation points contained in a block can readily be done with high accuracy, the mismatch between training and prediction is relatively low, which improves the robustness of the streaming end-to-end speech recognition system to noise and keeps the impact on system performance small.
Of course, any product implementing this application does not necessarily need to achieve all of the above advantages at the same time.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the accompanying drawings required for the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of a solution provided by an embodiment of this application;
Figure 2 is a schematic diagram of a system architecture provided by an embodiment of this application;
Figure 3 is a flowchart of a first method provided by an embodiment of this application;
Figure 4 is a flowchart of a second method provided by an embodiment of this application;
Figure 5 is a flowchart of a third method provided by an embodiment of this application;
Figure 6 is a flowchart of a fourth method provided by an embodiment of this application;
Figure 7 is a flowchart of a fifth method provided by an embodiment of this application;
Figure 8 is a flowchart of a sixth method provided by an embodiment of this application;
Figure 9 is a schematic diagram of a first apparatus provided by an embodiment of this application;
Figure 10 is a schematic diagram of a second apparatus provided by an embodiment of this application;
Figure 11 is a schematic diagram of a third apparatus provided by an embodiment of this application;
Figure 12 is a schematic diagram of a fourth apparatus provided by an embodiment of this application;
Figure 13 is a schematic diagram of a fifth apparatus provided by an embodiment of this application;
Figure 14 is a schematic diagram of a sixth apparatus provided by an embodiment of this application;
Figure 15 is a schematic diagram of an electronic device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application shall fall within the protection scope of this application.
In the embodiments of this application, to improve the robustness of a streaming end-to-end speech recognition system to noise and thereby improve system performance, as shown in Figure 1, a prediction module can be added on top of an attention-based end-to-end speech recognition system. The function of the prediction module is as follows: first, the encoder output can be divided into blocks, for example one block every 5 frames, and so on; and the number of activation points (tokens) requiring decoding output contained in each block can be predicted. The predicted number of activation points in each block can then be used to determine where the activation points are, and the decoder is instructed to decode and output at those positions. For example, in a specific implementation, because the number of activation points in each block has been predicted, the positions of the activation points can be determined by combining information such as the Attention coefficient of each frame. Specifically, assuming each block includes 5 frames and a block is predicted to contain two activation points requiring decoding output, the positions of the two frames with the largest Attention coefficients in that block can be determined as the positions of the activation points, and the decoder then decodes and outputs at those positions. In this way, the judgment of activation point positions no longer depends on a manually set Attention coefficient threshold; instead, the predicted number of activation points in each block serves as guidance, and within a block the positions of the one or more frames with the largest Attention coefficients are taken as the activation point positions.
Of course, in the solution provided by the embodiments of this application, there may also be a mismatch between the process of training the prediction module and the process of actually using it for testing. However, the only mismatch is that training can use the true number of output tokens in each block (Cm), whereas at test time only the predictor's output can be used. Because the accuracy of predicting how many activation points each block contains is very high (above 95% on the relevant tasks), the mismatch between training and testing is very low, so a significant performance improvement can be obtained relative to the existing MoCHA scheme. Experiments show that the streaming speech recognition solution provided by the embodiments of this application performs essentially without loss compared with offline speech recognition based on whole-sentence attention.
In a specific implementation, the technical solution provided by the embodiments of this application can be used in a variety of application scenarios. For example, as shown in Figure 2, a cloud service system may provide a cloud speech recognition service; if streaming end-to-end speech recognition is required in that service, the solution provided by the embodiments of this application can be implemented using a streaming speech recognition module. Specifically, the cloud service system can provide the specific prediction model and expose a cloud speech recognition interface to users; multiple users can call this interface from their respective application systems, and after receiving a call, the cloud service system can run the corresponding processing program to perform streaming speech recognition and return the recognition result. Alternatively, the solution provided by the embodiments of this application can also be used for speech recognition in a localized speech recognition system or device, for example a navigation robot in a shopping mall or a self-service case-filing all-in-one machine in a court.
The specific technical solutions provided by the embodiments of this application are described in detail below.
Embodiment 1
First, Embodiment 1 provides a streaming end-to-end speech recognition method. Referring to Figure 3, the method includes:
S301: Perform speech acoustic feature extraction and encoding on a received speech stream in units of frames.
During streaming speech recognition, speech acoustic features can be extracted from the speech stream in units of frames and encoded in units of frames, and the encoder outputs the encoding result corresponding to each frame. Moreover, because the speech stream is input continuously, the encoding of the speech stream can also be performed continuously. For example, if 60 ms corresponds to one frame, then as the speech stream is received, every 60 ms of speech is treated as one frame for feature extraction and encoding. The purpose of encoding is to transform the received acoustic features into a new, more discriminative high-level representation, which usually takes the form of a vector. The encoder can therefore be a multi-layer neural network, and there are many choices of network, such as DFSMN, CNN, BLSTM, Transformer, and so on.
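Purely as an illustrative sketch of this step (not the implementation disclosed here), frame-wise encoding could look like the following fragment; the 80-dimensional features, the two-layer network, and the 256-dimensional output are assumptions chosen only for illustration:

```python
import torch
import torch.nn as nn

FEAT_DIM = 80    # assumed per-frame acoustic feature dimension (e.g., log-mel)
ENC_DIM = 256    # assumed encoder output dimension

class FrameEncoder(nn.Module):
    """Toy frame-level encoder: maps each frame's acoustic features to a
    higher-level vector (a stand-in for DFSMN/CNN/BLSTM/Transformer)."""
    def __init__(self, feat_dim: int = FEAT_DIM, enc_dim: int = ENC_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, enc_dim), nn.ReLU(),
            nn.Linear(enc_dim, enc_dim), nn.ReLU(),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, feat_dim) -> (num_frames, enc_dim)
        return self.net(frames)

# Encode a chunk of incoming frames as they arrive (5 frames of dummy features).
encoder = FrameEncoder()
incoming_frames = torch.randn(5, FEAT_DIM)
encoded = encoder(incoming_frames)     # one encoding vector per frame
print(encoded.shape)                   # torch.Size([5, 256])
```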
S302: Perform block processing on frames whose encoding has been completed, and predict the number of activation points requiring decoding output contained in the same block.
In the embodiments of this application, after the encoding results are obtained, block processing can be performed first, and the number of activation points is predicted in units of blocks. Because one block may include multiple frames, in one specific implementation, after the encoding of each frame of the speech stream is completed, the encoding result can first be buffered; each time the number of buffered frame encoding results reaches the number of frames corresponding to one block, the currently buffered frame encoding results are taken as one block, and the prediction module predicts the number of activation points requiring decoding output contained in that block. For example, if every 5 frames correspond to one block, a prediction can be performed each time the encoder finishes encoding 5 frames of the speech stream. In an optional implementation, after the prediction for a block is completed, the frame encoding results corresponding to that block can be deleted from the buffer.
Of course, in a concrete attention-mechanism system, the encoding result of each frame is usually also used to compute the Attention coefficients, and can be weighted and summed with the Attention coefficients before being provided to the decoder as its input. Therefore, in a specific implementation, to avoid the prediction module and the Attention module affecting or conflicting with each other over the data, the encoder output can be provided to the prediction module and the Attention module separately, and the two modules can use different buffer spaces; each processes the encoder output data in its own buffer space to obtain, respectively, the prediction of the number of activation points and the computation of the Attention coefficients.
Because the encoder output needs to be divided into blocks, the speech recognition process may incur a certain delay, the size of which depends on the block size. For example, with one block every 5 frames, the delay is the duration of 5 frames, and so on. In a specific implementation, the block size can be determined according to the delay the system can tolerate; in the extreme case, each frame can be treated as a block, and so on.
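The buffering behaviour described above can be illustrated with the following sketch; the block size of 5 frames and the `predict_count` callback are assumptions for illustration and not details fixed by this application:

```python
from typing import Callable, List

class BlockBuffer:
    """Accumulates per-frame encoder outputs and emits a block every
    `block_size` frames, invoking a prediction callback on each block."""
    def __init__(self, block_size: int, predict_count: Callable[[list], int]):
        self.block_size = block_size
        self.predict_count = predict_count
        self._frames: List = []

    def push(self, frame_encoding) -> None:
        self._frames.append(frame_encoding)
        if len(self._frames) == self.block_size:
            block = self._frames
            self._frames = []          # the block's frames are dropped from the buffer
            n_tokens = self.predict_count(block)
            print(f"block of {len(block)} frames -> {n_tokens} activation point(s)")

# Example with a dummy predictor that always predicts one activation point per block.
buf = BlockBuffer(block_size=5, predict_count=lambda block: 1)
for t in range(12):                    # simulate a stream of 12 encoded frames
    buf.push(f"enc_frame_{t}")
```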
The specific prediction module can be implemented with a pre-trained model. To train the model, a training sample set can be prepared that includes the encoding results corresponding to speech streams, divided into blocks of a certain size, with the number of activation points requiring output in each block annotated. When training the prediction model, the above sample information and annotation information can be input into an initialized model, and the model parameters can be gradually optimized through multiple rounds of iteration until the algorithm converges and training ends. If the prediction model is implemented with a deep learning model such as a neural network, the parameter adjustment process can specifically be a process of adjusting the weights of each layer of the deep learning model.
After training of the prediction model is completed, as long as the encoding results contained in the same block are input to the prediction model, the model can output information about the number of activation points in that block that require decoding output.
It should be noted that when the prediction model is trained, the true number of output tokens in each block (Cm) is used, whereas at actual test time only the predictor's output can be used. However, because the accuracy of predicting how many activation points each block contains can be very high, the mismatch between training and testing is very low compared with the MoCHA system and has essentially no impact on recognition performance.
It should also be noted that, in a speech stream, the average duration of one character is usually about 200 ms; that is, while a user is speaking, the pronunciation of each word may last about 200 ms (of course, the actual duration may differ between people because of different speaking rates). If 60 ms corresponds to one frame, the pronunciation of the same modeling unit (for example, one Chinese character, or one English word) may be distributed over multiple consecutive frames. In practice, however, the same modeling unit usually only needs to be decoded and output at one of those frames, with the features of the surrounding frames associated with that frame. In the embodiments of this application, multiple frames are grouped into one block, so the frames of the same modeling unit may be split across different blocks. Therefore, to avoid different frames of the same modeling unit being recognized as activation points in different blocks, this situation can be taken into account when training the prediction model; that is, training samples corresponding to this situation, with the corresponding annotation, can be included, so that the trained model handles it correctly at test time.
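One possible, purely illustrative realization of such a prediction model is a small classifier over the pooled frame encodings of a block; the mean-pooling, the layer sizes, the assumed upper bound on activation points per block, and the cross-entropy objective below are assumptions, not specifics of this application:

```python
import torch
import torch.nn as nn

MAX_TOKENS_PER_BLOCK = 3   # assumed upper bound on activation points per block
ENC_DIM = 256

class ActivationCountPredictor(nn.Module):
    """Predicts how many activation points (0..MAX) a block of frame encodings contains."""
    def __init__(self, enc_dim: int = ENC_DIM, max_count: int = MAX_TOKENS_PER_BLOCK):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(enc_dim, 128), nn.ReLU(),
            nn.Linear(128, max_count + 1),   # classes: 0, 1, ..., max_count
        )

    def forward(self, block: torch.Tensor) -> torch.Tensor:
        # block: (block_size, enc_dim); mean-pool over frames, then classify.
        pooled = block.mean(dim=0)
        return self.classifier(pooled)       # logits over possible counts

# Minimal training step on one annotated block (count label = 2).
model = ActivationCountPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

block_encodings = torch.randn(5, ENC_DIM)    # encodings of one 5-frame block
label = torch.tensor(2)                      # annotated number of activation points
logits = model(block_encodings)
loss = loss_fn(logits.unsqueeze(0), label.unsqueeze(0))
loss.backward()
optimizer.step()
print("predicted count:", logits.argmax().item())
```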
S303: Determine, according to the prediction result, the locations of the activation points requiring decoding output, so that the decoder decodes at the locations of the activation points and outputs a recognition result.
After the number of activation points requiring decoding output in each block is predicted, the locations of the activation points can be further determined so that the decoder decodes at those locations and outputs the recognition result. In a specific implementation, if a block includes only the encoding result of a single frame of the speech stream, then when the number of activation points in each block is predicted, the result is either 0 or 1; the problem of predicting the number of activation points in a block thus becomes the problem of predicting whether each block contains an activation point requiring decoding output. That is, the specific prediction result may be whether the current block contains an activation point requiring decoding output. In this case, the position of the block containing the activation point can be directly determined as the position of the activation point. Specific prediction results can be as shown in Table 1:
Table 1
[Table 1 is provided as an image in the original publication.]
It can be seen that, when blocks are formed per frame, once the prediction of the number of activation points for a block is completed, the specific position of the activation point can be determined directly. Alternatively, in another approach, because the Attention coefficient can also to some extent reflect whether a frame is an activation point, or the probability that a frame is an activation point, in a specific implementation the Attention coefficient of each frame's encoding result can additionally be determined, where the Attention coefficient describes the probability that the corresponding frame requires decoding output. The prediction result can then be verified against the Attention coefficients. For example, a threshold for the Attention coefficient can be set in advance; if the block-level prediction indicates that a frame is an activation point and that frame's Attention coefficient is also greater than the threshold, the confidence that the frame is an activation point can be further increased, and so on. Conversely, if the block-level prediction indicates that a frame is an activation point but its computed Attention coefficient is very low, the prediction module can re-predict by adjusting its strategy, for example by incorporating the features of more surrounding frames, and so on. Of course, this approach still uses an Attention coefficient threshold, but because it is only used to verify the block-level prediction results, its impact on overall system performance is small.
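The verification idea for per-frame blocks can be sketched as follows; the threshold value of 0.5 and the re-prediction fallback are illustrative assumptions:

```python
ATTENTION_THRESHOLD = 0.5   # assumed verification threshold

def verify_activation(predicted_is_activation: bool,
                      attention_coeff: float,
                      repredict=None) -> bool:
    """Cross-check the block-level prediction for a single-frame block against
    the frame's Attention coefficient; optionally re-predict on disagreement."""
    if predicted_is_activation and attention_coeff >= ATTENTION_THRESHOLD:
        return True                  # prediction and Attention agree: keep it
    if predicted_is_activation and repredict is not None:
        return repredict()           # low Attention: re-predict with more context
    return predicted_is_activation

# Example: the prediction says "activation point" but the Attention coefficient is low.
print(verify_activation(True, 0.12, repredict=lambda: False))   # -> False
print(verify_activation(True, 0.91))                            # -> True
```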
In another approach, the same block may include the encoding results of multiple frames of the speech stream. In this case, the prediction module can only predict how many activation points the block contains, but cannot directly determine at which frame within the block each activation point lies. Therefore, in a specific implementation, the Attention coefficients of the individual frames can also be combined to determine the activation point positions. Specifically, the Attention coefficient of each frame's encoding result can first be determined. Then, according to the number of activation points contained in the block, the positions of that many frames with the highest Attention coefficients among the frames of the block are determined as the positions of the activation points. In other words, if a block is predicted to contain two activation points, the positions of the two frames with the highest Attention coefficients in that block are determined as the positions of the two activation points. For example, the specific prediction of the number of activation points, the Attention coefficients, and the determined activation point positions can be as shown in Table 2:
Table 2
[Table 2 is provided as an image in the original publication; its contents are described below.]
In the table above, every 5 frames form one block: frames 0-4 form one block, frames 5-9 form the next block, and so on. Suppose the prediction module predicts that the first block contains 1 activation point, and the Attention coefficients of frames 0-4 are computed as 0.01, 0.22, 0.78, 0.95, and 0.75 respectively; then the position of the frame with the highest Attention coefficient among frames 0-4, namely frame 3, is determined as the activation point, and the other frames in the block are not activation points and require no decoding output. Similarly, suppose the prediction module predicts that the second block contains 2 activation points, and the Attention coefficients of frames 5-9 are computed as 0.63, 0.88, 0.72, 0.58, and 0.93 respectively; then the positions of the two frames with the highest Attention coefficients among frames 5-9, namely frames 6 and 9, are determined as the activation points, and the other frames in the block are not activation points and require no decoding output.
It can be seen that, with the approach described in the embodiments of this application, judging the positions of activation points does not require comparing Attention coefficients with a preset threshold; instead, given the predicted number of activation points in a block, the Attention coefficients are compared among the frames within that block, and the frames with the largest coefficients (as many as the predicted count) are taken as the activation point positions. Because both training and testing can be carried out uniformly in this way, the match between training and testing is improved and the impact on system performance is reduced. In addition, because the Attention coefficient comparison is performed within the same block and is not affected by future frames, the determined activation point positions are also relatively accurate.
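The position-selection rule illustrated by Table 2 can be expressed as the following sketch, which reproduces the numbers of the example above; the helper function itself is illustrative only:

```python
def activation_positions(attention_coeffs, predicted_count, block_start=0):
    """Return the absolute frame indices of the `predicted_count` frames with
    the highest Attention coefficients inside one block."""
    ranked = sorted(range(len(attention_coeffs)),
                    key=lambda i: attention_coeffs[i],
                    reverse=True)
    return sorted(block_start + i for i in ranked[:predicted_count])

# First block (frames 0-4): 1 activation point predicted.
print(activation_positions([0.01, 0.22, 0.78, 0.95, 0.75], 1, block_start=0))  # [3]
# Second block (frames 5-9): 2 activation points predicted.
print(activation_positions([0.63, 0.88, 0.72, 0.58, 0.93], 2, block_start=5))  # [6, 9]
```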
It should be noted that, in a specific implementation, the block size may be preset, or an initial value may be preset and then dynamically adjusted according to the actual speech stream during testing. Specifically, as mentioned above, because different users speak at different rates, the number and density of modeling units (e.g., the number of Chinese characters) input within the same length of time may differ. For this reason, in a specific implementation, the block size can also be adjusted adaptively according to the observed frequency of predicted activation points. For example, if a high frequency of activation points is found during prediction, the block can be shrunk to shorten the delay; conversely, the block can be enlarged, so that the recognition delay of the system follows the speaker's speaking rate.
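The adaptive adjustment described above could, for example, follow a simple rate-based rule such as the sketch below; the rate thresholds, step size, and size bounds are assumptions for illustration:

```python
def adjust_block_size(block_size, recent_counts,
                      high=0.4, low=0.1, step=1,
                      min_size=1, max_size=10):
    """Shrink the block when activation points appear frequently (to cut latency),
    and enlarge it when they are sparse; rate is activation points per frame."""
    rate = sum(recent_counts) / (len(recent_counts) * block_size)
    if rate > high:
        return max(min_size, block_size - step)
    if rate < low:
        return min(max_size, block_size + step)
    return block_size

# Fast speech (many tokens per block) -> shrink; slow speech -> enlarge.
print(adjust_block_size(5, recent_counts=[2, 3, 2, 3]))   # -> 4
print(adjust_block_size(5, recent_counts=[0, 1, 0, 0]))   # -> 6
```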
In summary, through the embodiments of this application, in the process of recognizing a speech stream, the frames whose encoding has been completed can be divided into blocks, and the number of activation points requiring decoding output contained in each block can be predicted. The specific positions of the activation points can then be determined within a block according to the prediction result, and the decoder is guided to decode and output the recognition result at the corresponding activation point positions. In this way, because it is no longer necessary to compare Attention coefficients with a threshold to determine activation point positions, and the decision is not affected by future frames, accuracy can be improved. In addition, because predicting the number of activation points contained in a block can readily be done with high accuracy, the mismatch between training and prediction is relatively low, which improves the robustness of the streaming end-to-end speech recognition system to noise and keeps the impact on system performance small.
Embodiment 2
Embodiment 2 provides a method for building a prediction model. Referring to Figure 4, the method may specifically include:
S401: Obtain a training sample set, where the training sample set includes multiple pieces of block data and annotation information; each piece of block data includes encoding results obtained by separately encoding multiple frames of a speech stream, and the annotation information includes the number of activation points requiring decoding output contained in each block.
S402: Input the training sample set into the prediction model for model training.
In a specific implementation, the training sample set may include cases in which multiple frames of a speech stream corresponding to the same modeling unit are divided into different blocks. In this way, the case in which the same character or other modeling unit is split across several blocks is covered in training, so that accurate prediction results can be obtained when the same situation is encountered during testing.
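As an illustrative sketch of how the annotation in S401 might be derived when a frame-level alignment of token-output positions is available (the alignment source and block size are assumptions; this application does not prescribe a particular labelling procedure):

```python
def make_block_labels(activation_frames, num_frames, block_size):
    """Count, for each block of `block_size` frames, how many annotated
    activation (token-output) frames fall inside it."""
    num_blocks = (num_frames + block_size - 1) // block_size
    labels = [0] * num_blocks
    for f in activation_frames:
        labels[f // block_size] += 1
    return labels

# Example: a 10-frame utterance whose tokens are output at frames 3, 6 and 9,
# split into 5-frame blocks -> labels [1, 2], matching the Table 2 example.
print(make_block_labels(activation_frames=[3, 6, 9], num_frames=10, block_size=5))
```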
Embodiment 3
Embodiment 3 introduces the scenario in which the solution provided by the embodiments of this application is applied in a cloud service system. Specifically, Embodiment 3 first provides, from the perspective of the cloud server, a method for providing a speech recognition service. Referring to Figure 5, the method may specifically include:
S501: After receiving a call request from an application system, the cloud service system receives a speech stream provided by the application system.
S502: Perform speech acoustic feature extraction and encoding on the received speech stream in units of frames.
S503: Perform block processing on frames whose encoding has been completed, and predict the number of activation points requiring decoding output contained in the same block.
S504: Determine, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points to obtain a speech recognition result.
S505: Return the speech recognition result to the application system.
Embodiment 4
Embodiment 4 corresponds to Embodiment 3 and provides, from the perspective of the application system, a method for obtaining speech recognition information. Referring to Figure 6, the method may specifically include:
S601: The application system submits, by calling an interface provided by a cloud service system, a call request and a speech stream to be recognized to the cloud service system; the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream in units of frames, performs block processing on frames whose encoding has been completed, and predicts the number of activation points requiring decoding output contained in the same block; after determining according to the prediction result the locations of the activation points requiring decoding output, a decoder decodes at the locations of the activation points to obtain a speech recognition result.
S602: Receive the speech recognition result returned by the cloud service system.
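Purely as a hypothetical sketch of the application-side call in S601 (the endpoint URL, headers, and response field below are invented for illustration; this application does not define a concrete interface format):

```python
import json
import urllib.request

CLOUD_ASR_ENDPOINT = "https://example.com/cloud-asr/streaming"   # hypothetical endpoint

def recognize_chunk(audio_chunk: bytes, session_id: str) -> str:
    """Submit one chunk of the speech stream to a (hypothetical) cloud
    streaming-recognition interface and return the partial recognition result."""
    req = urllib.request.Request(
        CLOUD_ASR_ENDPOINT,
        data=audio_chunk,
        headers={"Content-Type": "application/octet-stream",
                 "X-Session-Id": session_id},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["partial_text"]   # hypothetical response field
```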
Embodiment 5
Embodiment 5 introduces the application scenario in which the solution provided by the embodiments of this application is used in a court's self-service case-filing all-in-one machine. Specifically, referring to Figure 7, Embodiment 5 provides a method for implementing self-service court case filing, which may include:
S701: The self-service case-filing all-in-one machine receives case-filing request information input by voice.
S702: Perform speech acoustic feature extraction and encoding on the received speech stream in units of frames.
S703: Perform block processing on frames whose encoding has been completed, and predict the number of activation points requiring decoding output contained in the same block.
S704: Determine, according to the prediction result, the locations of the activation points requiring decoding output, so that a decoder decodes at the locations of the activation points and determines a recognition result.
S705: Enter the recognition result into an associated case-filing information database.
Embodiment 6
The foregoing embodiments introduced the streaming speech recognition method provided by the embodiments of this application and its application in specific scenarios. For application scenarios on hardware devices such as smart speakers, the functions provided by the embodiments of this application may not yet have been implemented when a user bought a particular device, so relatively "old" hardware devices can only perform speech recognition in the traditional way. To allow such "old" hardware devices to also perform streaming speech recognition in the new way and improve the user experience, an upgrade scheme can be provided for terminal devices. For example, the streaming speech recognition processing flow can be provided on the server side, and the hardware device side only needs to submit the collected user voice stream to the server. In this case, the models and other resources needed for speech recognition only have to be stored on the server, and the terminal device can usually be upgraded without any hardware modification. Of course, since streaming speech recognition usually involves collecting user data and submitting it to the server, the server may first push an upgrade suggestion to the hardware device; if the user wants to upgrade the device, the user can express this demand by voice input or other means, after which the upgrade request is submitted to the server and processed by the server. In a specific implementation, the server may also check the status of the hardware device, for example whether the associated user has already paid the corresponding resources for the upgraded service, and if so, grant the device permission to perform streaming speech recognition in the upgraded way. The hardware device can then perform streaming speech recognition in the way provided by the embodiments of this application during subsequent dialogues with the user. The streaming speech recognition function itself may be completed on the server side, or, where the device's own hardware resources allow it, the upgraded recognition model may be pushed directly to the hardware device so that streaming speech recognition is completed locally on the device.
In addition, where the model is stored on the server side, a "switch" function may also be provided so that the user uses the above function only when necessary, for purposes such as saving resources. For example, when the user only needs the device in a home scenario, where the requirements on recognition accuracy are not high, the user can submit a request to turn off the advanced function (i.e. the recognition method provided by the embodiments of this application) by issuing a voice command or the like; the server can then temporarily turn the function off for that user and, if billing is involved, billing can also be stopped. The hardware device can fall back to the original way of performing streaming speech recognition; it may even be acceptable to wait until the user has finished a sentence before recognizing it. Later, if the user needs to use the hardware device in a work scenario, the advanced function provided by the embodiments of this application can be turned on again, and so on.
Specifically, Embodiment 6 of this application provides a device upgrade method. Referring to Fig. 8, the method may include:
S801: providing upgrade suggestion information to a terminal device;
S802: after receiving an upgrade request submitted by the terminal device, granting the terminal device permission to perform streaming speech recognition in an upgraded way, where performing streaming speech recognition in the upgraded way includes: extracting acoustic speech features from the received voice stream frame by frame and encoding them; grouping the encoded frames into blocks and predicting the number of activation points contained in a block that need to be decoded and output; and, after determining, according to the prediction result, the positions of the activation points that need to be decoded and output, decoding at the positions of the activation points via a decoder to obtain a speech recognition result.
The terminal device may specifically include a smart speaker device or the like.
In a specific implementation, the permission to perform streaming speech recognition in the upgraded way may also be revoked for the terminal device according to a downgrade request submitted by the terminal device.
For the parts of Embodiments 2 to 6 that are not described in detail, reference may be made to the description of Embodiment 1, which is not repeated here.
It should be noted that the embodiments of this application may involve the use of user data. In practical applications, user-specific personal data may be used in the solutions described herein to the extent permitted by the applicable laws and regulations of the country concerned and where their requirements are met (for example, the user explicitly consents and is duly informed).
Corresponding to Embodiment 1, an embodiment of this application further provides a streaming end-to-end speech recognition apparatus. Referring to Fig. 9, the apparatus may specifically include:
an encoding unit 901, configured to extract acoustic speech features from a received voice stream frame by frame and encode them;
a prediction unit 902, configured to group the encoded frames into blocks and predict the number of activation points contained in a block that need to be decoded and output;
an activation point position determining unit 903, configured to determine, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points and outputs a recognition result.
In one mode, a block includes the encoding result corresponding to one frame of the voice stream;
the prediction result includes: whether the current block contains an activation point that needs to be decoded and output;
the activation point position determining unit may specifically be configured to:
determine the position of the block containing the activation point as the position of the activation point.
In this case, the apparatus may further include:
an attention coefficient determining unit, configured to determine the attention coefficient of each frame's encoding result, the attention coefficient describing the probability that the corresponding frame needs to be decoded and output;
a verification unit, configured to verify the prediction result according to the attention coefficient.
Alternatively, a block includes the encoding results corresponding to multiple frames of the voice stream;
in this case, the apparatus may further include:
an attention determining unit, configured to determine the attention coefficient of each frame's encoding result, the attention coefficient describing the probability that the corresponding frame needs to be decoded and output;
the activation point position determining unit may specifically be configured to:
compare the attention coefficients of the frames within the same block and sort them by magnitude;
determine, according to the number of activation points contained in the block, the positions of the frames whose encoding results have the highest attention coefficients in the block, as many frames as there are activation points, as the positions of the activation points.
In this case, the apparatus may further include:
a block adjusting unit, configured to adaptively adjust the block size according to the observed frequency of the predicted activation points.
The prediction unit may specifically include:
a caching subunit, configured to cache the encoding results;
a block determining subunit, configured to determine, when the number of cached encoding result frames reaches the block size, the currently cached frame encoding results as one block.
Specifically, the apparatus may further be configured to:
delete, after the prediction processing of a block is completed, the frame encoding results of that block from the cache.
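For illustration, a minimal Python sketch of such a caching subunit and block-determining subunit is shown below; the class and method names are hypothetical and do not correspond to named components of this application.

```python
class BlockCache:
    """Hypothetical caching helper: collects per-frame encoding results and
    releases one block once the block size is reached."""

    def __init__(self, block_size=5):
        self.block_size = block_size
        self._cache = []

    def add(self, frame_encoding):
        """Cache one frame's encoding result; return a full block or None."""
        self._cache.append(frame_encoding)
        if len(self._cache) >= self.block_size:
            block = self._cache[:self.block_size]
            # Once the block has been handed over for prediction,
            # its frames are removed from the cache.
            self._cache = self._cache[self.block_size:]
            return block
        return None
```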
Corresponding to Embodiment 2, an embodiment of this application further provides an apparatus for building a prediction model. Referring to Fig. 10, the apparatus includes:
a training sample set obtaining unit 1001, configured to obtain a training sample set, the training sample set including multiple pieces of block data and annotation information, where each piece of block data includes the encoding results obtained by separately encoding multiple frames of a voice stream, and the annotation information includes the number of activation points contained in each block that need to be decoded and output;
an input unit 1002, configured to input the training sample set into the prediction model to train the model.
The training sample set includes cases in which the multiple voice-stream frames corresponding to one modeling unit are split across different blocks.
Corresponding to Embodiment 3, an embodiment of this application further provides an apparatus for providing a speech recognition service. Referring to Fig. 11, the apparatus is applied to a cloud service system and includes:
a voice stream receiving unit 1101, configured to receive, after an invocation request from an application system is received, a voice stream provided by the application system;
an encoding unit 1102, configured to extract acoustic speech features from the received voice stream frame by frame and encode them;
a prediction unit 1103, configured to group the encoded frames into blocks and predict the number of activation points contained in a block that need to be decoded and output;
an activation point position determining unit 1104, configured to determine, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points to obtain a speech recognition result;
a recognition result returning unit 1105, configured to return the speech recognition result to the application system.
Corresponding to Embodiment 4, an embodiment of this application further provides an apparatus for obtaining speech recognition information. Referring to Fig. 12, the apparatus is applied to an application system and includes:
a submitting unit 1201, configured to submit, by calling an interface provided by a cloud service system, an invocation request and a voice stream to be recognized to the cloud service system, so that the cloud service system extracts acoustic speech features from the received voice stream frame by frame and encodes them, groups the encoded frames into blocks, and predicts the number of activation points contained in a block that need to be decoded and output, and, after determining, according to the prediction result, the positions of the activation points that need to be decoded and output, decodes at the positions of the activation points via a decoder to obtain a speech recognition result;
a recognition result receiving unit 1202, configured to receive the speech recognition result returned by the cloud service system.
Corresponding to Embodiment 5, an embodiment of this application further provides an apparatus for implementing court self-service case filing. Referring to Fig. 13, the apparatus is applied to a self-service case-filing all-in-one machine and includes:
a request receiving unit 1301, configured to receive case-filing request information input by voice;
an encoding unit 1302, configured to extract acoustic speech features from the received voice stream frame by frame and encode them;
a prediction unit 1303, configured to group the encoded frames into blocks and predict the number of activation points contained in a block that need to be decoded and output;
an activation point position determining unit 1304, configured to determine, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points and determines a recognition result;
an information entry unit 1305, configured to enter the recognition result into an associated case-filing information database.
Corresponding to Embodiment 6, an embodiment of this application further provides a terminal device upgrade apparatus. Referring to Fig. 14, the apparatus may include:
an upgrade suggestion providing unit 1401, configured to provide upgrade suggestion information to a terminal device;
a permission granting unit 1402, configured to grant, after receiving an upgrade request submitted by the terminal device, the terminal device permission to perform streaming speech recognition in an upgraded way, where performing streaming speech recognition in the upgraded way includes: extracting acoustic speech features from the received voice stream frame by frame and encoding them; grouping the encoded frames into blocks and predicting the number of activation points contained in a block that need to be decoded and output; and, after determining, according to the prediction result, the positions of the activation points that need to be decoded and output, decoding at the positions of the activation points via a decoder to obtain a speech recognition result.
In addition, an embodiment of this application further provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of the method described in any of the foregoing method embodiments.
Also provided is an electronic device, including:
one or more processors; and a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the steps of the method described in any of the foregoing method embodiments.
Fig. 15 exemplarily shows the architecture of the electronic device. For example, the device 1500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, an aircraft, or the like.
Referring to Fig. 15, the device 1500 may include one or more of the following components: a processing component 1502, a memory 1504, a power component 1506, a multimedia component 1508, an audio component 1510, an input/output (I/O) interface 1512, a sensor component 1514, and a communication component 1516.
The processing component 1502 generally controls the overall operation of the device 1500, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 1502 may include one or more processors 1520 to execute instructions so as to complete all or part of the steps of the method provided by the technical solution of this disclosure. In addition, the processing component 1502 may include one or more modules to facilitate interaction between the processing component 1502 and other components. For example, the processing component 1502 may include a multimedia module to facilitate interaction between the multimedia component 1508 and the processing component 1502.
The memory 1504 is configured to store various types of data to support operation on the device 1500. Examples of such data include instructions for any application or method operated on the device 1500, contact data, phonebook data, messages, pictures, videos, and so on. The memory 1504 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power component 1506 supplies power to the various components of the device 1500. The power component 1506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 1500.
The multimedia component 1508 includes a screen that provides an output interface between the device 1500 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1508 includes a front camera and/or a rear camera. When the device 1500 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 1510 is configured to output and/or input audio signals. For example, the audio component 1510 includes a microphone (MIC), which is configured to receive external audio signals when the device 1500 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 1504 or transmitted via the communication component 1516. In some embodiments, the audio component 1510 further includes a speaker for outputting audio signals.
The input/output interface 1512 provides an interface between the processing component 1502 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 1514 includes one or more sensors for providing status assessments of various aspects of the device 1500. For example, the sensor component 1514 may detect the open/closed state of the device 1500 and the relative positioning of components, for example the display and the keypad of the device 1500; the sensor component 1514 may also detect a change in position of the device 1500 or of one of its components, the presence or absence of user contact with the device 1500, the orientation or acceleration/deceleration of the device 1500, and temperature changes of the device 1500. The sensor component 1514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 1514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1516 is configured to facilitate wired or wireless communication between the device 1500 and other devices. The device 1500 may access a wireless network based on a communication standard, such as WiFi, or a mobile communication network such as 2G, 3G, 4G/LTE, or 5G. In an exemplary embodiment, the communication component 1516 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1516 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 1500 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 1504 including instructions, where the instructions are executable by the processor 1520 of the device 1500 to complete the method provided by the technical solution of this disclosure. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
From the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system or system embodiments are basically similar to the method embodiments, they are described relatively simply, and for relevant parts reference may be made to the description of the method embodiments. The systems and system embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The streaming end-to-end speech recognition method, apparatus and electronic device provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application according to the idea of this application. In summary, the content of this specification should not be construed as limiting this application.

Claims (23)

  1. A streaming end-to-end speech recognition method, comprising:
    extracting acoustic speech features from a received voice stream frame by frame and encoding them;
    grouping the encoded frames into blocks, and predicting the number of activation points contained in a block that need to be decoded and output;
    determining, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points and outputs a recognition result.
  2. The method according to claim 1, wherein
    the block includes the encoding result corresponding to one frame of the voice stream;
    the prediction result includes: whether the current block contains an activation point that needs to be decoded and output;
    the determining, according to the prediction result, the positions of the activation points that need to be decoded and output comprises:
    determining the position of the block containing the activation point as the position of the activation point.
  3. The method according to claim 2, further comprising:
    determining an attention coefficient of each frame's encoding result, the attention coefficient describing the probability that the corresponding frame needs to be decoded and output;
    verifying the prediction result according to the attention coefficient.
  4. The method according to claim 1, wherein
    the block includes the encoding results corresponding to multiple frames of the voice stream;
    the method further comprises:
    determining an attention coefficient of each frame's encoding result, the attention coefficient describing the probability that the corresponding frame needs to be decoded and output;
    the determining, according to the prediction result, the positions of the activation points that need to be decoded and output comprises:
    comparing the attention coefficients of the frames within the same block and sorting them by magnitude;
    determining, according to the number of activation points contained in the block, the positions of the frames whose encoding results have the highest attention coefficients in the block, as many frames as there are activation points, as the positions of the activation points.
  5. The method according to claim 4, further comprising:
    adaptively adjusting the block size according to the observed frequency of the predicted activation points.
  6. The method according to any one of claims 1 to 5, wherein
    the grouping of the encoding results into blocks comprises:
    caching the encoding results;
    determining, when the number of cached encoding result frames reaches the block size, the currently cached frame encoding results as one block.
  7. The method according to claim 6, further comprising:
    deleting, after the prediction processing of the block is completed, the frame encoding results of the block from the cache.
  8. A method for building a prediction model, comprising:
    obtaining a training sample set, the training sample set including multiple pieces of block data and annotation information, wherein each piece of block data includes the encoding results obtained by separately encoding multiple frames of a voice stream, and the annotation information includes the number of activation points contained in each block that need to be decoded and output;
    inputting the training sample set into the prediction model to train the model.
  9. The method according to claim 8, wherein
    the training sample set includes cases in which the multiple voice-stream frames corresponding to one modeling unit are split across different blocks.
  10. A method for providing a speech recognition service, comprising:
    receiving, by a cloud service system after an invocation request from an application system is received, a voice stream provided by the application system;
    extracting acoustic speech features from the received voice stream frame by frame and encoding them;
    grouping the encoded frames into blocks, and predicting the number of activation points contained in a block that need to be decoded and output;
    determining, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points to obtain a speech recognition result;
    returning the speech recognition result to the application system.
  11. A method for obtaining speech recognition information, comprising:
    submitting, by an application system by calling an interface provided by a cloud service system, an invocation request and a voice stream to be recognized to the cloud service system, so that the cloud service system extracts acoustic speech features from the received voice stream frame by frame and encodes them, groups the encoded frames into blocks, and predicts the number of activation points contained in a block that need to be decoded and output, and, after determining, according to the prediction result, the positions of the activation points that need to be decoded and output, decodes at the positions of the activation points via a decoder to obtain a speech recognition result;
    receiving the speech recognition result returned by the cloud service system.
  12. A method for implementing court self-service case filing, comprising:
    receiving, by a self-service case-filing all-in-one machine, case-filing request information input by voice;
    extracting acoustic speech features from the received voice stream frame by frame and encoding them;
    grouping the encoded frames into blocks, and predicting the number of activation points contained in a block that need to be decoded and output;
    determining, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points and determines a recognition result;
    entering the recognition result into an associated case-filing information database.
  13. A terminal device upgrade method, comprising:
    providing upgrade suggestion information to a terminal device;
    granting, after receiving an upgrade request submitted by the terminal device, the terminal device permission to perform streaming speech recognition in an upgraded way, wherein performing streaming speech recognition in the upgraded way comprises: extracting acoustic speech features from the received voice stream frame by frame and encoding them; grouping the encoded frames into blocks and predicting the number of activation points contained in a block that need to be decoded and output; and, after determining, according to the prediction result, the positions of the activation points that need to be decoded and output, decoding at the positions of the activation points via a decoder to obtain a speech recognition result.
  14. The method according to claim 13, wherein
    the terminal device includes a smart speaker device.
  15. The method according to claim 13, further comprising:
    revoking, for the terminal device, the permission to perform streaming speech recognition in the upgraded way according to a downgrade request submitted by the terminal device.
  16. A streaming end-to-end speech recognition apparatus, comprising:
    an encoding unit, configured to extract acoustic speech features from a received voice stream frame by frame and encode them;
    a prediction unit, configured to group the encoded frames into blocks and predict the number of activation points contained in a block that need to be decoded and output;
    an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points and outputs a recognition result.
  17. An apparatus for building a prediction model, comprising:
    a training sample set obtaining unit, configured to obtain a training sample set, the training sample set including multiple pieces of block data and annotation information, wherein each piece of block data includes the encoding results obtained by separately encoding multiple frames of a voice stream, and the annotation information includes the number of activation points contained in each block that need to be decoded and output;
    an input unit, configured to input the training sample set into the prediction model to train the model.
  18. An apparatus for providing a speech recognition service, applied to a cloud service system and comprising:
    a voice stream receiving unit, configured to receive, after an invocation request from an application system is received, a voice stream provided by the application system;
    an encoding unit, configured to extract acoustic speech features from the received voice stream frame by frame and encode them;
    a prediction unit, configured to group the encoded frames into blocks and predict the number of activation points contained in a block that need to be decoded and output;
    an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points to obtain a speech recognition result;
    a recognition result returning unit, configured to return the speech recognition result to the application system.
  19. An apparatus for obtaining speech recognition information, applied to an application system and comprising:
    a submitting unit, configured to submit, by calling an interface provided by a cloud service system, an invocation request and a voice stream to be recognized to the cloud service system, so that the cloud service system extracts acoustic speech features from the received voice stream frame by frame and encodes them, groups the encoded frames into blocks, and predicts the number of activation points contained in a block that need to be decoded and output, and, after determining, according to the prediction result, the positions of the activation points that need to be decoded and output, decodes at the positions of the activation points via a decoder to obtain a speech recognition result;
    a recognition result receiving unit, configured to receive the speech recognition result returned by the cloud service system.
  20. An apparatus for implementing court self-service case filing, applied to a self-service case-filing all-in-one machine and comprising:
    a request receiving unit, configured to receive case-filing request information input by voice;
    an encoding unit, configured to extract acoustic speech features from the received voice stream frame by frame and encode them;
    a prediction unit, configured to group the encoded frames into blocks and predict the number of activation points contained in a block that need to be decoded and output;
    an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that need to be decoded and output, so that a decoder decodes at the positions of the activation points and determines a recognition result;
    an information entry unit, configured to enter the recognition result into an associated case-filing information database.
  21. A terminal device upgrade apparatus, comprising:
    an upgrade suggestion providing unit, configured to provide upgrade suggestion information to a terminal device;
    a permission granting unit, configured to grant, after receiving an upgrade request submitted by the terminal device, the terminal device permission to perform streaming speech recognition in an upgraded way, wherein performing streaming speech recognition in the upgraded way comprises: extracting acoustic speech features from the received voice stream frame by frame and encoding them; grouping the encoded frames into blocks and predicting the number of activation points contained in a block that need to be decoded and output; and, after determining, according to the prediction result, the positions of the activation points that need to be decoded and output, decoding at the positions of the activation points via a decoder to obtain a speech recognition result.
  22. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 15.
  23. An electronic device, comprising:
    one or more processors; and
    a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the steps of the method according to any one of claims 1 to 15.
PCT/CN2021/089556 2020-04-30 2021-04-25 流式端到端语音识别方法、装置及电子设备 WO2021218843A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21796134.1A EP4145442A4 (en) 2020-04-30 2021-04-25 METHOD AND APPARATUS FOR DETECTING STREAMING END-TO-END SPEECH AND ELECTRONIC DEVICE
US17/976,464 US20230064756A1 (en) 2020-04-30 2022-10-28 Streaming End-to-End Speech Recognition Method, Apparatus and Electronic Device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010366907.5 2020-04-30
CN202010366907.5A CN113593539A (zh) 2020-04-30 2020-04-30 流式端到端语音识别方法、装置及电子设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/976,464 Continuation US20230064756A1 (en) 2020-04-30 2022-10-28 Streaming End-to-End Speech Recognition Method, Apparatus and Electronic Device

Publications (1)

Publication Number Publication Date
WO2021218843A1 true WO2021218843A1 (zh) 2021-11-04

Family

ID=78237580

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/089556 WO2021218843A1 (zh) 2020-04-30 2021-04-25 流式端到端语音识别方法、装置及电子设备

Country Status (4)

Country Link
US (1) US20230064756A1 (zh)
EP (1) EP4145442A4 (zh)
CN (1) CN113593539A (zh)
WO (1) WO2021218843A1 (zh)


Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6711536B2 (en) * 1998-10-20 2004-03-23 Canon Kabushiki Kaisha Speech processing apparatus and method
CN101478616A (zh) * 2008-12-19 2009-07-08 深圳市神舟电脑股份有限公司 一种即时语音通信方法
KR20170007107A (ko) * 2015-07-10 2017-01-18 한국전자통신연구원 음성인식 시스템 및 방법
CN105513589B (zh) * 2015-12-18 2020-04-28 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN107919116B (zh) * 2016-10-11 2019-09-13 芋头科技(杭州)有限公司 一种语音激活检测方法及装置
CN106601228B (zh) * 2016-12-09 2020-02-04 百度在线网络技术(北京)有限公司 基于人工智能韵律预测的样本标注方法及装置
CN108737841B (zh) * 2017-04-21 2020-11-24 腾讯科技(深圳)有限公司 编码单元深度确定方法及装置
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
US10210648B2 (en) * 2017-05-16 2019-02-19 Apple Inc. Emojicon puppeting
CN107622769B (zh) * 2017-08-28 2021-04-06 科大讯飞股份有限公司 号码修改方法及装置、存储介质、电子设备
CN109697977B (zh) * 2017-10-23 2023-10-31 三星电子株式会社 语音识别方法和设备
US10937414B2 (en) * 2018-05-08 2021-03-02 Facebook Technologies, Llc Systems and methods for text input using neuromuscular information
EP3766065A1 (en) * 2018-05-18 2021-01-20 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
US11145293B2 (en) * 2018-07-20 2021-10-12 Google Llc Speech recognition with sequence-to-sequence models
CN110265035B (zh) * 2019-04-25 2021-08-06 武汉大晟极科技有限公司 一种基于深度学习的说话人识别方法
CN110111775B (zh) * 2019-05-17 2021-06-22 腾讯科技(深圳)有限公司 一种流式语音识别方法、装置、设备及存储介质
CN110648658B (zh) * 2019-09-06 2022-04-08 北京达佳互联信息技术有限公司 一种语音识别模型的生成方法、装置及电子设备
CN110634474B (zh) * 2019-09-24 2022-03-25 腾讯科技(深圳)有限公司 一种基于人工智能的语音识别方法和装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100434522B1 (ko) * 1997-04-29 2004-07-16 삼성전자주식회사 시간축 상호관계를 이용한 음성인식 방법
CN105355197A (zh) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 用于语音识别系统的增益处理方法及装置
CN107680597A (zh) * 2017-10-23 2018-02-09 平安科技(深圳)有限公司 语音识别方法、装置、设备以及计算机可读存储介质
CN110473529A (zh) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 一种基于自注意力机制的流式语音转写系统
CN110556099A (zh) * 2019-09-12 2019-12-10 出门问问信息科技有限公司 一种命令词控制方法及设备
CN110689879A (zh) * 2019-10-10 2020-01-14 中国科学院自动化研究所 端到端语音转写模型的训练方法、系统、装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4145442A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968629A (zh) * 2020-07-08 2020-11-20 重庆邮电大学 一种结合Transformer和CNN-DFSMN-CTC的中文语音识别方法
CN114822540A (zh) * 2022-06-29 2022-07-29 广州小鹏汽车科技有限公司 车辆语音交互方法、服务器和存储介质

Also Published As

Publication number Publication date
EP4145442A4 (en) 2024-04-24
EP4145442A1 (en) 2023-03-08
US20230064756A1 (en) 2023-03-02
CN113593539A (zh) 2021-11-02

Similar Documents

Publication Publication Date Title
CN107291690B (zh) 标点添加方法和装置、用于标点添加的装置
US20200265197A1 (en) Language translation device and language translation method
CN113362812B (zh) 一种语音识别方法、装置和电子设备
WO2021077529A1 (zh) 神经网络模型压缩方法、语料翻译方法及其装置
CN107221330B (zh) 标点添加方法和装置、用于标点添加的装置
US20230064756A1 (en) Streaming End-to-End Speech Recognition Method, Apparatus and Electronic Device
KR102475588B1 (ko) 기계 번역을 위한 신경 네트워크 모델 압축 방법, 장치 및 저장 매체
CN110992942B (zh) 一种语音识别方法、装置和用于语音识别的装置
CN107291704B (zh) 处理方法和装置、用于处理的装置
CN108073572B (zh) 信息处理方法及其装置、同声翻译系统
CN107564526B (zh) 处理方法、装置和机器可读介质
CN113362813B (zh) 一种语音识别方法、装置和电子设备
CN111640424B (zh) 一种语音识别方法、装置和电子设备
JP2019533181A (ja) 通訳装置及び方法(device and method of translating a language)
CN107274903B (zh) 文本处理方法和装置、用于文本处理的装置
CN108364635B (zh) 一种语音识别的方法和装置
CN111583923A (zh) 信息控制方法及装置、存储介质
CN113689879A (zh) 实时驱动虚拟人的方法、装置、电子设备及介质
CN110415702A (zh) 训练方法和装置、转换方法和装置
CN107424612B (zh) 处理方法、装置和机器可读介质
CN112735396A (zh) 语音识别纠错方法、装置及存储介质
CN114154459A (zh) 语音识别文本处理方法、装置、电子设备及存储介质
CN112017670B (zh) 一种目标账户音频的识别方法、装置、设备及介质
CN108733657B (zh) 神经机器翻译中注意力参数的修正方法、装置及电子设备
KR20130052800A (ko) 음성 인식 서비스를 제공하는 장치 및 그의 오류 발음 검출 능력 향상을 위한 음성 인식 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21796134

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021796134

Country of ref document: EP

Effective date: 20221130