WO2021147417A1 - Speech recognition method, apparatus, computer device, and computer-readable storage medium - Google Patents

Speech recognition method, apparatus, computer device, and computer-readable storage medium

Info

Publication number
WO2021147417A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
hidden layer
feature
speech
segment
Prior art date
Application number
PCT/CN2020/123738
Other languages
English (en)
French (fr)
Inventor
张玺霖
刘博
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2021147417A1
Priority to US17/709,011 (published as US20220223142A1)

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                        • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
                    • G10L 15/04 Segmentation; Word boundary detection
                    • G10L 15/08 Speech classification or search
                        • G10L 15/16 Speech classification or search using artificial neural networks
                        • G10L 15/18 Speech classification or search using natural language modelling
                            • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
                                • G10L 15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
                    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                    • G10L 15/26 Speech to text systems
                • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L 25/03 characterised by the type of extracted parameters
                        • G10L 25/18 the extracted parameters being spectral information of each sub-band
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/044 Recurrent networks, e.g. Hopfield networks
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods

Definitions

  • This application relates to the field of speech processing, and in particular to a speech recognition method, apparatus, computer device, and computer-readable storage medium.
  • Voice recognition is the process of converting voice data into text output.
  • Speech data is often context-dependent, and using the context information of the speech data during recognition can make the recognition result more accurate.
  • In the related art, a latency-controlled bidirectional long short-term memory network is usually used to obtain the following (future) information. For example, after the latency-controlled bidirectional long short-term memory network receives the currently input speech data, it can delay processing until a segment of the following speech data has been acquired, and then recognize the currently input speech data based on that following data.
  • However, the delay is often short, so little following speech data can be obtained during the delay period; typically only 300 to 600 milliseconds of speech can be acquired. As a result, little following information is available during speech recognition, which in turn affects the accuracy of the recognition result.
  • The embodiments of the present application provide a speech recognition method, apparatus, computer device, and computer-readable storage medium, which can improve the accuracy of the speech recognition result.
  • the technical scheme is as follows:
  • a voice recognition method which is applied to a computer device, and the computer device is provided with a voice recognition model, and the method includes:
  • the voice features of the at least two voice segments are input into a voice recognition model, and the voice features of each voice segment are sequentially processed by n cascaded hidden layers in the voice recognition model to obtain the hidden layer features of each voice segment,
  • the hidden layer feature of the i-th speech segment is determined based on the n speech segments located after the i-th speech segment in time sequence and the voice features of the i-th speech segment;
  • the text information corresponding to the voice data is obtained.
  • a voice recognition method which is applied to a terminal, and the method includes:
  • the text information is obtained by using the voice recognition method as described in the previous aspect
  • the text information is displayed on the target page.
  • a speech recognition device is provided, the device is provided with a speech recognition model, and the device includes:
  • the voice acquisition module is used to acquire the voice data to be recognized
  • the voice feature acquisition module is used to perform feature extraction on the voice data to obtain voice features of at least two voice segments in the voice data;
  • the hidden layer feature acquisition module is used to input the speech features of the at least two speech segments into a speech recognition model, where the n cascaded hidden layers in the speech recognition model sequentially process the speech features of each speech segment to obtain the hidden layer feature of each speech segment, and the hidden layer feature of the i-th speech segment is determined based on the n speech segments located after the i-th speech segment in time sequence and the speech features of the i-th speech segment;
  • the text information acquisition module is used to obtain the text information corresponding to the voice data based on the hidden layer features of each voice segment.
  • a speech recognition device which includes:
  • the voice acquisition module is used to obtain real-time input voice data in response to voice input instructions
  • the fragmentation module is used to perform fragmentation processing on the voice data to obtain at least one voice fragment
  • the text information obtaining module is used to obtain the text information corresponding to each voice segment, and the text information is obtained by using the voice recognition method as described in the previous aspect;
  • the display module is used to display the text information on the target page in response to a voice input completion instruction.
  • In another aspect, a computer device is provided, including at least one processor and at least one memory. At least one piece of program code is stored in the at least one memory, and the at least one piece of program code is loaded and executed by the at least one processor to implement the operations performed by the voice recognition method described in the previous aspect, or the operations performed by the voice recognition method described in the other aspect above.
  • In another aspect, a computer-readable storage medium is provided, and at least one piece of program code is stored in the computer-readable storage medium. The at least one piece of program code is loaded and executed by a processor to implement the operations performed by the voice recognition method described in the previous aspect, or the operations performed by the voice recognition method described in the other aspect above.
  • a computer program product in another aspect, includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device executes the voice recognition method described in the previous aspect or the voice recognition method described in the other aspect above.
  • In the technical solutions provided by the embodiments of the present application, the hidden layer feature of each voice segment is learned based on the features of the subsequent voice segments, so that the hidden layer feature of the voice segment can fully learn the following information. As a result, the language expression of the finally recognized text information is more fluent and the semantics are more accurate, which improves the accuracy of speech recognition.
  • Fig. 1 is a schematic diagram of a speech recognition system provided by an embodiment of the present application.
  • Figure 2 is a flowchart of a voice recognition method provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a speech recognition model expanded in time sequence according to an embodiment of the present application
  • Fig. 4 is a schematic diagram of a hidden layer feature extraction principle provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of a speech recognition model provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a decoder provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a text information display manner provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a method for displaying text information provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a target page provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a text information display effect provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a voice recognition device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • Artificial intelligence (AI) uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech technology, natural language processing technology, and machine learning/deep learning.
  • the technical solutions provided by the embodiments of the present application involve voice technology, machine learning, and so on.
  • Speech technology includes automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a future direction of human-computer interaction, and voice interaction has become one of the most promising human-computer interaction methods.
  • the technical solutions provided by the embodiments of the present application mainly involve automatic speech recognition technology, which converts voice data into text information through the automatic speech recognition technology.
  • Streaming speech recognition, also known as online speech recognition, is a decoding process that performs speech recognition while the voice data is being received as a stream.
  • With streaming recognition, text content can be fed back to the user immediately while the user is still speaking; the interaction is real-time, and it is suitable for online voice dictation.
  • Speech feature: a feature extracted from the input voice data through signal processing techniques and processed by the acoustic model in the form of a feature vector, so as to minimize the impact of environmental noise, channel, speaker, and other factors on recognition.
  • the feature of the frequency spectrum dimension of the voice data is extracted as the voice feature.
  • FIG. 1 is a schematic diagram of a voice recognition system provided by an embodiment of the present application.
  • the voice recognition system 100 includes a terminal 110 and a voice recognition platform 140.
  • the terminal 110 is connected to the voice recognition platform 140 through a wireless network or a wired network.
  • The terminal 110 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and a laptop computer.
  • the terminal 110 installs and runs applications that support voice input and voice recognition.
  • the application can be a social application, an instant messaging application, and so on.
  • the terminal 110 is a terminal used by a user, and a user account is logged in an application program running in the terminal 110.
  • the speech recognition platform 140 includes at least one of a server, multiple servers, a cloud computing platform, and a virtualization center.
  • the voice recognition platform 140 is used to provide background services for applications that support voice recognition.
  • Optionally, the voice recognition platform 140 is responsible for the main voice recognition work and the terminal 110 is responsible for the secondary voice recognition work; or the voice recognition platform 140 is responsible for the secondary voice recognition work and the terminal 110 is responsible for the main voice recognition work; or the voice recognition platform 140 or the terminal 110 can undertake the voice recognition work alone.
  • the voice recognition platform 140 includes: an access server, a voice recognition server, and a database.
  • the access server is used to provide access services for the terminal 110.
  • the voice recognition server is used to provide background services related to voice recognition.
  • When there are multiple voice recognition servers, at least two voice recognition servers are used to provide different services, and/or at least two voice recognition servers are used to provide the same service, for example, to provide the same service in a load-balancing manner.
  • the embodiment of the present application does not limit this.
  • the voice recognition server may be provided with a voice recognition model, a language model, etc.
  • the terminal 110 may generally refer to one of multiple terminals, and this embodiment only uses the terminal 110 as an example for illustration.
  • the number of the aforementioned terminals may be more or less. For example, there may be only one terminal, or there may be dozens or hundreds, or a greater number of terminals.
  • the voice recognition system may also include other terminals.
  • the embodiments of the present application do not limit the number of terminals and device types.
  • Fig. 2 is a flowchart of a voice recognition method provided by an embodiment of the present application. This method can be applied to the above-mentioned terminal or server.
  • In the embodiment of the present application, the server is taken as the execution subject to introduce the voice recognition method. Referring to Fig. 2, this embodiment may specifically include the following steps:
  • the server obtains voice data to be recognized.
  • the voice data may be voice data input by the user in real time, or a piece of voice data stored in the server, or a piece of voice data intercepted from an audio file or a video file.
  • the type of voice data is not limited.
  • the server may be a background server of a target application, and the target application may support voice input.
  • the target application may be an instant messaging application.
  • the terminal can obtain the voice data input by the user in real time, and send the voice data to the server after processing such as fragmentation and packaging.
  • the server executes the subsequent speech recognition steps.
  • After obtaining the voice data to be recognized, the server performs feature extraction on the voice data to obtain the voice features of at least two voice segments in the voice data as input data for the voice recognition model.
  • The voice features of the at least two voice segments can be extracted by using the following steps 202 and 203.
  • the server performs spectrum feature extraction on the voice data, and obtains the spectrum feature corresponding to each voice frame in the voice data.
  • the frequency spectrum feature can be used to indicate the change information of the voice data in the frequency domain.
  • the server may also extract features of other dimensions of the voice data, which is not limited in the embodiment of the present application.
  • the extraction of spectral features is taken as an example for description.
  • the process of acquiring the spectrum feature may specifically include the following steps:
  • Step 1 The server preprocesses the voice data.
  • The preprocessing may include pre-emphasis, framing, windowing, and other processing. This preprocessing can reduce the influence on the quality of the voice data of aliasing, high-order harmonic distortion, high frequency, and other factors caused by the vocal organs and the equipment that collects the voice signal.
  • In a possible implementation, the server can first pre-emphasize the voice data, for example through a high-pass filter, to emphasize the high-frequency part of the voice data and increase the high-frequency resolution, which facilitates subsequent feature extraction. Then, the server can frame the voice data according to a target duration to obtain multiple voice frames, where the target duration can be set by the developer, which is not limited in the embodiment of this application; in the embodiment of this application, the frame shift of each voice frame may be 10 ms. Finally, the server can perform windowing on each voice frame to enhance the continuity between each voice frame and its preceding and following voice frames, for example by applying a window function such as a Hamming window to each voice frame.
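  • As an illustration of the preprocessing step above, the following sketch (not part of the original application) applies pre-emphasis, a 10 ms frame shift, and a Hamming window with NumPy; the pre-emphasis coefficient and the 25 ms frame length are assumed values chosen for the example.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, pre_emphasis=0.97,
               frame_len_ms=25, frame_shift_ms=10):
    """Pre-emphasize, frame, and window a raw speech signal (illustrative values)."""
    # Pre-emphasis: a first-order high-pass filter that boosts the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # Framing: overlapping frames with a 10 ms frame shift, as in the description above.
    # (Assumes the signal is at least one frame long.)
    frame_len = int(sample_rate * frame_len_ms / 1000)
    frame_shift = int(sample_rate * frame_shift_ms / 1000)
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(num_frames)])

    # Windowing: a Hamming window smooths the boundary between neighbouring frames.
    return frames * np.hamming(frame_len)
```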
  • Step 2 The server extracts spectral features from the preprocessed voice data.
  • the server may obtain the Mel frequency cepstrum coefficient corresponding to the voice data.
  • Specifically, the server can perform a fast Fourier transform on each voice frame to obtain the energy distribution of each voice frame in the frequency domain, that is, the frequency spectrum of each voice frame, and the server can then obtain the power spectrum of each voice frame based on the frequency spectrum.
  • For example, the power spectrum of each voice frame can be obtained by taking the squared modulus of the frequency spectrum of each voice frame.
  • the server can pass the power spectrum of each frame through N Mel-scale filter banks.
  • a filter bank can include M triangular filters.
  • The server can obtain the output result of each filter bank and calculate its logarithmic energy.
  • N and M are both positive integers, and the specific number can be set by the developer, which is not limited in the embodiment of the application.
  • the server can perform discrete cosine transform on each logarithmic energy to obtain the Mel frequency cepstral coefficient, and use the Mel frequency cepstral coefficient as the frequency spectrum feature of the voice data.
  • the spectral characteristics of the voice data may also include the change information of the Mel frequency cepstral coefficients.
  • That is, the server may also calculate the difference spectrum between the Mel frequency cepstral coefficients of adjacent voice frames to obtain a first-order difference parameter and a second-order difference parameter, and determine the frequency spectrum feature based on the Mel frequency cepstral coefficients, the first-order difference parameter, and the second-order difference parameter of a voice frame.
  • The Mel frequency cepstral coefficients can be used to indicate the static characteristics of the voice data, while the difference spectrum between these static characteristics can be used to indicate the dynamic characteristics of the voice data; combining the static and dynamic characteristics can improve the accuracy of the speech recognition result.
  • the server may also obtain volume information of each voice frame, that is, calculate the frame energy of each voice frame, and add the frame energy to the spectrum feature.
  • the spectrum feature may also include information of other dimensions, which is not limited in the embodiment of the present application.
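  • The FFT, Mel filter bank, log-energy, and DCT chain described above, together with the difference parameters and the frame energy, can be sketched with an off-the-shelf library such as librosa; the library choice and the use of 13 coefficients are illustrative assumptions, not requirements of the application.

```python
import numpy as np
import librosa

def spectral_features(signal, sample_rate=16000, n_mfcc=13):
    """MFCCs plus first/second-order differences and frame energy (illustrative)."""
    hop = int(0.010 * sample_rate)  # 10 ms frame shift
    # FFT -> Mel filter bank -> log energy -> DCT, wrapped by librosa.
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc, hop_length=hop)
    delta1 = librosa.feature.delta(mfcc, order=1)   # first-order difference (dynamic info)
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order difference
    energy = librosa.feature.rms(y=signal, hop_length=hop)  # per-frame energy (volume)
    # One column per frame: static MFCCs, their differences, and the frame energy.
    n = min(mfcc.shape[1], energy.shape[1])
    return np.vstack([mfcc[:, :n], delta1[:, :n], delta2[:, :n], energy[:, :n]])
```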
  • the server splices the frequency spectrum features corresponding to at least one voice frame belonging to the same voice segment into one voice feature.
  • a voice segment may include multiple voice frames in the voice data, and the specific number of the voice frames may be set by the developer, which is not limited in the embodiment of the present application.
  • the server may divide the voice data according to the preset duration to obtain multiple voice fragments.
  • A voice segment may include multiple voice frames, and the voice feature of a voice segment may be composed of the frequency spectrum features of these voice frames.
  • the preset duration can be set by the developer, which is not limited in the embodiment of the present application. Taking the preset duration set to 600ms and the duration of each voice frame to 10ms as an example, each voice segment obtained by the server may include 60 voice frames.
  • When the last voice segment is incomplete, the server can copy the last voice frame and append the copied voice frame after the last voice frame to pad the data of the last voice segment.
  • In a possible implementation, the server can also combine multiple voice frames into one intermediate frame. For example, three voice frames can be combined into one intermediate frame, each intermediate frame having a duration of 30 ms, so that each 600 ms voice segment can include 20 intermediate frames.
  • Likewise, the server can copy the last intermediate frame and append the copied intermediate frame after the last intermediate frame to pad the data of the last voice segment. The embodiment of the present application does not limit the specific method for determining the voice segments.
  • In the embodiment of the present application, the case where a voice segment includes multiple intermediate frames is taken as an example for description.
  • For a voice segment, the spectral features corresponding to its voice frames can be spliced to obtain the voice feature.
  • For example, the feature vectors of the voice frames can be connected end to end in time order to obtain a higher-dimensional vector as the voice feature.
  • the server may also apply other methods to perform feature splicing, which is not limited in the embodiment of the present application.
  • In the embodiment of the present application, the voice feature of a voice segment is obtained through feature splicing, so that the multiple voice frames in the voice segment can be processed together with the segment as the unit, which can improve the efficiency of voice recognition.
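  • A minimal sketch of this splicing step, assuming 600 ms segments of 10 ms frames (60 frames per segment) and padding an incomplete final segment by repeating its last frame, as described above.

```python
import numpy as np

def splice_segments(frame_feats, frames_per_segment=60):
    """frame_feats: (num_frames, feat_dim) -> (num_segments, frames_per_segment * feat_dim)."""
    num_frames, feat_dim = frame_feats.shape
    remainder = num_frames % frames_per_segment
    if remainder:
        # Pad the incomplete last segment by repeating the final frame.
        pad = np.repeat(frame_feats[-1:], frames_per_segment - remainder, axis=0)
        frame_feats = np.concatenate([frame_feats, pad], axis=0)
    num_segments = frame_feats.shape[0] // frames_per_segment
    # Concatenate the frame vectors of each segment end to end into one high-dimensional vector.
    return frame_feats.reshape(num_segments, frames_per_segment * feat_dim)
```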
  • steps 202 and 203 extract the key information of the voice data by means of feature extraction, and apply the key information in the form of features in the subsequent recognition process to ensure the accuracy of the recognition result.
  • the server inputs the voice features of the at least two voice segments into the voice recognition model.
  • The server is provided with a voice recognition model, where the voice recognition model can perform further feature extraction on a voice segment based on the voice features of multiple voice segments that follow it, so that the features of the voice segment can be merged with the information of those multiple voice segments.
  • For example, when the server performs feature extraction on one voice segment through the voice recognition model, it can acquire the features of multiple voice segments that follow that voice segment in time sequence and perform a weighting operation on the features of those multiple voice segments and the features of the one voice segment, so that the output result corresponding to the one voice segment is combined with the features of the multiple subsequent voice segments.
  • In a possible implementation, the speech recognition model can be a model built based on TLC-BLSTM (Tri-Latency-Controlled Bidirectional Long Short-Term Memory) networks.
  • The speech recognition model can include multiple hidden layers; one hidden layer can be composed of a TLC-BLSTM network, and one hidden layer can include multiple hidden layer units.
  • Each hidden layer unit can memorize each piece of input information, save the input information inside the network, and use it in the current operation process.
  • When a hidden layer unit performs an operation, it may be based on the current input information, the operation result of the previous hidden layer unit in the current hidden layer, and the operation result of a hidden layer unit in the previous hidden layer.
  • FIG 3 is a structural schematic diagram of a speech recognition model expanded in time sequence according to an embodiment of the present application.
  • the speech recognition model may include hidden layers L1, L2, and L3.
  • The input information of the speech recognition model is C1, C2, and C3. For the hidden layer unit 301 in the hidden layer L2, the hidden layer unit 301 may obtain the feature corresponding to C2 based on the operation result of the previous hidden layer unit 302 in the current hidden layer L2 and the operation result of the hidden layer unit 303 in the previous hidden layer L1.
  • The speech recognition model constructed based on the TLC-BLSTM network can recognize and decode the speech segment by segment, and therefore supports streaming speech recognition. The TLC-BLSTM network also adds reverse recognition on the basis of forward recognition, which effectively improves the robustness of the speech recognition model. In addition, the TLC-BLSTM network adopts a triangular field-of-view method: as shown by the information transmission paths in Figure 3, the transmission path of the calculation results of the hidden layer units can be regarded as the hypotenuse of a triangle. This triangular field of view expands the width of the field of view of the reverse LSTM and fully obtains the following information for the same speech segment size.
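  • To make the triangular field of view concrete: because each layer's reverse pass looks one segment ahead, stacking n such layers lets the top-layer feature of a segment draw on the n following segments. A small bookkeeping sketch, under the assumption that each layer's reverse pass reads a full following segment:

```python
def future_context(segment_index, num_layers, segment_ms=600):
    """Segments (and milliseconds of audio) visible to the right of one segment
    after stacking `num_layers` hidden layers that each look one segment ahead."""
    visible = list(range(segment_index + 1, segment_index + 1 + num_layers))
    return visible, num_layers * segment_ms

# With 3 cascaded hidden layers and 600 ms segments, segment 0's hidden layer
# feature can draw on segments [1, 2, 3], i.e. roughly 1800 ms of following audio,
# versus the 300-600 ms available to a single latency-controlled layer.
print(future_context(0, num_layers=3))
```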
  • Table 1 is a performance comparison table of a speech recognition model provided by the embodiments of this application.
  • Table 1 shows the performance of speech recognition models constructed based on LSTM (Long Short-Term Memory), BLSTM (Bidirectional Long Short-Term Memory), LC-BLSTM (Latency-Controlled Bidirectional Long Short-Term Memory), and TLC-BLSTM respectively, compared in the two dimensions of real-time capability and robustness.
  • When the speech recognition model is based on TLC-BLSTM, it can well support real-time speech recognition, that is, streaming operation, and its triangular latency-control characteristic can effectively broaden the field of view over the voice signal, mapping the voice features from wide to narrow layer by layer, which takes the robustness of the model into account while ensuring real-time performance.
  • the server obtains the hidden layer features of each voice segment through the voice recognition model.
  • the server may process the voice features of each voice segment through each hidden layer in the speech recognition model to obtain hidden layer features.
  • In a possible implementation, for the first hidden layer in the speech recognition model, the first hidden layer takes the speech feature of any speech segment and the speech feature of the following speech segment as input, and outputs the initial hidden layer feature corresponding to that speech segment; for any intermediate hidden layer in the speech recognition model, the intermediate hidden layer takes the initial hidden layer feature of that speech segment and the initial hidden layer feature of the following speech segment as input, and outputs the intermediate hidden layer feature corresponding to that speech segment; for the last hidden layer in the speech recognition model, the last hidden layer takes the intermediate hidden layer feature of that speech segment and the intermediate hidden layer feature of the following speech segment as input, and outputs the hidden layer feature corresponding to that speech segment.
  • That is, the server inputs the voice features of the at least two voice segments into the voice recognition model; inputs the voice feature of the i-th voice segment and the voice feature of the (i+1)-th voice segment into the first hidden layer in the voice recognition model, and outputs the initial hidden layer feature of the i-th voice segment; inputs the initial hidden layer feature of the i-th voice segment and the initial hidden layer feature of the (i+1)-th voice segment into the first intermediate hidden layer in the voice recognition model, and outputs the first intermediate hidden layer feature of the i-th voice segment, where the initial hidden layer feature of the (i+1)-th voice segment is calculated by the first hidden layer based on the voice feature of the (i+1)-th voice segment and the voice feature of the (i+2)-th voice segment; and inputs the j-th intermediate hidden layer feature of the i-th voice segment and the j-th intermediate hidden layer feature of the (i+1)-th voice segment into the (j+1)-th intermediate hidden layer in the voice recognition model, and outputs the (j+1)-th intermediate hidden layer feature of the i-th voice segment.
  • In a possible implementation, for any hidden layer, the server can perform a forward operation on the feature of any speech segment through the hidden layer to obtain a first feature; perform a reverse operation on the feature of that speech segment and the feature of the following speech segment to obtain a second feature; and splice the first feature and the second feature to obtain the feature output by that hidden layer.
  • That is, the server inputs the feature of the i-th speech segment into the k-th hidden layer in the speech recognition model, and performs a forward operation on the feature of the i-th speech segment through the k-th hidden layer to obtain the first feature;
  • the k-th hidden layer then performs a reverse operation on the feature of the i-th speech segment and the feature of the (i+1)-th speech segment to obtain the second feature; and the first feature and the second feature are spliced to obtain the feature output by the k-th hidden layer, where k is a positive integer.
  • the second feature can be obtained by the server based on all intermediate frames in any speech segment and some intermediate frames in the latter speech segment.
  • In a possible implementation, the server can obtain a second target number of voice frames in the following speech segment and, through the hidden layer, perform a reverse operation on the features corresponding to that speech segment and the second target number of voice frames to obtain the second feature.
  • the value of the second target number is less than or equal to the total number of voice frames included in a voice segment, and the value of the second target number can be set by the developer, which is not limited in the embodiment of the application.
  • That is, the server obtains the second target number of voice frames in the (i+1)-th voice segment and, through the k-th hidden layer, performs a reverse operation on the features corresponding to the i-th voice segment and the second target number of voice frames to obtain the second feature.
  • Any speech segment can be denoted as C_curr.
  • The speech segment located after the speech segment C_curr in time sequence is denoted as C_right.
  • Both the speech segment C_curr and the speech segment C_right can include N_c intermediate frames.
  • N_c is a positive integer, and its specific value can be set by the developer.
  • FIG. 4 is a schematic diagram of a hidden layer feature extraction principle provided by an embodiment of the present application.
  • Taking the first hidden layer as an example, the server may obtain the first feature 401 corresponding to the previous voice segment of the voice segment C_curr, and a hidden layer unit in the first hidden layer performs a forward operation on the voice segment C_curr based on the first feature 401.
  • One hidden layer unit can include multiple sub-hidden layer units, and each sub-hidden layer unit performs a forward operation on the voice feature corresponding to the speech segment C_curr.
  • During the forward operation, one sub-hidden layer unit can obtain the calculation result of the previous sub-hidden layer unit.
  • For example, the sub-hidden layer unit 402 can obtain the calculation result of the previous sub-hidden layer unit 403, and the server can obtain the calculation result of the last sub-hidden layer unit as the first feature of the speech segment C_curr.
  • The server may obtain the first N_r intermediate frames of the speech segment C_right and, in time order, splice the N_c intermediate frames of the speech segment C_curr with these first N_r intermediate frames to obtain the speech segment C_merge.
  • N_r is a positive integer,
  • N_r is less than or equal to N_c, and its specific value can be set by the developer.
  • The server uses multiple sub-hidden layer units to perform a reverse LSTM operation on the N_c + N_r frames in the voice segment C_merge; that is, one sub-hidden layer unit can obtain the operation result of the next sub-hidden layer unit.
  • For example, the sub-hidden layer unit 404 may obtain the operation result of the following sub-hidden layer unit 405, and the server may obtain the output result of the first sub-hidden layer unit as the second feature of the speech segment C_curr.
  • the server may splice the first feature and the second feature according to the target sequence. For example, when the first feature and the second feature are both expressed as vectors, the server may concatenate the two vectors into a higher-dimensional vector. Of course, the server can also assign different weights to the first feature and the second feature, and concatenate the weighted first feature and the second feature.
  • the embodiment of the present application does not limit which feature splicing method is specifically adopted.
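  • The following PyTorch sketch illustrates one such hidden layer as described above: a forward LSTM over C_curr that carries state from the previous segment, a reverse LSTM over C_merge (C_curr plus the first N_r frames of C_right), and concatenation of the two outputs. It is an illustrative reading of the description, not the application's exact implementation; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TLCBLSTMLayer(nn.Module):
    """One hidden layer: forward LSTM over C_curr, reverse LSTM over C_merge,
    with the first (forward) and second (reverse) features spliced together."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fwd = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.bwd = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, c_curr, c_right, n_r, fwd_state=None):
        # Forward operation over the current segment, carrying state from the previous segment.
        first, fwd_state = self.fwd(c_curr, fwd_state)

        # Reverse operation over C_merge = C_curr + first N_r frames of C_right.
        c_merge = torch.cat([c_curr, c_right[:, :n_r]], dim=1)
        rev_out, _ = self.bwd(torch.flip(c_merge, dims=[1]))
        rev_out = torch.flip(rev_out, dims=[1])
        second = rev_out[:, :c_curr.size(1)]      # keep only the N_c positions of C_curr

        # Splice the first and second features into the layer output.
        return torch.cat([first, second], dim=-1), fwd_state

# Example: batch of 1, segments of 20 intermediate frames, 40-dim features, N_r = 20.
layer = TLCBLSTMLayer(input_size=40, hidden_size=64)
c_curr, c_right = torch.randn(1, 20, 40), torch.randn(1, 20, 40)
out, state = layer(c_curr, c_right, n_r=20)
print(out.shape)  # torch.Size([1, 20, 128])
```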
  • The above steps 204 and 205 are the steps of inputting the voice features of the at least two voice segments into the voice recognition model and having the n cascaded hidden layers in the voice recognition model sequentially process the voice features of each voice segment to obtain the hidden layer feature of each voice segment,
  • where the hidden layer feature of one voice segment is determined based on the n voice segments that follow that voice segment in time sequence and the voice features of that voice segment; that is,
  • the hidden layer feature of the i-th voice segment is determined based on the voice features of the i-th voice segment and the n voice segments located after the i-th voice segment in time sequence.
  • n is a positive integer
  • n is the number of hidden layers in the speech recognition model, which can be set by the developer, which is not limited in the embodiment of the present application.
  • For example, the hidden layer feature of a speech segment can be determined based on the information of the following three speech segments, and those three speech segments can in turn include the information of other speech segments; the embodiment of the present application does not limit this.
  • In this way, the features of multiple voice segments after the current voice segment can be obtained, and the following information can be fully combined when recognizing the current voice segment, which can improve the accuracy of the recognition result.
  • the server determines the phoneme information corresponding to each speech segment based on the hidden layer feature.
  • In a possible implementation, the speech recognition model may also include a feature classification layer, which can be used to classify the hidden layer feature of each speech segment to obtain the phoneme information corresponding to each speech segment, that is, to obtain the probability value of each phoneme corresponding to the speech segment.
  • the feature classification layer may be constructed based on a fully connected layer and a SoftMax function (logistic regression function).
  • Specifically, the server can input the hidden layer feature corresponding to the speech segment into a fully connected layer, map the hidden layer feature to a vector based on the weight parameters in the fully connected layer, and then map each element in the vector into (0, 1), thereby obtaining a probability vector; an element in the probability vector can indicate the probability value of the speech segment corresponding to a certain phoneme.
  • the server may obtain the probability vector corresponding to each speech segment as the phoneme information.
  • Figure 5 is a schematic diagram of a voice recognition model provided by an embodiment of the present application. After the input voice features are processed by the multiple hidden layers 501, the server can input the calculation results into the feature classification layer 502, and the feature classification layer 502 outputs the classification result.
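  • A minimal sketch of such a feature classification layer (a fully connected layer followed by SoftMax), with hypothetical dimensions:

```python
import torch
import torch.nn as nn

class PhonemeClassifier(nn.Module):
    """Feature classification layer: fully connected layer followed by SoftMax."""

    def __init__(self, hidden_dim, num_phonemes):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_phonemes)

    def forward(self, hidden_feats):            # (batch, frames, hidden_dim)
        logits = self.fc(hidden_feats)          # map hidden layer features to a vector
        return torch.softmax(logits, dim=-1)    # map each element into (0, 1): phoneme probabilities

# Example with hypothetical sizes: 128-dim hidden features, 100 phoneme classes.
probs = PhonemeClassifier(128, 100)(torch.randn(1, 20, 128))
print(probs.shape)  # torch.Size([1, 20, 100]); each frame's probabilities sum to 1
```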
  • the server determines text information corresponding to the voice data based on the phoneme information, pronunciation dictionary, and language model.
  • The server is provided with a pronunciation dictionary and a language model, where the pronunciation dictionary is used to indicate the mapping relationship between phonemes and pronunciations, and the language model is used to determine the probability values corresponding to the various phrases that make up the text information.
  • the server may further process the phoneme information before determining the text information based on the phoneme information, so as to improve the accuracy of speech-to-text conversion.
  • In a possible implementation, the server can perform forward decoding based on the Bayesian formula, the initial probability matrix and transition probability matrix of a Hidden Markov Model (HMM), and the probability vectors corresponding to the speech segment, to obtain the hidden state sequence corresponding to the input speech segment.
  • the embodiment of the present application does not limit the foregoing specific process of forward decoding.
  • the server can obtain text information based on the hidden state sequence, pronunciation dictionary, and language model.
  • In a possible implementation, the server can construct a WFST (Weighted Finite State Transducer) network based on the pronunciation dictionary, the language model, and so on, and the WFST can obtain the text combination corresponding to the voice segment based on the input information, the output information, and the weight values for mapping the input to the output.
  • the input information may be a hidden state sequence corresponding to the speech segment
  • the output information may be a text that may correspond to a phoneme.
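  • As a sketch of the forward-decoding idea only (the WFST composition with the pronunciation dictionary and language model is not shown), the hidden state sequence can be recovered from the per-frame phoneme probability vectors with a standard Viterbi search over hypothetical HMM initial and transition probabilities:

```python
import numpy as np

def viterbi_decode(posteriors, init_prob, trans_prob):
    """Recover the most likely hidden state sequence from per-frame phoneme
    posteriors using HMM initial and transition probabilities (log domain)."""
    num_frames, num_states = posteriors.shape
    log_post = np.log(posteriors + 1e-12)
    log_init = np.log(init_prob + 1e-12)
    log_trans = np.log(trans_prob + 1e-12)

    score = log_init + log_post[0]
    back = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        cand = score[:, None] + log_trans       # scores for (previous state, state) pairs
        back[t] = cand.argmax(axis=0)           # best predecessor for each state
        score = cand.max(axis=0) + log_post[t]

    # Trace back the best path to obtain the hidden state sequence.
    path = [int(score.argmax())]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```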
  • steps 206 and 207 are steps for obtaining the text information corresponding to the voice data based on the hidden layer features of each voice segment.
  • the foregoing description of obtaining text information based on hidden layer features is only an exemplary description, and the embodiment of the present application does not specifically limit this.
  • the technical solution provided by the embodiments of the present application is to obtain the voice data to be recognized; perform feature extraction on the voice data to obtain the voice features of at least two voice segments in the voice data; and the voice features of the at least two voice segments Input the speech recognition model.
  • the n cascaded hidden layers in the speech recognition model sequentially process the speech features of each speech segment to obtain the hidden layer features of each speech segment.
  • The hidden layer feature of one speech segment is determined based on the n voice segments located after that voice segment in time sequence; finally, based on the hidden layer features of each voice segment, the text information corresponding to the voice data is obtained.
  • This method extracts the voice features of at least two voice segments from the voice data, and then calls the voice recognition model to learn and recognize the voice features of each voice segment.
  • In the recognition process, the hidden layer feature of each voice segment is learned based on the features of the subsequent voice segments, so that the following information is fully learned;
  • the method also learns the preceding information through the forward operation over the frames in the speech segment, so that the text information recognized based on the context information is more accurate, which further improves the accuracy of speech recognition.
  • the spectral feature extraction module and the speech recognition model can form an acoustic model, and the acoustic model, pronunciation dictionary, and language model can form a decoder, and the decoder can be used to recognize streaming speech data.
  • FIG. 6 is a schematic structural diagram of a decoder provided by an embodiment of the present application.
  • the decoder 601 may include an acoustic model 602, a pronunciation dictionary 603, and a language model 604.
  • The acoustic model 602 may include a spectral feature extraction module 605 and the voice recognition model 606.
  • the technical solution provided by the embodiments of the present application constructs a voice recognition model based on the TLC-BLSTM network to realize streaming voice recognition, and can stably output voice recognition services with high recognition rate and low latency.
  • Table 2 shows the speech recognition performance of models constructed based on LSTM, BLSTM, LC-BLSTM, and TLC-BLSTM respectively, including the delay time and the average word error rate.
  • The average word error rate represents the number of wrongly recognized words per 100 recognized words,
  • and is the average performance of each scheme over multiple clean and noisy test sets.
  • the technical solutions provided by the embodiments of this application not only support streaming speech recognition, but also have the same low latency as other solutions supporting streaming speech recognition, such as LSTM and LC-BLSTM.
  • the average word error rate of this scheme is equivalent to that of the BLSTM model, and is somewhat lower than that of LSTM and LC-BLSTM. This solution can achieve stable and low-latency output, and the recognition result is accurate.
  • In a possible implementation, a display device may be provided on the server, and the server may display the text information corresponding to the voice data on a target page of the display device; the server may also send the text information corresponding to the voice data to a user terminal, and the text information corresponding to the voice data may be displayed on a target page of the user terminal.
  • the target page may be a conversation page, a search page, etc., which is not limited in the embodiment of the present application.
  • FIG. 7 is a schematic diagram of a text information display manner provided by an embodiment of the present application.
  • the server can perform voice recognition on the voice data input by the user on the conversation page 701.
  • the text information 702 corresponding to the voice data is returned to the terminal, and displayed in the conversation display area 703 of the target page 701.
  • this solution can be applied to any scenario where text information needs to be input.
  • this solution can also be applied to a search scenario.
  • For example, when searching for information, the user does not need to type the search content word by word; the content can simply be entered by voice.
  • the speech recognition technology provided by this solution can effectively improve the efficiency of information input.
  • the above-mentioned speech recognition technology is deployed in a server, and the server may be a background server of the target application, and the server may provide a speech recognition service for the target application.
  • the target application is an instant messaging application
  • the user can input voice information in the terminal in the form of voice input
  • the server converts the voice information into text information to improve the efficiency of information input.
  • FIG. 8 is a flowchart of a method for displaying text information according to an embodiment of the present application. Referring to FIG. 8, the method may specifically include the following steps:
  • the terminal acquires real-time input voice data in response to a voice input instruction.
  • the terminal may be a computer device used by the user.
  • the terminal may be a mobile phone, a computer, etc., and the terminal may install and run the target application.
  • In a possible implementation, when the terminal detects the user's triggering operation on the voice input control, it can acquire the voice data input by the user in real time.
  • the trigger operation may be a click operation, a long press operation, etc., which is not limited in the embodiment of the present application.
  • the terminal performs fragmentation processing on the voice data to obtain at least one voice fragment.
  • the terminal may perform segmentation processing on the voice data input by the user according to the target period, where the target period may be set by the developer, which is not limited in the embodiment of the present application.
  • the terminal obtains text information corresponding to each voice segment.
  • the terminal may send a voice recognition request to the server, and the voice recognition request carries the at least one voice fragment.
  • the voice fragment may be packaged using a network protocol.
  • The terminal may receive the text information returned by the server, where the text information is determined by the server based on the hidden layer feature of each voice segment.
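  • A hypothetical client-side sketch of this request/response exchange; the endpoint URL, field names, and response schema are invented for illustration, since the embodiment only specifies that the request carries the packaged voice segments and that text information is returned.

```python
import requests  # hypothetical HTTP transport; the application does not mandate a protocol

RECOGNIZE_URL = "https://speech.example.com/recognize"   # placeholder endpoint

def recognize_segments(segments, session_id):
    """Send sliced voice segments to the recognition server and collect the text."""
    texts = []
    for index, segment_bytes in enumerate(segments):
        response = requests.post(
            RECOGNIZE_URL,
            files={"audio": segment_bytes},                 # one packaged voice segment
            data={"session": session_id, "index": index},   # hypothetical request fields
        )
        response.raise_for_status()
        texts.append(response.json().get("text", ""))       # hypothetical response schema
    return "".join(texts)
```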
  • In some embodiments, after the terminal obtains the voice data, it can also recognize the voice data in real time locally to obtain the text information.
  • the embodiment of the present application does not limit this.
  • the step of performing voice recognition by the server is taken as an example for description.
  • the terminal displays the text information on the target page in response to the instruction for completing the voice input.
  • the target page may be a conversation page
  • the terminal may display the acquired text information on the conversation page.
  • The conversation page may display a voice input control.
  • In a possible implementation, when the user triggers the voice input control, a voice input window may be displayed on the conversation page; the user can record voice data while the voice input window is displayed, and the terminal collects the voice data through a voice collection device (such as a microphone).
  • When the voice input is completed, the terminal can hide the voice input window, obtain from the server all the text information corresponding to this voice input, and display it on the conversation page (that is, the target page).
  • The target page 901 may include a voice input window 902, and the voice input window 902 may display a recording control 903. When the terminal detects that the user presses the recording control 903, it can start to collect the voice information input by the user; when the user releases the recording control 903, the voice input completion instruction is triggered, and the terminal can obtain the text information corresponding to the voice data input this time and display it on the target page.
  • the target page may also be a search page, and the terminal may display the obtained text information on the search page.
  • When the user searches, the search content can be input in the form of voice.
  • When the terminal receives the voice input completion instruction, it obtains the text information corresponding to the voice data input this time and displays it in the search box.
  • For example, an end-recording control 905 is displayed in the target page 904.
  • When the end-recording control 905 is triggered, the voice input completion instruction is triggered, and the terminal can obtain the text information corresponding to the voice data input this time and display it on the target page, for example, in the search box.
  • voice input completion instruction can also be triggered in other ways, which is not limited in the embodiment of the present application.
  • the terminal may display text information corresponding to the input voice data in real time during the voice input process of the user.
  • FIG. 10 is a schematic diagram of a text information display effect provided by an embodiment of the present application.
  • The conversation page 1001 may display a voice input window 1002, and the voice input window 1002 may include a text information display area 1003.
  • During the voice input, the server can return the recognition results of the voice data in each voice segment in real time,
  • and the terminal can display the recognition result corresponding to each voice segment in the text information display area 1003.
  • When the voice input is completed, the terminal can send the voice input completion instruction to the server, and the server ends this round of voice recognition.
  • The terminal can then obtain all the recognition results of this voice recognition, that is, the text information corresponding to the voice data input this time, hide the voice input window 1002, and display the text information on the conversation page 1001.
  • the target page may be a conversation page or the like.
  • In a possible implementation, the terminal may display the text information corresponding to the voice data only after the voice input is completed; that is, the terminal can obtain all the text information from the server based on the voice data input by the user and the voice input completion instruction, and display the text information on the target page.
  • The technical solution provided by the embodiments of this application deploys the speech recognition solution in a cloud service as a basic capability to empower users of the cloud service, so that users do not need to rely on input methods such as pinyin or strokes when entering text information.
  • the speech recognition solution can perform real-time recognition of streaming speech, that is, real-time input speech, which can shorten the recognition time and improve the efficiency of speech recognition.
  • Fig. 11 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
  • the device can be implemented as part or all of a server or a terminal through hardware, software, or a combination of the two.
  • the device includes:
  • the voice acquisition module 1101 is used to acquire voice data to be recognized
  • the voice feature acquisition module 1102 is configured to perform feature extraction on the voice data to obtain voice features of at least two voice segments in the voice data;
  • the hidden layer feature acquisition module 1103 is used to input the voice features of the at least two voice segments into a voice recognition model, and the n cascaded hidden layers in the voice recognition model sequentially process the voice features of each voice segment to obtain The hidden layer feature of each speech segment, the hidden layer feature of the i-th speech segment is determined based on the n speech segments located after the i-th speech segment in time sequence and the voice feature of the i-th speech segment;
  • the text information obtaining module 1104 is configured to obtain text information corresponding to the voice data based on the hidden layer feature of each voice segment.
  • the voice feature acquisition module 1102 is used to:
  • the spectral features corresponding to at least one voice frame belonging to the same voice segment are spliced into one voice feature.
  • the hidden layer feature acquisition module 1103 is used to:
  • input the initial hidden layer feature of the i-th speech segment and the initial hidden layer feature of the (i+1)-th speech segment into the first intermediate hidden layer in the speech recognition model, and output the first intermediate hidden layer feature of the i-th speech segment; where the initial hidden layer feature of the (i+1)-th speech segment is calculated by the first hidden layer based on the speech feature of the (i+1)-th speech segment and the speech feature of the (i+2)-th speech segment;
  • the feature refers to the initial hidden layer feature.
  • the hidden layer feature acquisition module 1103 is used to:
  • input the feature of the i-th voice segment into the k-th hidden layer of the voice recognition model, and perform a forward operation on the feature of the i-th voice segment through the k-th hidden layer to obtain a first feature;
  • perform a reverse operation on the feature of the i-th voice segment and the feature of the (i+1)-th voice segment through the k-th hidden layer to obtain a second feature;
  • splice the first feature and the second feature to obtain the feature output by the k-th hidden layer, where k is a positive integer.
  • the hidden layer feature acquisition module 1103 is used to:
  • obtain a second target number of voice frames in the (i+1)-th voice segment;
  • perform a reverse operation, through the k-th hidden layer, on the features corresponding to the i-th voice segment and the second target number of voice frames to obtain the second feature.
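The forward/reverse computation attributed to module 1103 can be sketched as follows. A plain tanh recurrence stands in for the LSTM cells, the carry-over of state from the previous segment is omitted, and n_r = 20 lookahead frames is an assumed value; the point is only to show the forward pass over segment i, the reverse pass over segment i plus the first frames of segment i+1, and the splice of the two results.

import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(frames, w_in, w_rec, reverse=False):
    # A plain tanh recurrence stands in for the LSTM cell to keep the sketch short.
    h = np.zeros(w_rec.shape[0])
    order = reversed(range(len(frames))) if reverse else range(len(frames))
    for t in order:
        h = np.tanh(frames[t] @ w_in + h @ w_rec)
    return h        # last state (forward) or the state at the earliest frame (reverse)

def hidden_layer_output(seg_i, seg_next, n_r=20, hidden_dim=32):
    # One hidden layer: forward over segment i, reverse over segment i plus the
    # first n_r frames of segment i+1, then splice the two results.
    feat_dim = seg_i.shape[1]
    w_in_f = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
    w_rec_f = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
    w_in_b = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
    w_rec_b = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1

    first_feature = rnn_pass(seg_i, w_in_f, w_rec_f)                 # forward operation
    merged = np.vstack([seg_i, seg_next[:n_r]])                      # C_curr plus first n_r frames of C_right
    second_feature = rnn_pass(merged, w_in_b, w_rec_b, reverse=True) # reverse operation
    return np.concatenate([first_feature, second_feature])           # spliced output

seg_i = rng.standard_normal((60, 39))      # 60 frames with assumed 39-dimensional features
seg_next = rng.standard_normal((60, 39))
print(hidden_layer_output(seg_i, seg_next).shape)   # (64,)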
  • the device is also provided with a pronunciation dictionary and a language model; the text information acquisition module 1104 is used to:
  • determine, based on the hidden layer features, the phoneme information corresponding to each voice segment;
  • determine, based on the phoneme information, the pronunciation dictionary, and the language model, the text information corresponding to the voice data;
  • the pronunciation dictionary is used to indicate the mapping relationship between phonemes and pronunciations;
  • the language model is used to determine the probability value corresponding to each phrase that composes the text information.
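A toy decoding example for this step is given below. It assumes one phoneme per voice segment, a three-word pronunciation dictionary, and a unigram language model with made-up probabilities; the description elsewhere in this application uses HMM-based forward decoding and a WFST built from the dictionary and language model, so this is only a sketch of how acoustic phoneme probabilities, the dictionary, and language-model scores can be combined.

import itertools

# Toy phoneme probabilities for four voice segments (one phoneme per segment for
# brevity); in the device these would come from the classification layer.
phoneme_probs = [
    {"n": 0.7, "h": 0.2, "i": 0.1},
    {"i": 0.6, "n": 0.3, "au": 0.1},
    {"h": 0.5, "au": 0.4, "i": 0.1},
    {"au": 0.6, "i": 0.3, "h": 0.1},
]

# Pronunciation dictionary: word -> phoneme sequence (assumed entries).
lexicon = {"ni": ["n", "i"], "hao": ["h", "au"], "hi": ["h", "i"]}

# Unigram "language model": probability of each phrase (assumed values).
language_model = {"ni": 0.5, "hao": 0.4, "hi": 0.1}

def decode(probs, lexicon, lm, max_words=2):
    # Score every word sequence whose concatenated pronunciation has the right
    # length: acoustic score (product of phoneme probabilities) times LM score.
    best, best_score = None, 0.0
    for n in range(1, max_words + 1):
        for seq in itertools.product(lexicon, repeat=n):
            phones = [p for word in seq for p in lexicon[word]]
            if len(phones) != len(probs):
                continue
            score = 1.0
            for frame, phone in zip(probs, phones):
                score *= frame.get(phone, 0.0)      # acoustic contribution
            for word in seq:
                score *= lm[word]                   # language model contribution
            if score > best_score:
                best, best_score = seq, score
    return best

print(decode(phoneme_probs, lexicon, language_model))   # ('ni', 'hao')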
  • the device further includes:
  • the display module 1105 is configured to display the text information corresponding to the voice data on the target page.
  • the device provided by the embodiment of the present application extracts the voice features of at least two voice segments from the voice data, and then calls the voice recognition model to learn and recognize the voice features of each voice segment.
  • the hidden layer feature of each voice segment is obtained based on feature learning of the voice segments that follow it, so that the hidden layer feature of the voice segment fully learns the following information; the language expression of the finally recognized text information is therefore more fluent, the semantics are more accurate, and the accuracy of speech recognition is improved.
  • FIG. 12 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application. Referring to FIG. 12, the device includes:
  • the voice acquisition module 1201 is configured to acquire real-time input voice data in response to voice input instructions
  • the fragmentation module 1202 is configured to perform fragmentation processing on the voice data to obtain at least one voice fragment
  • the text information obtaining module 1203 is configured to obtain text information corresponding to each voice segment, the text information being obtained by using the voice recognition method according to any one of claims 1 to 6;
  • the display module 1204 is configured to display the text information on the target page in response to an instruction to complete the voice input.
  • the text information acquisition module 1203 is used to:
  • send a voice recognition request to a server, the voice recognition request carrying the at least one voice slice;
  • receive the text information returned by the server, the text information being determined by the server based on the hidden layer features of the voice segments in the voice slice (a streaming-client sketch follows below).
  • in the device provided by the embodiments of the present application, the server extracts the voice features of at least two voice segments from the voice data, and then calls the voice recognition model to learn and recognize the voice features of each voice segment.
  • the hidden layer feature of each voice segment is obtained based on feature learning of the subsequent voice segments, so that the hidden layer feature of the voice segment fully learns the following information; the language expression of the finally recognized text information is more fluent and the semantics are more accurate, which improves the accuracy of speech recognition.
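The minimal streaming-client sketch for module 1203 mentioned above follows. The canned partial results and the locally simulated request/response are assumptions standing in for the real network round trip; a real client would pack each voice slice, send the voice recognition request to the server, and render whatever text the server returns.

import time

PARTIALS = ["ni", "ni hao", "ni hao ma"]        # canned partial results (assumed text)

def record_audio(total_ms=1800, slice_ms=600):
    # Simulated microphone: yields one packed audio slice per target period.
    for _ in range(total_ms // slice_ms):
        time.sleep(0.01)                        # stand-in for real-time capture
        yield b"\x00" * (16 * slice_ms)         # placeholder bytes for one slice

def recognize_slice(slice_bytes, slice_index):
    # Stand-in for the voice recognition request/response round trip; a real
    # client would send the packed slice to the server and parse the returned text.
    return PARTIALS[min(slice_index, len(PARTIALS) - 1)]

display_area = ""
for i, audio_slice in enumerate(record_audio()):
    display_area = recognize_slice(audio_slice, i)      # partial result shown as it arrives
    print("text information display area:", display_area)

# Voice input completed: hide the input window and show the full text on the target page.
print("target page:", display_area)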
  • it should be noted that when the voice recognition device provided in the above embodiments performs voice recognition, the division into the above functional modules is merely illustrative; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the voice recognition device provided by the above-mentioned embodiment belongs to the same concept as the voice recognition method embodiment, and its specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • FIG. 13 is a schematic structural diagram of a terminal provided in an embodiment of the present application.
  • the terminal 1300 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer.
  • the terminal 1300 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 1300 includes: one or more processors 1301 and one or more memories 1302.
  • the processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 1301 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 1301 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 1301 may be integrated with a GPU (Graphics Processing Unit), and the GPU is used to render and draw the content that needs to be displayed on the display screen.
  • the processor 1301 may further include an AI (Artificial Intelligence) processor, and the AI processor is used to process computing operations related to machine learning.
  • the memory 1302 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 1302 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 1302 is used to store at least one piece of program code, and the at least one piece of program code is used to be executed by the processor 1301 to implement the voice recognition method provided in the method embodiments of the present application.
  • the terminal 1300 may optionally further include: a peripheral device interface 1303 and at least one peripheral device.
  • the processor 1301, the memory 1302, and the peripheral device interface 1303 may be connected by a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1304, a display screen 1305, a camera component 1306, an audio circuit 1307, a positioning component 1308, and a power supply 1309.
  • the peripheral device interface 1303 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 1301 and the memory 1302.
  • in some embodiments, the processor 1301, the memory 1302, and the peripheral device interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 1304 is used to receive and transmit RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 1304 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 1304 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and so on.
  • the radio frequency circuit 1304 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity, wireless fidelity) networks.
  • the radio frequency circuit 1304 may also include a circuit related to NFC (Near Field Communication), which is not limited in this application.
  • the display screen 1305 is used to display UI (User Interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • when the display screen 1305 is a touch display screen, the display screen 1305 also has the ability to collect touch signals on or above the surface of the display screen 1305.
  • the touch signal may be input to the processor 1301 as a control signal for processing.
  • the display screen 1305 may also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • in some embodiments, there may be one display screen 1305, which is provided on the front panel of the terminal 1300; in other embodiments, there may be at least two display screens 1305, which are respectively arranged on different surfaces of the terminal 1300 or in a folded design;
  • the display screen 1305 may be a flexible display screen, which is disposed on the curved surface or the folding surface of the terminal 1300.
  • the display screen 1305 can also be set as a non-rectangular irregular pattern, that is, a special-shaped screen.
  • the display screen 1305 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the camera assembly 1306 is used to collect images or videos.
  • the camera assembly 1306 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • the camera assembly 1306 may also include a flash.
  • the flash can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
  • the audio circuit 1307 may include a microphone and a speaker.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 1301 for processing, or input to the radio frequency circuit 1304 to implement voice communication. For the purpose of stereo collection or noise reduction, there may be multiple microphones, which are respectively set in different parts of the terminal 1300.
  • the microphone can also be an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 1301 or the radio frequency circuit 1304 into sound waves.
  • the speaker can be a traditional thin-film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 1307 may also include a headphone jack.
  • the positioning component 1308 is used to locate the current geographic location of the terminal 1300 to implement navigation or LBS (Location Based Service, location-based service).
  • the positioning component 1308 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 1309 is used to supply power to various components in the terminal 1300.
  • the power source 1309 may be alternating current, direct current, disposable batteries, or rechargeable batteries.
  • the rechargeable battery may support wired charging or wireless charging.
  • the rechargeable battery can also be used to support fast charging technology.
  • the terminal 1300 further includes one or more sensors 1310.
  • the one or more sensors 1310 include, but are not limited to: an acceleration sensor 1311, a gyroscope sensor 1312, a pressure sensor 1313, a fingerprint sensor 1314, an optical sensor 1315, and a proximity sensor 1316.
  • the acceleration sensor 1311 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 1300.
  • the acceleration sensor 1311 can be used to detect the components of gravitational acceleration on three coordinate axes.
  • the processor 1301 may control the display screen 1305 to display the user interface in a horizontal view or a vertical view according to the gravitational acceleration signal collected by the acceleration sensor 1311.
  • the acceleration sensor 1311 can also be used for the collection of game or user motion data.
  • the gyroscope sensor 1312 can detect the body direction and rotation angle of the terminal 1300, and the gyroscope sensor 1312 can cooperate with the acceleration sensor 1311 to collect the user's 3D actions on the terminal 1300.
  • the processor 1301 can implement the following functions according to the data collected by the gyroscope sensor 1312: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 1313 may be arranged on the side frame of the terminal 1300 and/or the lower layer of the display screen 1305.
  • when the pressure sensor 1313 is arranged on the side frame of the terminal 1300, the holding signal of the user on the terminal 1300 can be detected, and the processor 1301 performs left/right hand recognition or quick operations according to the holding signal collected by the pressure sensor 1313.
  • when the pressure sensor 1313 is arranged on the lower layer of the display screen 1305, the processor 1301 controls the operability controls on the UI according to the user's pressure operation on the display screen 1305.
  • the operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
  • the fingerprint sensor 1314 is used to collect the user's fingerprint.
  • the processor 1301 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the user's identity according to the collected fingerprint.
  • when the user's identity is recognized as a trusted identity, the processor 1301 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 1314 may be provided on the front, back or side of the terminal 1300. When a physical button or a manufacturer logo is provided on the terminal 1300, the fingerprint sensor 1314 can be integrated with the physical button or the manufacturer logo.
  • the optical sensor 1315 is used to collect the ambient light intensity.
  • the processor 1301 may control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1305 is increased; when the ambient light intensity is low, the display brightness of the display screen 1305 is decreased.
  • the processor 1301 may also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
  • the proximity sensor 1316, also called a distance sensor, is usually arranged on the front panel of the terminal 1300.
  • the proximity sensor 1316 is used to collect the distance between the user and the front of the terminal 1300.
  • when the proximity sensor 1316 detects that the distance between the user and the front of the terminal 1300 gradually decreases, the processor 1301 controls the display screen 1305 to switch from the screen-on state to the screen-off state; when the proximity sensor 1316 detects that the distance between the user and the front of the terminal 1300 gradually increases, the processor 1301 controls the display screen 1305 to switch from the screen-off state to the screen-on state.
  • those skilled in the art can understand that the structure shown in FIG. 13 does not constitute a limitation on the terminal 1300, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • FIG. 14 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 1400 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1401 and one or more memories 1402, where at least one piece of program code is stored in the one or more memories 1402, and the at least one piece of program code is loaded and executed by the one or more processors 1401 to implement the methods provided in the foregoing method embodiments.
  • the server 1400 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output.
  • the server 1400 may also include other components for implementing device functions, which will not be repeated here.
  • in an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including at least one piece of program code, and the at least one piece of program code can be executed by a processor to complete the voice recognition method in the foregoing embodiments.
  • the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.

Abstract

A speech recognition method, apparatus, computer device, and computer-readable storage medium. The method includes: acquiring voice data to be recognized; performing feature extraction on the voice data to obtain voice features of at least two voice segments in the voice data; inputting the voice features of the at least two voice segments into a speech recognition model, where cascaded hidden layers in the speech recognition model sequentially process the voice features of each voice segment to obtain a hidden layer feature of each voice segment, the hidden layer feature of the i-th voice segment being determined based on the n voice segments located after the i-th voice segment in time sequence and the voice feature of the i-th voice segment; and obtaining text information corresponding to the voice data based on the hidden layer features of the voice segments. In this method, the text information corresponding to a voice segment is determined in combination with the multiple voice segments that follow it, so the following context is fully acquired during speech recognition, which improves the accuracy of speech recognition.

Description

语音识别方法、装置、计算机设备及计算机可读存储介质
本申请要求于2020年01月22日提交的申请号为202010074075.X、发明名称为“语音识别方法、装置、计算机设备及计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音处理领域,特别涉及一种语音识别方法、装置、计算机设备及计算机可读存储介质。
背景技术
语音识别,是将语音数据转换为文本输出的过程。语音数据往往具有上下文联系,在语音识别时利语音数据的上下文信息可以使识别结果更加准确。对于实时输入的语音数据,在进行语音识别时,通常应用延时控制的双向长短时记忆网络,来获取下文信息,例如,延时控制的双向长短时记忆网络获取到当前输入语音数据后,可以延时处理,在获取到一段下文语音数据后,再基于下文语音数据对当前输入语音数据进行识别。
但是,为保证语音识别的实时性,延时的时长往往较短,因而在延时时段内所获取到的下文语音数据也较少,例如,通常情况下仅能获取300至600毫秒的语音片段,导致语音识别过程中获取到的下文信息较少,进而影响识别结果的准确率。
发明内容
本申请实施例提供了一种语音识别方法、装置、计算机设备及计算机可读存储介质,可以提高语音识别结构的准确率。该技术方案如下:
一方面,提供了一种语音识别方法,应用于计算机设备中,该计算机设备中设置有语音识别模型,该方法包括:
获取待识别的语音数据;
对该语音数据进行特征提取,得到该语音数据中至少两个语音片段的语音特征;
将该至少两个语音片段的语音特征输入语音识别模型,由该语音识别模型 中n个级联的隐层依次对各个该语音片段的语音特征进行处理,得到各个该语音片段的隐层特征,第i个语音片段的隐层特征是基于时序上位于第i个语音片段之后的n个语音片段以及第i个语音片段的语音特征确定的;
基于各个该语音片段的隐层特征,得到该语音数据对应的文本信息。
另一方面,提供了一种语音识别方法,应用于终端中,该方法包括:
响应于语音输入指令,获取实时输入的语音数据;
对该语音数据进行分片处理,得到至少一个语音分片;
获取各个语音分片对应的文本信息,该文本信息是采用如上一个方面所述的语音识别方法得到的;
响应于语音输入完成的指令,在目标页面显示该文本信息。
一方面,提供了一种语音识别装置,该装置中设置有语音识别模型,该装置包括:
语音获取模块,用于获取待识别的语音数据;
语音特征获取模块,用于对该语音数据进行特征提取,得到该语音数据中至少两个语音片段的语音特征;
隐层特征获取模块,用于将该至少两个语音片段的语音特征输入语音识别模型,由该语音识别模型中n个级联的隐层依次对各个该语音片段的语音特征进行处理,得到各个该语音片段的隐层特征,第i个语音片段的隐层特征是基于时序上位于第i个语音片段之后的n个语音片段以及第i个语音片段的语音特征确定的;
文本信息获取模块,用于基于各个该语音片段的隐层特征,得到该语音数据对应的文本信息。
另一方面,提供了一种语音识别装置,该装置包括:
语音获取模块,用于响应于语音输入指令,获取实时输入的语音数据;
分片模块,用于对该语音数据进行分片处理,得到至少一个语音分片;
文本信息获取模块,用于获取各个语音分片对应的文本信息,该文本信息是采用如上一个方面所述的语音识别方法得到的;
显示模块,用于响应于语音输入完成的指令,在目标页面显示该文本信息。
另一方面,提供了一种计算机设备,该计算机设备包括至少一个处理器和至少一个存储器,该至少一个存储器中存储有至少一条程序代码,该至少一条程序代码由该至少一个处理器加载并执行以实现如上一个方面所述的语音识别 方法所执行的操作,或如上另一个方面所述的语音识别方法所执行的操作。
另一方面,提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行以实现如上一个方面所述的语音识别方法所执行的操作,或如上另一个方面所述的语音识别方法所执行的操作。
另一个方面,提供了一种计算机程序产品,上述计算机程序产品包括计算机指令,上述计算机指令存储在计算机可读存储介质中。计算机设备的处理器从上述计算机可读存储介质读取上述计算机指令,上述处理器执行上述计算机指令,使得上述计算机设备执行如上一个方面所述的语音识别方法,或如上另一个方面所述的语音识别方法。
本申请实施例提供的技术方案带来的有益效果至少包括:
从语音数据中提取出至少两个语音片段的语音特征,之后调用语音识别模型对各个语音片段的语音特征进行学习识别,每一个语音片段的隐层特征是基于其后的语音片段的特征学习得到的,使得该语音片段的隐层特征充分学习到了下文信息,最终识别出的文本信息的语言表达更加通顺、语义更加准确,提高了语音识别的准确率。
附图说明
图1是本申请实施例提供的一种语音识别系统的示意图;
图2是本申请实施例提供的一种语音识别方法的流程图;
图3是本申请实施例提供的一种语音识别模型按时序展开的结构示意图;
图4是本申请实施例提供的一种隐层特征提取原理示意图;
图5是本申请实施例提供的一种语音识别模型的示意图;
图6是本申请实施例提供的一种解码器结构示意图;
图7是本申请实施例提供的一种文本信息显示方式示意图;
图8是本申请实施例提供的一种文本信息显示方法的流程图;
图9是本申请实施例提供的一种目标页面示意图;
图10是本申请实施例提供的一种文本信息显示效果示意图;
图11是本申请实施例提供的一种语音识别装置的结构示意图;
图12是本申请实施例提供的一种语音识别装置的结构示意图;
图13是本申请实施例提供的一种终端的结构示意图;
图14是本申请实施例提供的一种服务器的结构示意图。
具体实施方式
首先,对本申请涉及的若干个名词进行解释:
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音技术、自然语言处理技术以及机器学习/深度学习等几大方向。本申请实施例提供的技术方案涉及语音技术、机器学习等。
其中,语音技术(Speech Technology)的关键技术有自动语音识别技术(Automatic Speech Recognition,ASR)、语音合成技术(Text To Speech,TTS)、以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音交互成为未来最被看好的人机交互方式之一。本申请实施例提供的技术方案主要涉及自动语音识别技术,通过自动语音识别技术将语音数据转换为文本信息。
流式语音识别(Streaming Speech Recognition):亦称为在线语音识别(Online Speech Recognition),是一种在流式接收语音数据的同时,进行语音识别的解码流程。应用流式语音识别技术,在用户表达内容的过程可以立即反馈文本内容,交互具有实时性,适用于在线语音听写。
语音特征(Speech Feature):是通过一些信号处理技术从输入的语音数据中提取的特征,通过特征向量的表示形式供声学模型处理,以尽可能降低环境噪声、信道、说话人等因素对识别造成的影响。在本申请中提取语音数据的频谱维度的特征作为语音特征。
图1是本申请实施例提供的一种语音识别系统的示意图,参见图1,该语音识别系统100包括:终端110和语音识别平台140。
终端110通过无线网络或有线网络与语音识别平台140相连。终端110可以是智能手机、游戏主机、台式计算机、平板电脑、电子书阅读器、MP3(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)播放器、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器和膝上型便携计算机中的至少一种。终端110安装和运行有支持语音输入、语音识别的应用程序。该应用程序可以是社交类应用程序、即时通讯类应用程序等。示例性的,终端110是用户使用的终端,终端110中运行的应用程序内登录有用户账号。
语音识别平台140包括一台服务器、多台服务器、云计算平台和虚拟化中心中的至少一种。语音识别平台140用于为支持语音识别的应用程序提供后台服务。可选地,语音识别平台140承担主要语音识别工作,终端110承担次要语音识别工作;或者,语音识别平台140承担次要语音识别工作,终端110承担主要语音识别工作;或者,语音识别平台140或终端110分别可以单独承担语音识别工作。
可选地,语音识别平台140包括:接入服务器、语音识别服务器和数据库。接入服务器用于为终端110提供接入服务。语音识别服务器用于提供语音识别有关的后台服务。语音识别服务器可以是一台或多台。当语音识别服务器是多台时,存在至少两台语音识别服务器用于提供不同的服务,和/或,存在至少两台语音识别服务器用于提供相同的服务,比如以负载均衡方式提供同一种服务,本申请实施例对此不加以限定。语音识别服务器中可以设置有语音识别模型、语言模型等。
终端110可以泛指多个终端中的一个,本实施例仅以终端110来举例说明。
本领域技术人员可以知晓,上述终端的数量可以更多或更少。比如上述终端可以仅为一个,或者上述终端为几十个或几百个,或者更多数量,此时上述语音识别系统还包括其他终端。本申请实施例对终端的数量和设备类型不加以限定。
图2是本申请实施例提供的一种语音识别方法的流程图。该方法可以应用 于上述终端或者服务器,在本申请实施例中,以服务器作为执行主体对该语音识别方法进行介绍,参见图2,该实施例具体可以包括以下步骤:
201、服务器获取待识别的语音数据。
其中,该语音数据可以为用户实时输入的语音数据,也可以为存储在服务器中的一段语音数据,还可以为从音频文件、视频文件中截取的一段语音数据,本申请实施例对具体采用哪种语音数据不作限定。
在本申请实施例中,该服务器可以为目标应用程序的后台服务器,该目标应用程序可以支持语音输入,例如,该目标应用程序可以为即时通讯类应用程序等。在一种可能实现方式中,当用户在任一终端通过该目标应用程序进行语音输入时,终端可以获取用户实时输入的语音数据,对语音数据进行分片、打包等处理后,发送至该服务器,由该服务器执行后续的语音识别步骤。
服务器在获取得到待识别的语音数据之后,对语音数据进行特征提取,得到语音数据中的至少两个语音片段的语音特征,以作为语音识别模型的输入数据,上述至少两个语音片段的语音特征可以采用如下步骤202与步骤203提取得到。
202、服务器对该语音数据进行频谱特征提取,得到该语音数据中各个语音帧对应的频谱特征。
其中,该频谱特征可以用于指示语音数据在频域的变化信息。当然,该服务器也可以提取该语音数据其他维度的特征,本申请实施例对此不作限定,在本申请实施例中,以提取频谱特征为例进行说明。在一种可能实现方式中,该频谱特征的获取过程具体可以包括以下步骤:
步骤一、服务器对语音数据进行预处理。
在本申请实施例中,该预处理过程可以包括预加重、分帧、加窗等处理过程。该与处理过程可以降低由于发声器官、采集语音信号的设备所带来的混叠、高次谐波失真、高频等因素,对语音数据质量的影响。
在一种可能实现方式中,首先,该服务器可以对该语音数据进行预加重,例如,可以通过一个高通滤波器对语音数据进行预加重处理,以对语音数据中的高频部分进行加重,增加高频分辨率,便于后续进行特征提取。然后,该服务器可以按照目标时长,对该语音数据进行分帧,得到该语音数据的多个语音帧,其中,该目标时长可以由开发人员进行设置,本申请实施例对此不作限定,在本申请实施例中,各个语音帧的帧移可以为10ms。最后,该服务器可以对各 个语音帧进行加窗处理,以增强各个语音帧与其前一个、后一个语音帧之间的连续性,例如,可以应用汉明窗等窗函数对各个语音帧进行加窗处理。
需要说明的是,上述对语音数据预处理过程的描述,仅是一种示例性描述,本申请实施例对具体采用哪种预处理方法不作限定。
步骤二、服务器对预处理后的语音数据提取频谱特征。
在本申请实施例中,该服务器可以获取语音数据对应的梅尔频率倒谱系数。
在一种可能实现方式中,首先,该服务器可以对各个语音帧进行快速傅里叶变换,得到各个语音帧在频域上的能量分布信息,即得到各个语音帧的频谱,该服务器可以基于该频谱得到各个语音帧的功率谱,例如,可以对各个语音帧的频谱取模型平方,得到该功率谱。然后,该服务器可以将各个帧的功率谱通过N个Mel(梅尔)尺度的滤波器组,一个滤波器组可以包括M个三角形滤波器,该服务器可以获取各个滤波器组的输出结果,计算其对数能量。其中,N和M均为正整数,其具体数目可以由开发人员进行设置,本申请实施例对此不作限定。最后,该服务器可以对各个对数能量进行离散余弦变换,得到该梅尔频率倒谱系数,将该梅尔频率倒谱系数作为语音数据的频谱特征。
需要说明的是,上述对梅尔频率倒谱系数获取方法的说明,仅是一种示例性说明,本申请实施例对此不作具体限定。
在一种可能实现方式中,语音数据的频谱特征中还可以包括梅尔频率倒谱系数的变化信息,例如,该服务器还可以计算各个语音帧的梅尔频率倒谱系数之间的差分谱,得到一阶差分参数和二阶差分参数。基于一个语音帧的梅尔频率倒谱系数、一阶差分参数、二阶差分参数,确定该频谱特征。该梅尔频率倒谱系数可以用于指示语音数据的静态特征,静态特征之间的差分谱可以用于指示语音数据的动态特征,将静态特征和动态特征相结合,可以提高语音识别结果的准确率。在一种可能实现方式中,该服务器还可以获取各个语音帧的音量信息,即计算各个语音帧的帧能量,将该帧能量添加至该频谱特征中。当然,该频谱特征中还可以包括其他维度的信息,本申请实施例对此不作限定。
需要说明的是,上述对频谱特征获取方法的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法获取该频谱特征不作限定。
203、服务器将属于同一个语音片段的至少一个语音帧所对应的频谱特征,拼接为一个语音特征。
其中,一个语音片段可以包括语音数据中的多个语音帧,其具体数目可以 由开发人员进行设置,本申请实施例对此不作限定。在一种可能实现方式中,该服务器可以按照预设时长对语音数据进行划分,得到多个语音片段,一个语音片段可以包括多个语音帧,一个语音片段的语音特征可以由各个语音帧的频谱特征组成。其中,该预设时长可以由开发人员进行设置,本申请实施例对此不作限定。以该预设时长设置为600ms,每个语音帧时长为10ms为例,该服务器获取到的各个语音片段可以包括60个语音帧。当最后一个语音片段中的语音帧数目不足60时,该服务器可以对最后一个语音帧进行复制,将复制出的语音帧添加在该最后一个语音帧之后,以补齐最后一个语音片段的数据。在一种可能实现方式中,该服务器还可以将多个语音帧组合为一个中间帧,例如,可以将三个语音帧组成一个中间帧,每一个中间帧时长为30ms,每个600ms的语音片段可以包括20个中间帧,当最后一个语音片段所包含的中间帧数量不足时,该服务器可以对最后一个中间帧进行复制,将复制出的中间帧添加在该最后一个中间帧之后,以补齐最后一个语音片段的数据。本申请实施例对语音片段的具体确定方法不作限定,在本申请实施例中,以该语音片段中包括多个中间帧为例进行说明。
在本申请实施例中,可以基于各个语音帧的时序顺序,对各个语音帧对应的频谱特征进行拼接,得到语音特征,例如,当频谱特征表示为向量时,可以按照各个语音帧的时序,将各个向量首尾相接,得到一个高维度的向量作为该语音特征。当然,该服务器还可以应用其他方法进行特征拼接,本申请实施例对此不作限定。
在本申请实施例中,通过特征拼接得到语音片段的语音特征,在后续语音识别过程中,以语音片段为单位,对语音片段中的多个语音帧同步处理,可以提高语音识别效率。
需要说明的是,上述步骤202和步骤203通过特征提取的方式提取出语音数据的关键信息,将关键信息以特征的形式应用于后续的识别过程中,确保识别结果准确。
204、服务器将该至少两个语音片段的语音特征输入该语音识别模型。
服务器中设置有语音识别模型,其中,该语音识别模型可以基于一个语音片段之后的多个语音片段的语音特征,对该一个语音片段进一步特征提取,使该一个语音片段的特征可以融合之后的多个语音片段的信息。例如,该服务器在通过该语音识别模型对一个语音片段进行特征提取时,可以获取时序上位于 该一个语音片段之后的多个语音片段的特征,将该多个语音片段的特征与该一个语音片段的特征进行加权运算,则该一个语音片段对应的输出结果可以融合其之后的多个语音片段的特征。
在一种可能实现方式中,该语音识别模型可以为基于TLC-BLSTM(Tri-Latency-Controlled Bidirectional Long Short-Term Memory,三角延时控制-双向长短时记忆网络)构建的模型,该语音识别模型可以包括多个隐层,一个隐层可以由一个TLC-BLSTM网络构成,一个隐层可以包括多个隐层单元,各个隐层单元可以对每次的输入信息进行记忆,将该输入信息保存在网络内部,并应用于当前运算过程中。在本申请实施例中,对于一个隐层中的任一个隐层单元,可以基于当前输入信息、当前隐层中前一隐层单元的运算结果、前一隐层中一个隐层单元的运算结果进行加权运算,得到当前输入信息对应的特征。参见图3,图3是本申请实施例提供的一种语音识别模型按时序展开的结构示意图,该语音识别模型可以包括隐层L1、L2和L3,该语音识别模型的输入信息为C1、C2和C3,对于隐层L2中的隐层单元301,该隐层单元301可以基于当前隐层L2中前一隐层单元302的运算结果、前一隐层L1中隐层单元303的运算结果,得到C2对应的特征。
在本申请实施例中,基于TLC-BLSTM网络构建语音识别模型,可以支持语音分片进行识别解码,支持流式语音识别的实现;TLC-BLSTM网络在正向识别的基础上增添逆向识别的建模,有效提高了语音识别模型的鲁棒性;再次,TLC-BLSTM网络通过三角视野的方法,如图3中的信息传输路径所示,各个隐层单元的运算结果的传输路径可以作为三角形的斜边,这种三角视野的方法拓展了逆向LSTM的视野宽度,在同样大小的语音片段下,充分获取下文信息。表1是本申请实施例提供的一种语音识别模型的性能对比表,表1展示了语音识别模型分别基于LSTM(Long Short-Term Memory,长短时记忆网络)、BLSTM(Bidirectional Long Short-Term Memory,双向长短时记忆网络)、LC-BLSTM(Latency-Controlled Bidirectional Long Short-Term Memory,延时控制-双向长短时记忆网络)、TLC-BLSTM所构建的模型的性能信息,表1中示出了实时性和鲁棒性两个维度的性能对比信息。
表1
模型 LSTM BLSTM LC-BLSTM TLC-BLSTM
实时性
鲁棒性
基于表1中的信息,语音识别模型为基于TLC-BLSTM构建的模型时,可以良好的支持实时语音识别,即支持流式运算,而且其基于三角延时控制的特性,可以有效拓宽语音信号的视野,对语音特征由宽到窄、循序渐进地映射建模,在保证实时性的同时有效兼顾了模型的鲁棒性。
205、服务器通过该语音识别模型获取各个语音片段的隐层特征。
在本申请实施例中,该服务器可以通过该语音识别模型中的各个隐层,对各个语音片段的语音特征进行处理,得到隐层特征。
在一种可能实现方式中,对于该语音识别模型中的第一个隐层,该第一个隐层以任一语音片段的语音特征、该任一语音片段的后一个语音片段的语音特征为输入,以该任一语音片段对应的初始隐层特征为输出;对于该语音识别模型中的任一个中间隐层,该中间隐层以该任一语音片段的初始隐层特征、该任一语音片段的后一个语音片段的初始隐层特征为输入,以该任一语音片段对应的中间隐层特征为输出;对于该语音识别模型中的最后一个隐层,该最后一个隐层以该任一语音片段的中间隐层特征、该任一语音片段的后一个语音片段的中间隐层特征为输入,以该任一语音片段对应的隐层特征为输出。
也即,服务器将至少两个语音片段的语音特征输入语音识别模型;将第i个语音片段的语音特征和第i+1个语音片段的语音特征输入语音识别模型中的第一个隐层,输出第i个语音片段的初始隐层特征;将第i个语音片段的初始隐层特征和第i+1个语音片段的初始隐层特征输入语音识别模型中的第一个中间隐层,输出第i个语音片段的第一个中间隐层特征;其中,第i+1个语音片段的初始隐层特征是第一个隐层基于第i+1个语音片段的语音特征和第i+2个语音片段的语音特征运算得到的;将第i个语音片段的第j个中间隐层特征和第i+1个语音片段的第j个中间隐层特征输入语音识别模型中的第j+1个中间隐层,输出第i个语音片段的第j+1个中间隐层特征;其中,第i+1个语音片段的第j个中间隐层特征是第j个中间隐层基于第i+1个语音片段的第j-1个中间隐层特征和第i+2个语音片段的第j-1个中间隐层特征运算得到的;将第i个语音片段的最后一个中间隐层特征输入语音识别模型中的最后一个隐层,输出第i个语音片段的隐层特征;其中,i、j为正整数,第0个中间隐层特征是指初始隐层特征。
其中,对于该语音识别模型中的任一个隐层,该服务器可以通过该任一个隐层对该任一语音片段的特征进行正向运算,得到第一特征;通过该任一个隐 层对该任一语音片段的特征、该后一语音片段的特征进行逆向运算,得到第二特征;对该第一特征和该第二特征进行拼接,得到该任一个隐层输出的特征。
也即,服务器将第i个语音片段的特征输入语音识别模型中的第k个隐层,通过第k个隐层对第i个语音片段的特征进行正向运算,得到第一特征;通过第k个隐层对第i个语音片段的特征、第i+1个语音片段的特征进行逆向运算,得到第二特征;对第一特征和第二特征进行拼接,得到第i个隐层输出的特征,其中,k为正整数。
在一种可能实现方式中,该第二特征可以由服务器基于任意语音片段中的全部中间帧、该后一语音片段中的部分中间帧得到,例如,该服务器可以获取该后一语音片段中第二目标数量个语音帧,通过该任一个隐层对该任一语音片段、该第二目标数量个语音帧对应的特征进行逆向运算,得到该第二特征。其中,该第二目标数量的数值小于或等于一个语音片段所包含语音帧的总帧数,该第二目标数量的数值可以由开发人员进行设置,本申请实施例对此不做限定。
也即,服务器获取第i+1个语音片段中第二目标数量个语音帧;通过第k个隐层对第i个语音片段、第二目标数量个语音帧对应的特征进行逆向运算,得到第二特征。
具体地,以该语音识别模型中的第一个隐层为例对上述隐层特征的提取过程进行说明。在本申请实施例中,可以将任一语音片段标记为C curr,将时序上位于该语音片段C curr之后的语音片段标记为C right,该语音片段C curr和语音片段C right均可以包括N c个中间帧。其中,N c为正整数,其具体数值可以由开发人员进行设置。参见图4,图4是本申请实施例提供的一种隐层特征提取原理示意图。在一种可能实现方式中,该服务器可以获取该语音片段C curr的前一个语音片段对应的第一特征401,由第一个隐层中的一个隐层单元,基于该第一特征401对语音片段C curr进行正向运算,具体地,一个隐层单元中可以包括多个子隐层单元,由各个子隐层单元对语音片段C curr对应的语音特征进行正向运算,一个子隐层单元可以获取前一个子隐层单元的运算结果,例如,子隐层单元402可以获取前一个子隐层单元403的运算结果,该服务器可以获取最后一个子隐层单元的运算结果,作为该语音片段C curr的第一特征。在一种可能实现方式中,该服务器可以获取语音片段C right的前N r个中间帧,按照时序,将该语音片段C curr的N c个中间帧与该前N r个中间帧进行拼接,得到语音片段C merge。其中,N r为正整数,N r小于或者等于N c,其具体数值可以由开发人员进行设置。该服务 器通过多个子隐层单元对语音片段C merge中的N c+N r帧进行逆向LSTM运算,也即是一个子隐层单元可以获取后一个子隐层单元的运算结果,例如,子隐层单元404可以获取后一个子隐层单元405的运算结果,该服务器可以获取最前一个子隐层单元的输出结果,作为该语音片段C curr的第二特征。
在本申请实施例中,该服务器可以按照目标顺序对该第一特征和该第二特征进行拼接。例如,当该第一特征和该第二特征均表示为向量时,该服务器可以将两个向量拼接为一个更高维度的向量。当然,该服务器还可对该第一特征和该第二特征赋予不同的权重,将加权后的该第一特征和该第二特征进行拼接。本申请实施例对具体采用哪种特征拼接方式不作限定。
需要说明的是,上述对获取隐层特征的方法的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法获取各个语音帧的隐层特征不作限定。
需要说明的是,上述步骤204和步骤205,是将该至少两个语音片段的语音特征输入语音识别模型,由该语音识别模型中n个级联的隐层依次对各个该语音片段的语音特征进行处理,得到各个该语音片段的隐层特征,一个语音片段的一个隐层特征是基于时序上位于该一个语音片段之后的该n个语音片段以及该一个语音片段的语音特征确定的,也即第i个语音片段的隐层特征是基于时序上位于第i个语音片段之后的n个语音片段以及第i个语音片段的语音特征确定的这一步骤的实现。其中,n为正整数,n也即是语音识别模型中隐层的数量,可以由开发人员进行设置,本申请实施例对此不作限定。例如,当语音识别模型包含三个隐层时,一个语音片段的隐层特征可以基于其后三个语音片段的信息确定,而其后三个语音片段还可以包括其他语音片段的信息,本申请实施例对此不作限定。在本申请实施例中,在对一个语音片段进行语音识别时,通过多个级联的隐层进行逆向运算,可以获取到该一个语音片段之后的多个语音片段的特征,充分结合下文信息对当前语音片段进行识别,进而可以提高识别结果的准确性。
206、服务器基于该隐层特征,确定各个该语音片段对应的音素信息。
本申请实施例中,该语音识别模型还可以包括一个特征分类层,该特征分类层可以用于对各个语音片段的隐层特征进行分类,得到各个语音片段对应的音素信息,即得到语音片段对应于各个音素的概率值。在一种可能实现方式中,该特征分类层可以是基于全连接层和SoftMax函数(逻辑回归函数)构建的。该服务器可以将语音片段对应的隐层特征输入一个全连接层,基于全连接层中 的权重参数,将隐层特征映射为一个向量,再通过SoftMax函数将该向量中的各个元素映射为(0,1)中的数值,即得到一个概率向量,该概率向量中的一个元素可以指示语音片段对应于某一个音素的概率值。该服务器可以获取各个语音片段对应的概率向量作为该音素信息。参见图5,图5是本申请实施例提供的一种语音识别模型的示意图,输入的语音特征经过多个隐层501的运算后,该服务器可以将运算结果输入特征分类层502,由特征分类层502输出分类结果。
需要说明的是,上述对获取音素信息的方法的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法获取该音素信息不作限定。
207、服务器基于该音素信息、发音词典以及语言模型,确定该语音数据对应的文本信息。
服务器中设置有发音词典和语音模型,其中,该发音词典用于指示音素与发音之间的映射关系,该语言模型用于确定组成该文本信息的各个词组所对应的概率值。
在一种可能实现方式中,该服务器在基于该音素信息确定文本信息前,可以对音素信息进行进一步处理,以提高语音转文本的准确率。例如,该服务器可以基于贝叶斯公式、隐马尔科夫模型(Hidden Markov Model,HMM)的初始概率矩阵、转移概率矩阵和语音片段对应的概率向量进行前向解码,获得所输入的语音片段对应的隐藏状态序列。本申请实施例对上述前向解码的具体过程不作限定。
该服务器可以基于该隐藏状态序列、发音词典以及语言模型,得到文本信息。在一种可能实现方式中,该服务器可以基于发音词典、语言模型等构建WFST(Weighted Finite State Transducer,加权有限状态机)网络,该WFST可以基于输入信息、输出信息以及输入转为输出的权重值,得到语音片段对应的文字组合。其中,该输入信息可以为语音片段对应的隐藏状态序列,输出信息可以为个音素可能对应的文字。
需要说明的是,上述步骤206和步骤207,是基于各个该语音片段的隐层特征,得到该语音数据对应的文本信息的步骤。上述对基于隐层特征获取文本信息的说明,仅是一种示例性说明,本申请实施例对此不作具体限定。
上述所有可选技术方案,可以采用任意结合形成本申请的可选实施例,在此不再一一赘述。
本申请实施例提供的技术方案,是通过获取待识别的语音数据;对该语音 数据进行特征提取,得到该语音数据中至少两个语音片段的语音特征;将该至少两个语音片段的语音特征输入语音识别模型,由该语音识别模型中n个级联的隐层依次对各个语音片段的语音特征进行处理,得到各个语音片段的隐层特征,一个语音片段的一个隐层特征是基于时序上位于该一个语音片段之后的n个语音片段确定的;最终基于各个语音片段的隐层特征,得到该语音数据对应的文本信息。
该方法从语音数据中提取出至少两个语音片段的语音特征,之后调用语音识别模型对各个语音片段的语音特征进行学习识别,每一个语音片段的隐层特征是基于其后的语音片段的特征学习得到的,使得该语音片段的隐层特征充分学习到了下文信息,最终识别出的文本信息的语言表达更加通顺、语义更加准确,提高了语音识别的准确率。该方法还通过对语音片段中信号帧的正向运算学习上文信息,使得基于上下文信息识别出的文本信息更加准确,进一步地提高了语音识别的准确率。
在本申请实施例中个,频谱特征提取模块以及该语音识别模型可以组成声学模型,该声学模型、发音词典以及语言模型可以组成解码器,应用该解码器,可以实现对流式语音数据的识别。参见图6,图6是本申请实施例提供的一种解码器结构示意图,该解码器601可以包括声学模型602、发音词典603以及语言模型604,声学模型602可以包括频谱特征提取模块605以及该语音识别模型606。本申请实施例提供的技术方案,基于TLC-BLSTM网络构建语音识别模型,实现流式语音识别,可以稳定地输出高识别率、低延时的语音识别服务。表2示出了语音识别模型分别基于LSTM、BLSTM、LC-BLSTM、TLC-BLSTM构建时模型的语音识别效果,包括延迟时间、平均错字率,其中,该平均错字率可以表示识别每100个字中错误的字数,平均字错率为本方案在多个干净与带噪测试集合下的平均表现。
表2
模型 LSTM BLSTM LC-BLSTM TLC-BLSTM
延迟时间 150ms 不支持流式 250ms 300ms
平均字错率 14.47 12.05 12.5 12.06
基于表2所展示的信息,本申请实施例提供的技术方案不仅支持流式语音识别,而且与支持流式语音识别的其他方案,例如,LSTM、LC-BLSTM,一样拥有低于500ms的低延时,且本方案在平均字错率方面,与BLSTM模型效果 相当,较LSTM、LC-BLSTM有所降低。本方案可以实现稳定低延时的输出,识别结果精准。
上述实施例主要介绍了一种语音识别方法,在本申请实施例中,该服务器上可以设置有显示设备,该服务器可以将语音数据对应的文本信息在显示设备的目标页面进行显示;该服务器还可以将该语音数据对应的文本信息发送至用户终端,在用户终端的目标页面上显示该语音数据对应的文本信息。其中,该目标页面可以为会话页面、搜索页面等,本申请实施例对此不作限定。
图7是本申请实施例提供的一种文本信息显示方式示意图,参见图7,以该目标页面701为会话页面为例,该服务器可以对用户在该会话页面701输入的语音数据进行语音识别,将该语音数据对应的文本信息702返回至终端,在该目标页面701的会话显示区域703进行显示。
本申请实施例提供的技术方案,可以应用于任一需要输入文本信息的场景中,例如,本方案还可以应用于搜索场景,用户在搜索信息时无需逐字输入搜索内容,只需录入语音即可。本方案提供的语音识别技术,可以有效提高信息输入效率。
在本申请实施例中,上述语音识别技术部署于服务器中,该服务器可以为目标应用程序的后台服务器,该服务器可以为该目标应用程序提供语音识别服务。例如,该目标应用程序为即时通讯类应用程序时,用户可以通过语音输入的形式在终端录入语音信息,由服务器将语音信息转换为文本信息,以提高信息输入的效率。图8是本申请实施例提供的一种文本信息显示方法的流程图,参见图8,该方法具体可以包括以下步骤:
801、终端响应于语音输入指令,获取实时输入的语音数据。
其中,该终端可以为用户所使用的计算机设备,例如,该终端可以为手机、电脑等,该终端可以安装和运行有该目标应用程序。
在一种可能实现方式中,该终端检测到用户对语音输入控件的触发操作时,可以获取用户实时输入的语音数据。其中,该触发操作可以为点击操作、长按操作等,本申请实施例对此不作限定。
802、终端对该语音数据进行分片处理,得到至少一个语音分片。
在一种可能实现方式中,该终端可以按照目标周期对用户输入的语音数据 进行分片处理,其中,该目标周期可以由开发人员进行设置,本申请实施例对此不作限定。
803、终端获取各个该语音分片对应的文本信息。
在一种可能实现方式中,该终端可以向服务器发送语音识别请求,该语音识别请求携带该至少一个语音分片,例如,该终端获取到语音分片后,可以将语音分片利用网络协议打包,再将语音识别请求与打包后的语音数据发送至服务器。该终端可以接收该服务器返回的该文本信息,该文本信息是该服务器基于该语音分片中各个语音片段的隐层特征确定的。
在一种可能实现方式中,该终端获取到语音数据后,可以对语音数据进行实时识别,得到文本信息。本申请实施例对此不作限定,在本申请实施例中,以服务器执行语音识别的步骤为例进行说明。
需要说明的是,上述语音识别的方法,与上述步骤202至步骤207中语音识别的方法同理,在此不作赘述。
804、终端响应于语音输入完成的指令,在目标页面显示该文本信息。
其中,该目标页面可以为会话页面,终端可以将获取到的文本信息在该会话页面进行显示。在一种可能实现方式中,该会话页面可以显示有语音输入控件,当终端检测到用户对语音输入控件的触发操作时,可以在该会话界面中显示语音输入窗,用户可以在该语音输入窗显示时录入语音数据,终端通过语音采集设备(比如麦克风)来采集语音数据,当用户触发语音输入完成的指令时,该终端可以隐藏该语音输入窗,终端可以从服务器中获取到本次语音输入对应的全部文本信息,在该会话页面(也即是目标页面)中显示该文本信息。参见图9,示出了本申请实施例提供的一种目标页面示意图,例如图9中的(a)图,该目标页面901可以包括语音输入窗902,该语音输入窗902中可以显示有录音控件903,该终端检测到用户按住该录音控件903时,即可开始采集用户输入的语音信息,当用户松开该录音控件903时,即可触发语音输入完成的指令,该终端可以获取本次输入语音数据对应的文本信息,在目标页面进行显示。
在本申请实施例中,该目标页面也可以为搜索页面,终端可以将获取到的文本信息在该搜索页面进行显示。在一种可能实现方式中,用户在进行搜索时,可以将搜索内容以语音的形式录入,当接收到语音输入完成的指令时,获取本次输入语音数据对应的文本信息,在该搜索页面的搜索框中显示。参见图9中的(b)图,在结束录音之后,在目标页面904中显示有结束录音控件905,当 用户点击该结束录音控件905时,即可触发语音输入完成的指令,该终端可以获取本次输入语音数据对应的文本信息在目标页面进行显示,比如,将文本信息显示在搜索框内。
当然,该语音输入完成的指令还可以通过其他方式触发,本申请实施例对此不作限定。
在一种可能实现方式中,该终端可以在用户进行语音输入的过程中实时显示已输入语音数据对应的文本信息。参见图10,图10是本申请实施例提供的一种文本信息显示效果示意图,以该目标页面为会话页面为例,该会话页面1001中可以显示有语音输入窗1002,该语音输入窗1002可以包括文本信息显示区域1003,终端在将语音分片发送至服务器后,服务器可以实时返回语音分片中语音数据的识别结果,终端可以将各个语音分片对应的识别结果在文本信息显示区域1003中显示。当用户语音输入完毕,终端可以将语音输入完成的指令发送至服务器,服务器结束语音识别,终端可以获取到本次语音识别的全部识别结果,即本次输入语音数据对应的文本信息,终端可以隐藏语音输入窗1002,在该会话页面1001显示该文本信息。例如,该目标页面可以为会话页面等。
在一种可能实现方式中,该终端可以在语音输入完成后,再显示语音数据对应的文本信息。即终端可以基于用户输入的语音数据以及语音输入完成的指令,从服务器中获取全部文本信息,在目标页面显示该文本信息。
需要说明的是,上述对文本信息显示方式的说明,仅是一种示例性说明,本申请实施例对具体采用哪种文本信息显示方式不作限定。
本申请实施例提供的技术方案,将语音识别方案放于云服务中,作为一种基础技术赋能于使用该云服务的用户,使用户在进行输入文本信息时,无需再应用拼音、笔画等传统输入方法,直接说话即可实现快速文本输入。而且,该语音识别方案可以对流式语音,即实时输入的语音,进行实时识别,可以缩短识别时间,提高语音识别效率。
上述所有可选技术方案,可以采用任意结合形成本申请的可选实施例,在此不再一一赘述。
图11是本申请实施例提供的一种语音识别装置的结构示意图,该装置可以通过硬件、软件、或者二者的结合实现成为服务器或者终端的部分或者全部,该装置包括:
语音获取模块1101,用于获取待识别的语音数据;
语音特征获取模块1102,用于对该语音数据进行特征提取,得到该语音数据中至少两个语音片段的语音特征;
隐层特征获取模块1103,用于将该至少两个语音片段的语音特征输入语音识别模型,由该语音识别模型中n个级联的隐层依次对各个该语音片段的语音特征进行处理,得到各个该语音片段的隐层特征,第i个语音片段的隐层特征是基于时序上位于第i个语音片段之后的n个语音片段以及第i个语音片段的语音特征确定的;
文本信息获取模块1104,用于基于各个该语音片段的隐层特征,得到该语音数据对应的文本信息。
在一种可能实现方式中,该语音特征获取模块1102用于:
对该语音数据进行频谱特征提取,得到该语音数据中各个语音帧对应的频谱特征;
将属于同一个该语音片段的至少一个语音帧所对应的频谱特征,拼接为一个该语音特征。
在一种可能实现方式中,该隐层特征获取模块1103用于:
将至少两个语音片段的语音特征输入语音识别模型;
将第i个语音片段的语音特征和第i+1个语音片段的语音特征输入语音识别模型中的第一个隐层,输出第i个语音片段的初始隐层特征;
将第i个语音片段的初始隐层特征和第i+1个语音片段的初始隐层特征输入对于语音识别模型中的第一个中间隐层,输出第i个语音片段的第一个中间隐层特征;其中,第i+1个语音片段的初始隐层特征是第一个隐层基于第i+1个语音片段的语音特征和第i+2个语音片段的语音特征运算得到的;
将第i个语音片段的第j个中间隐层特征和第i+1个语音片段的第j个中间隐层特征输入语音识别模型中的第j+1个中间隐层,输出第i个语音片段的第j+1个中间隐层特征;其中,第i+1个语音片段的第j个中间隐层特征是第j个中间隐层基于第i+1个语音片段的第j-1个中间隐层特征和第i+2个语音片段的第j-1个中间隐层特征运算得到的;
将第i个语音片段的最后一个中间隐层特征输入语音识别模型中的最后一个隐层,输出第i个语音片段的隐层特征;其中,i、j为正整数,第0个中间隐层特征是指初始隐层特征。
在一种可能实现方式中,该隐层特征获取模块1103用于:
将第i个语音片段的特征输入语音识别模型中的第k个隐层,通过第k个隐层对第i个语音片段的特征进行正向运算,得到第一特征;
通过第k个隐层对第i个语音片段的特征、第i+1个语音片段的特征进行逆向运算,得到第二特征;
对第一特征和第二特征进行拼接,得到第i个隐层输出的特征,其中,k为正整数。
在一种可能实现方式中,该隐层特征获取模块1103用于:
获取第i+1个语音片段中第二目标数量个语音帧;
通过第k个隐层对第i个语音片段、第二目标数量个语音帧对应的特征进行逆向运算,得到第二特征。
在一种可能实现方式中,该装置中还设置有发音词典与语音模型;该文本信息获取1104模块,用于:
基于隐层特征,确定各个语音片段对应的音素信息;
基于音素信息、发音词典以及语言模型,确定语音数据对应的文本信息,发音词典用于指示音素与发音之间的映射关系,语言模型用于确定组成文本信息的各个词组所对应的概率值。
在一种可能实现方式中,该装置还包括:
显示模块1105,用于将该语音数据对应的该文本信息在目标页面进行显示。
本申请实施例提供的装置,从语音数据中提取出至少两个语音片段的语音特征,之后调用语音识别模型对各个语音片段的语音特征进行学习识别,每一个语音片段的隐层特征是基于其后的语音片段的特征学习得到的,使得该语音片段的隐层特征充分学习到了下文信息,最终识别出的文本信息的语言表达更加通顺、语义更加准确,提高了语音识别的准确率。
图12是本申请实施例提供的一种语音识别装置的结构示意图,参见图12,该装置包括:
语音获取模块1201,用于响应于语音输入指令,获取实时输入的语音数据;
分片模块1202,用于对该语音数据进行分片处理,得到至少一个语音分片;
文本信息获取模块1203,用于获取各个该语音分片对应的文本信息,所述文本信息是采用权利要求1至6任一所述的语音识别方法得到的;
显示模块1204,用于响应于语音输入完成的指令,在目标页面显示该文本信息。
在一种可能实现方式中,该文本信息获取模块1203用于:
向服务器发送语音识别请求,该语音识别请求携带该至少一个语音分片;
接收该服务器返回的该文本信息,该文本信息是该服务器基于该语音分片中各个语音片段的隐层特征确定的。
本申请实施例提供的装置,由服务器从语音数据中提取出至少两个语音片段的语音特征,之后调用语音识别模型对各个语音片段的语音特征进行学习识别,每一个语音片段的隐层特征是基于其后的语音片段的特征学习得到的,使得该语音片段的隐层特征充分学习到了下文信息,最终识别出的文本信息的语言表达更加通顺、语义更加准确,提高了语音识别的准确率。
需要说明的是:上述实施例提供的语音识别装置在语音识别时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的语音识别装置与基于语音识别方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
上述技术方案所提供的计算机设备可以实现为终端或服务器,例如,图13是本申请实施例提供的一种终端的结构示意图。该终端1300可以是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端1300还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端1300包括有:一个或多个处理器1301和一个或多个存储器1302。
处理器1301可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1301可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1301也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下 的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1301可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器1301还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1302可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1302还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1302中的非暂态的计算机可读存储介质用于存储至少一条程序代码,该至少一条程序代码用于被处理器1301所执行以实现本申请中方法实施例提供的语音识别方法。
在一些实施例中,终端1300还可选包括有:外围设备接口1303和至少一个外围设备。处理器1301、存储器1302和外围设备接口1303之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1303相连。具体地,外围设备包括:射频电路1304、显示屏1305、摄像头组件1306、音频电路1307、定位组件1308和电源1309中的至少一种。
外围设备接口1303可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器1301和存储器1302。在一些实施例中,处理器1301、存储器1302和外围设备接口1303被集成在同一芯片或电路板上;在一些其他实施例中,处理器1301、存储器1302和外围设备接口1303中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路1304用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路1304通过电磁信号与通信网络以及其他通信设备进行通信。射频电路1304将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路1304包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。射频电路1304可以通过至少一种无线通信协议来与其它终端进行通信。该无线通信协议包括但不限于:城域网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路1304还可以包括NFC(Near Field Communication,近距离无线 通信)有关的电路,本申请对此不加以限定。
显示屏1305用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏1305是触摸显示屏时,显示屏1305还具有采集在显示屏1305的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器1301进行处理。此时,显示屏1305还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏1305可以为一个,设置终端1300的前面板;在另一些实施例中,显示屏1305可以为至少两个,分别设置在终端1300的不同表面或呈折叠设计;在一些实施例中,显示屏1305可以是柔性显示屏,设置在终端1300的弯曲表面上或折叠面上。甚至,显示屏1305还可以设置成非矩形的不规则图形,也即异形屏。显示屏1305可以采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头组件1306用于采集图像或视频。可选地,摄像头组件1306包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件1306还可以包括闪光灯。闪光灯可以是单色温闪光灯,也可以是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,可以用于不同色温下的光线补偿。
音频电路1307可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器1301进行处理,或者输入至射频电路1304以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在终端1300的不同部位。麦克风还可以是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器1301或射频电路1304的电信号转换为声波。扬声器可以是传统的薄膜扬声器,也可以是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅可以将电信号转换为人类可听见的声波,也可以将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路1307还可以包括耳机插孔。
定位组件1308用于定位终端1300的当前地理位置,以实现导航或LBS (Location Based Service,基于位置的服务)。定位组件1308可以是基于美国的GPS(Global Positioning System,全球定位系统)、中国的北斗系统、俄罗斯的格雷纳斯系统或欧盟的伽利略系统的定位组件。
电源1309用于为终端1300中的各个组件进行供电。电源1309可以是交流电、直流电、一次性电池或可充电电池。当电源1309包括可充电电池时,该可充电电池可以支持有线充电或无线充电。该可充电电池还可以用于支持快充技术。
在一些实施例中,终端1300还包括有一个或多个传感器1310。该一个或多个传感器1310包括但不限于:加速度传感器1311、陀螺仪传感器1312、压力传感器1313、指纹传感器1314、光学传感器1315以及接近传感器1316。
加速度传感器1311可以检测以终端1300建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器1311可以用于检测重力加速度在三个坐标轴上的分量。处理器1301可以根据加速度传感器1311采集的重力加速度信号,控制显示屏1305以横向视图或纵向视图进行用户界面的显示。加速度传感器1311还可以用于游戏或者用户的运动数据的采集。
陀螺仪传感器1312可以检测终端1300的机体方向及转动角度,陀螺仪传感器1312可以与加速度传感器1311协同采集用户对终端1300的3D动作。处理器1301根据陀螺仪传感器1312采集的数据,可以实现如下功能:动作感应(比如根据用户的倾斜操作来改变UI)、拍摄时的图像稳定、游戏控制以及惯性导航。
压力传感器1313可以设置在终端1300的侧边框和/或显示屏1305的下层。当压力传感器1313设置在终端1300的侧边框时,可以检测用户对终端1300的握持信号,由处理器1301根据压力传感器1313采集的握持信号进行左右手识别或快捷操作。当压力传感器1313设置在显示屏1305的下层时,由处理器1301根据用户对显示屏1305的压力操作,实现对UI界面上的可操作性控件进行控制。可操作性控件包括按钮控件、滚动条控件、图标控件、菜单控件中的至少一种。
指纹传感器1314用于采集用户的指纹,由处理器1301根据指纹传感器1314采集到的指纹识别用户的身份,或者,由指纹传感器1314根据采集到的指纹识别用户的身份。在识别出用户的身份为可信身份时,由处理器1301授权该用户执行相关的敏感操作,该敏感操作包括解锁屏幕、查看加密信息、下载软件、 支付及更改设置等。指纹传感器1314可以被设置终端1300的正面、背面或侧面。当终端1300上设置有物理按键或厂商Logo时,指纹传感器1314可以与物理按键或厂商Logo集成在一起。
光学传感器1315用于采集环境光强度。在一个实施例中,处理器1301可以根据光学传感器1315采集的环境光强度,控制显示屏1305的显示亮度。具体地,当环境光强度较高时,调高显示屏1305的显示亮度;当环境光强度较低时,调低显示屏1305的显示亮度。在另一个实施例中,处理器1301还可以根据光学传感器1315采集的环境光强度,动态调整摄像头组件1306的拍摄参数。
接近传感器1316,也称距离传感器,通常设置在终端1300的前面板。接近传感器1316用于采集用户与终端1300的正面之间的距离。在一个实施例中,当接近传感器1316检测到用户与终端1300的正面之间的距离逐渐变小时,由处理器1301控制显示屏1305从亮屏状态切换为息屏状态;当接近传感器1316检测到用户与终端1300的正面之间的距离逐渐变大时,由处理器1301控制显示屏1305从息屏状态切换为亮屏状态。
本领域技术人员可以理解,图13中示出的结构并不构成对终端1300的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
图14是本申请实施例提供的一种服务器的结构示意图,该服务器1400可因配置或性能不同而产生比较大的差异,可以包括一个或多个处理器(Central Processing Units,CPU)1401和一个或多个的存储器1402,其中,该一个或多个存储器1402中存储有至少一条程序代码,该至少一条程序代码由该一个或多个处理器1401加载并执行以实现上述各个方法实施例提供的方法。当然,该服务器1400还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器1400还可以包括其他用于实现设备功能的部件,在此不做赘述。
在示例性实施例中,还提供了一种计算机可读存储介质,例如包括至少一条程序代码的存储器,上述至少一条程序代码可由处理器执行以完成上述实施例中的语音识别方法。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、 只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来至少一条程序代码相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
上述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种语音识别方法,其特征在于,应用于计算机设备中,所述计算机设备中设置有语音识别模型,所述方法包括:
    获取待识别的语音数据;
    对所述语音数据进行特征提取,得到所述语音数据中至少两个语音片段的语音特征;
    将所述至少两个语音片段的语音特征输入所述语音识别模型,由所述语音识别模型中n个级联的隐层依次对各个所述语音片段的语音特征进行处理,得到各个所述语音片段的隐层特征,第i个语音片段的隐层特征是基于时序上位于所述第i个语音片段之后的n个语音片段以及所述第i个语音片段的语音特征确定的,n、i均为正整数;
    基于各个所述语音片段的隐层特征,得到所述语音数据对应的文本信息。
  2. 根据权利要求1所述的方法,其特征在于,所述对所述语音数据进行特征提取,得到所述语音数据中至少两个语音片段的语音特征,包括:
    对所述语音数据进行频谱特征提取,得到所述语音数据中各个语音帧对应的频谱特征;
    将属于同一个所述语音片段的至少一个语音帧所对应的频谱特征,拼接为一个所述语音特征。
  3. 根据权利要求1所述的方法,其特征在于,所述将所述至少两个语音片段的语音特征输入语音识别模型,由所述语音识别模型中n个级联的隐层依次对各个所述语音片段的语音特征进行处理,得到各个所述语音片段的隐层特征,包括:
    将所述至少两个语音片段的语音特征输入所述语音识别模型;
    将所述第i个语音片段的语音特征和第i+1个语音片段的语音特征输入所述语音识别模型中的第一个隐层,输出所述第i个语音片段的初始隐层特征;
    将所述第i个语音片段的初始隐层特征和所述第i+1个语音片段的初始隐层特征输入所述语音识别模型中的第一个中间隐层,输出所述第i个语音片段的第 一个中间隐层特征;其中,所述第i+1个语音片段的初始隐层特征是所述第一个隐层基于所述第i+1个语音片段的语音特征和第i+2个语音片段的语音特征运算得到的;
    将所述第i个语音片段的第j个中间隐层特征和所述第i+1个语音片段的第j个中间隐层特征输入所述语音识别模型中的第j+1个中间隐层,输出所述第i个语音片段的第j+1个中间隐层特征;其中,所述第i+1个语音片段的第j个中间隐层特征是所述第j个中间隐层基于所述第i+1个语音片段的第j-1个中间隐层特征和所述第i+2个语音片段的第j-1个中间隐层特征运算得到的;
    将所述第i个语音片段的最后一个中间隐层特征输入所述语音识别模型中的最后一个隐层,输出所述第i个语音片段的隐层特征;其中,j为正整数,第0个中间隐层特征是指初始隐层特征。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    将所述第i个语音片段的特征输入所述语音识别模型中的第k个隐层,通过所述第k个隐层对所述第i个语音片段的特征进行正向运算,得到第一特征;
    通过所述第k个隐层对所述第i个语音片段的特征、所述第i+1个语音片段的特征进行逆向运算,得到第二特征;
    对所述第一特征和所述第二特征进行拼接,得到所述第i个隐层输出的特征,其中,k为正整数。
  5. 根据权利要求4所述的方法,其特征在于,所述通过所述第k个隐层对所述第i个语音片段的特征、所述第i+1个语音片段的特征进行逆向运算,得到第二特征,包括:
    获取所述第i+1个语音片段中第二目标数量个语音帧;
    通过所述第k个隐层对所述第i个语音片段、所述第二目标数量个语音帧对应的特征进行逆向运算,得到所述第二特征。
  6. 根据权利要求1所述的方法,其特征在于,所述计算机设备中还设置有发音词典与语言模型;
    所述基于各个所述语音片段的隐层特征,得到所述语音数据对应的文本信 息,包括:
    基于所述隐层特征,确定各个所述语音片段对应的音素信息;
    基于所述音素信息、所述发音词典以及所述语言模型,确定所述语音数据对应的文本信息,所述发音词典用于指示音素与发音之间的映射关系,所述语言模型用于确定组成所述文本信息的各个词组所对应的概率值。
  7. 根据权利要求1所述的方法,其特征在于,所述基于各个所述语音片段的隐层特征,得到所述语音数据对应的文本信息之后,所述方法还包括:
    将所述语音数据对应的所述文本信息在目标页面进行显示。
  8. 一种语音识别方法,其特征在于,应用于终端中,所述方法包括:
    响应于语音输入指令,获取实时输入的语音数据;
    对所述语音数据进行分片处理,得到至少一个语音分片;
    获取各个所述语音分片对应的文本信息,所述文本信息是采用如上所述的权利要求1至6任一所述的语音识别方法得到的;
    响应于语音输入完成的指令,在目标页面显示所述文本信息。
  9. 根据权利要求8所述的方法,其特征在于,所述获取各个所述语音分片对应的文本信息,包括:
    向服务器发送语音识别请求,所述语音识别请求携带所述至少一个语音分片;
    接收所述服务器返回的所述文本信息,所述文本信息是所述服务器基于所述语音分片中各个语音片段的隐层特征确定的。
  10. 一种语音识别装置,其特征在于,所述装置中设置有语音识别模型,所述装置包括:
    语音获取模块,用于获取待识别的语音数据;
    语音特征获取模块,用于对所述语音数据进行特征提取,得到所述语音数据中至少两个语音片段的语音特征;
    隐层特征获取模块,用于将所述至少两个语音片段的语音特征输入所述语 音识别模型,由所述语音识别模型中n个级联的隐层依次对各个所述语音片段的语音特征进行处理,得到各个所述语音片段的隐层特征,第i个语音片段的隐层特征是基于时序上位于所述第i个语音片段之后的n个语音片段以及所述第i个语音片段的语音特征确定的,n、i均为正整数;
    文本信息获取模块,用于基于各个所述语音片段的隐层特征,得到所述语音数据对应的文本信息。
  11. 根据权利要求10所述的装置,其特征在于,所述语音特征获取模块用于:
    对所述语音数据进行频谱特征提取,得到所述语音数据中各个语音帧对应的频谱特征;
    将属于同一个所述语音片段的至少一个语音帧所对应的频谱特征,拼接为一个所述语音特征。
  12. 根据权利要求10所述的装置,其特征在于,所述隐层特征获取模块用于:
    将所述至少两个语音片段的语音特征输入所述语音识别模型;
    将所述第i个语音片段的语音特征和第i+1个语音片段的语音特征输入所述语音识别模型中的第一个隐层,输出所述第i个语音片段的初始隐层特征;
    将所述第i个语音片段的初始隐层特征和所述第i+1个语音片段的初始隐层特征输入对于所述语音识别模型中的第一个中间隐层,输出所述第i个语音片段的第一个中间隐层特征;其中,所述第i+1个语音片段的初始隐层特征是所述第一个隐层基于所述第i+1个语音片段的语音特征和第i+2个语音片段的语音特征运算得到的;
    将所述第i个语音片段的第j个中间隐层特征和所述第i+1个语音片段的第j个中间隐层特征输入所述语音识别模型中的第j+1个中间隐层,输出所述第i个语音片段的第j+1个中间隐层特征;其中,所述第i+1个语音片段的第j个中间隐层特征是所述第j个中间隐层基于所述第i+1个语音片段的第j-1个中间隐层特征和所述第i+2个语音片段的第j-1个中间隐层特征运算得到的;
    将所述第i个语音片段的最后一个中间隐层特征输入所述语音识别模型中 的最后一个隐层,输出所述第i个语音片段的隐层特征;其中,j为正整数,第0个中间隐层特征是指初始隐层特征。
  13. 根据权利要求12所述的装置,其特征在于,所述隐层特征获取模块用于:
    将所述第i个语音片段的特征输入所述语音识别模型中的第k个隐层,通过所述第k个隐层对所述第i个语音片段的特征进行正向运算,得到第一特征;
    通过所述第k个隐层对所述第i个语音片段的特征、所述第i+1个语音片段的特征进行逆向运算,得到第二特征;
    对所述第一特征和所述第二特征进行拼接,得到所述第i个隐层输出的特征,其中,k为正整数。
  14. 根据权利要求13所述的装置,其特征在于,所述隐层特征获取模块用于:
    获取所述第i+1个语音片段中第二目标数量个语音帧;
    通过所述第k个隐层对所述第i个语音片段、所述第二目标数量个语音帧对应的特征进行逆向运算,得到所述第二特征。
  15. 根据权利要求10所述的装置,其特征在于,所述装置中还设置有发音词典与语音模型;所述文本信息获取模块,用于:
    基于所述隐层特征,确定各个所述语音片段对应的音素信息;
    基于所述音素信息、所述发音词典以及所述语言模型,确定所述语音数据对应的文本信息,所述发音词典用于指示音素与发音之间的映射关系,所述语言模型用于确定组成所述文本信息的各个词组所对应的概率值。
  16. 根据权利要求10所述的装置,其特征在于,所述装置还包括显示模块,所述显示模块,用于:
    将所述语音数据对应的所述文本信息在目标页面进行显示。
  17. 一种语音识别装置,其特征在于,所述装置包括:
    语音获取模块,用于响应于语音输入指令,获取实时输入的语音数据;
    分片模块,用于对所述语音数据进行分片处理,得到至少一个语音分片;
    文本信息获取模块,用于获取各个所述语音分片对应的文本信息,所述文本信息是采用如上所述的权利要求1至6任一所述的语音识别方法得到的;
    显示模块,用于响应于语音输入完成的指令,在目标页面显示所述文本信息。
  18. 根据权利要求17所述的装置,其特征在于,所述文本信息获取模块,用于:
    向服务器发送语音识别请求,所述语音识别请求携带所述至少一个语音分片;
    接收所述服务器返回的所述文本信息,所述文本信息是所述服务器基于所述语音分片中各个语音片段的隐层特征确定的。
  19. 一种计算机设备,其特征在于,所述计算机设备包括至少一个处理器和至少一个存储器,所述至少一个存储器中存储有至少一条程序代码,所述至少一条程序代码由所述至少一个处理器加载并执行以实现如权利要求1至权利要求7任一项所述的语音识别方法所执行的操作;或如权利要求8或9所述的语音识别方法所执行的操作。
  20. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现如权利要求1至权利要求7任一项所述的语音识别方法所执行的操作;或如权利要求8或9所述的语音识别方法所执行的操作。
PCT/CN2020/123738 2020-01-22 2020-10-26 语音识别方法、装置、计算机设备及计算机可读存储介质 WO2021147417A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/709,011 US20220223142A1 (en) 2020-01-22 2022-03-30 Speech recognition method and apparatus, computer device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010074075.XA CN112750425B (zh) 2020-01-22 2020-01-22 语音识别方法、装置、计算机设备及计算机可读存储介质
CN202010074075.X 2020-01-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/709,011 Continuation US20220223142A1 (en) 2020-01-22 2022-03-30 Speech recognition method and apparatus, computer device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021147417A1 true WO2021147417A1 (zh) 2021-07-29

Family

ID=75645224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123738 WO2021147417A1 (zh) 2020-01-22 2020-10-26 语音识别方法、装置、计算机设备及计算机可读存储介质

Country Status (3)

Country Link
US (1) US20220223142A1 (zh)
CN (1) CN112750425B (zh)
WO (1) WO2021147417A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763932B (zh) * 2021-05-13 2024-02-13 腾讯科技(深圳)有限公司 语音处理方法、装置、计算机设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301864A (zh) * 2017-08-16 2017-10-27 重庆邮电大学 一种基于Maxout神经元的深度双向LSTM声学模型
WO2018071389A1 (en) * 2016-10-10 2018-04-19 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
CN108417202A (zh) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 语音识别方法及系统
CN110415702A (zh) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 训练方法和装置、转换方法和装置

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4625081A (en) * 1982-11-30 1986-11-25 Lotito Lawrence A Automated telephone voice service system
US7467087B1 (en) * 2002-10-10 2008-12-16 Gillick Laurence S Training and using pronunciation guessers in speech recognition
US20120316875A1 (en) * 2011-06-10 2012-12-13 Red Shift Company, Llc Hosted speech handling
US20150032449A1 (en) * 2013-07-26 2015-01-29 Nuance Communications, Inc. Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition
US9620145B2 (en) * 2013-11-01 2017-04-11 Google Inc. Context-dependent state tying using a neural network
US9202469B1 (en) * 2014-09-16 2015-12-01 Citrix Systems, Inc. Capturing noteworthy portions of audio recordings
CN106940998B (zh) * 2015-12-31 2021-04-16 阿里巴巴集团控股有限公司 一种设定操作的执行方法及装置
CN105869624B (zh) * 2016-03-29 2019-05-10 腾讯科技(深圳)有限公司 数字语音识别中语音解码网络的构建方法及装置
US10916235B2 (en) * 2017-07-10 2021-02-09 Vox Frontera, Inc. Syllable based automatic speech recognition
CN107680597B (zh) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 语音识别方法、装置、设备以及计算机可读存储介质
CN109754789B (zh) * 2017-11-07 2021-06-08 北京国双科技有限公司 语音音素的识别方法及装置
US10380997B1 (en) * 2018-07-27 2019-08-13 Deepgram, Inc. Deep learning internal state index-based search and classification
CN110418208B (zh) * 2018-11-14 2021-07-27 腾讯科技(深圳)有限公司 一种基于人工智能的字幕确定方法和装置
CN110415686B (zh) * 2019-05-21 2021-08-17 腾讯科技(深圳)有限公司 语音处理方法、装置、介质、电子设备
CN110176235B (zh) * 2019-05-23 2022-02-01 腾讯科技(深圳)有限公司 语音识别文的展示方法、装置、存储介质和计算机设备
CN110189749B (zh) * 2019-06-06 2021-03-19 四川大学 语音关键词自动识别方法
KR20190080833A (ko) * 2019-06-18 2019-07-08 엘지전자 주식회사 음성 정보 기반 언어 모델링 시스템 및 방법
CN110428809B (zh) * 2019-06-28 2022-04-26 腾讯科技(深圳)有限公司 语音音素识别方法和装置、存储介质及电子装置
CN110600018B (zh) * 2019-09-05 2022-04-26 腾讯科技(深圳)有限公司 语音识别方法及装置、神经网络训练方法及装置
CN110634469B (zh) * 2019-09-27 2022-03-11 腾讯科技(深圳)有限公司 基于人工智能的语音信号处理方法、装置及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018071389A1 (en) * 2016-10-10 2018-04-19 Google Llc Very deep convolutional neural networks for end-to-end speech recognition
CN107301864A (zh) * 2017-08-16 2017-10-27 重庆邮电大学 一种基于Maxout神经元的深度双向LSTM声学模型
CN108417202A (zh) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 语音识别方法及系统
CN110415702A (zh) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 训练方法和装置、转换方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUE SHAOFEI; YAN ZHIJIE: "Improving latency-controlled BLSTM acoustic models for online speech recognition", 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 5 March 2017 (2017-03-05), pages 5340 - 5344, XP033259430, DOI: 10.1109/ICASSP.2017.7953176 *

Also Published As

Publication number Publication date
CN112750425A (zh) 2021-05-04
CN112750425B (zh) 2023-11-03
US20220223142A1 (en) 2022-07-14

Similar Documents

Publication Publication Date Title
EP3792911B1 (en) Method for detecting key term in speech signal, device, terminal, and storage medium
EP4006901A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
US20200294488A1 (en) Method, device and storage medium for speech recognition
CN110379430B (zh) 基于语音的动画显示方法、装置、计算机设备及存储介质
WO2021135628A1 (zh) 语音信号的处理方法、语音分离方法
CN111063342B (zh) 语音识别方法、装置、计算机设备及存储介质
US11935517B2 (en) Speech decoding method and apparatus, computer device, and storage medium
CN110322760B (zh) 语音数据生成方法、装置、终端及存储介质
CN110047468B (zh) 语音识别方法、装置及存储介质
WO2021114847A1 (zh) 网络通话方法、装置、计算机设备及存储介质
CN113763933B (zh) 语音识别方法、语音识别模型的训练方法、装置和设备
CN112735429B (zh) 确定歌词时间戳信息的方法和声学模型的训练方法
CN110992927A (zh) 音频生成方法、装置、计算机可读存储介质及计算设备
CN114333774A (zh) 语音识别方法、装置、计算机设备及存储介质
WO2021147417A1 (zh) 语音识别方法、装置、计算机设备及计算机可读存储介质
CN113220590A (zh) 语音交互应用的自动化测试方法、装置、设备及介质
CN113409770A (zh) 发音特征处理方法、装置、服务器及介质
CN111428079B (zh) 文本内容处理方法、装置、计算机设备及存储介质
CN110337030B (zh) 视频播放方法、装置、终端和计算机可读存储介质
CN112289302A (zh) 音频数据的合成方法、装置、计算机设备及可读存储介质
CN116956814A (zh) 标点预测方法、装置、设备及存储介质
CN112786025B (zh) 确定歌词时间戳信息的方法和声学模型的训练方法
WO2020102943A1 (zh) 手势识别模型的生成方法、装置、存储介质及电子设备
CN110288999B (zh) 语音识别方法、装置、计算机设备及存储介质
CN115394285A (zh) 语音克隆方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20915622

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20915622

Country of ref document: EP

Kind code of ref document: A1