WO2022152029A1 - Speech recognition method and apparatus, computer device, and storage medium - Google Patents

Speech recognition method and apparatus, computer device, and storage medium

Info

Publication number
WO2022152029A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
model
network
data
speech
Prior art date
Application number
PCT/CN2022/070388
Other languages
English (en)
French (fr)
Inventor
苏丹
贺利强
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority to JP2023524506A (published as JP2023549048A)
Publication of WO2022152029A1
Priority to US17/987,287 (published as US20230075893A1)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the technical field of speech recognition, and in particular, to a speech recognition method, apparatus, computer equipment and storage medium.
  • Speech recognition is a technology that recognizes speech as text, and it has a wide range of applications in various artificial intelligence (AI) scenarios.
  • In the related art, the speech recognition model needs to refer to the context information of the speech in the process of recognizing the input speech; that is to say, when recognizing the speech data, both the historical information and the future information of the speech data need to be combined.
  • However, since the speech recognition model introduces future information in the speech recognition process, it causes a certain delay, thereby limiting the application of the speech recognition model in streaming speech recognition.
  • the embodiments of the present application provide a speech recognition method, device, computer equipment, and storage medium, which can reduce the recognition delay in a streaming speech recognition scenario and improve the effect of streaming speech recognition.
  • the technical solution is as follows:
  • an embodiment of the present application provides a speech recognition method for computer equipment, the method comprising:
  • the streaming speech data is processed through a speech recognition model to obtain speech recognition text corresponding to the streaming speech data;
  • the speech recognition model is obtained by performing a neural network structure search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed not to depend on future data;
  • the speech recognition text is output.
  • an embodiment of the present application provides a speech recognition method for a computer device, the method comprising:
  • acquiring a voice training sample, where the voice training sample includes a voice sample and a voice recognition label corresponding to the voice sample;
  • based on the voice training sample, a neural network structure search is performed on an initial network to obtain a network search model;
  • the initial network includes a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed not to depend on future data;
  • a speech recognition model is constructed based on the network search model; the speech recognition model is used for processing the input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
  • an embodiment of the present application provides a speech recognition device, the device comprising:
  • the voice data receiving module is used to receive streaming voice data.
  • a voice data processing module configured to process the streaming voice data through a voice recognition model to obtain a voice recognition text corresponding to the streaming voice data;
  • the voice recognition model is obtained by performing a neural network structure search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed to be independent of future data;
  • a text output module for outputting the speech recognition text.
  • an embodiment of the present application provides a speech recognition device, the device comprising:
  • a sample acquisition module configured to acquire a voice training sample, where the voice training sample includes a voice sample and a voice recognition label corresponding to the voice sample;
  • the network search module is used to perform a neural network structure search on the initial network based on the voice training sample to obtain a network search model;
  • the initial network includes a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed not to depend on future data;
  • the model building module is used for building a speech recognition model based on the network search model; the speech recognition model is used for processing the input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
  • an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, the memory stores at least one computer instruction, and the at least one computer instruction is loaded and executed by the processor to implement the above speech recognition method.
  • an embodiment of the present application provides a computer-readable storage medium, where at least one computer instruction is stored in the storage medium, and the at least one computer instruction is loaded and executed by a processor to implement the above speech recognition method.
  • an embodiment of the present application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the above-mentioned speech recognition method.
  • In the solution shown in the embodiments of the present application, the speech recognition model is constructed by setting the specified operations in the operation space corresponding to the first type of operator in the initial network that need to depend on context information to not depend on future data, and then performing a neural network structure search on the initial network. Since specified operations that do not depend on future data are introduced into the model, and a model structure with higher accuracy can be found through the neural network structure search, the above solution can reduce the recognition delay in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
  • Fig. 1 is a framework diagram of model search and speech recognition according to an exemplary embodiment;
  • FIG. 2 is a schematic flowchart of a speech recognition method according to an exemplary embodiment
  • FIG. 3 is a schematic flowchart of a speech recognition method according to an exemplary embodiment
  • FIG. 4 is a schematic flowchart of a speech recognition method according to an exemplary embodiment
  • FIG. 5 is a schematic diagram of the network structure involved in the embodiment shown in FIG. 4;
  • FIG. 6 is a schematic diagram of a convolution operation involved in the embodiment shown in FIG. 4;
  • FIG. 7 is a schematic diagram of another convolution operation involved in the embodiment shown in FIG. 4;
  • Fig. 8 is a schematic diagram of a causal convolution involved in the embodiment shown in Fig. 4;
  • FIG. 9 is a schematic diagram of another causal convolution involved in the embodiment shown in FIG. 4;
  • FIG. 10 is a schematic diagram of a model construction and speech recognition framework according to an exemplary embodiment
  • FIG. 11 is a block diagram showing the structure of a speech recognition apparatus according to an exemplary embodiment
  • FIG. 12 is a block diagram showing the structure of a speech recognition apparatus according to an exemplary embodiment
  • Fig. 13 is a schematic structural diagram of a computer device according to an exemplary embodiment.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Neural network structure search is a strategy for designing a neural network with an algorithm: when the depth and structure of the network are uncertain, a certain search space is set manually, and the search space is searched according to a designed search strategy for the network structure that performs best on the validation set.
  • Neural network structure search technology consists of three parts: the search space, the search strategy, and performance evaluation. In terms of implementation, it is divided into NAS based on reinforcement learning, NAS based on genetic algorithms (also called evolution-based NAS), and differentiable NAS (also known as gradient-based NAS).
  • NAS based on reinforcement learning uses a recurrent neural network as a controller to generate a sub-network, then trains and evaluates the sub-network to obtain its network performance (such as accuracy), and finally updates the parameters of the controller.
  • However, the performance of the sub-network is non-differentiable, so the controller cannot be directly optimized; only reinforcement learning can be used to update the controller parameters, based on the policy gradient method.
  • This method consumes too many computing resources. The reason is that, in this type of NAS algorithm, in order to fully exploit the "potential" of each sub-network, every time the controller samples a sub-network, its network weights must be initialized and trained from scratch before its performance is verified.
  • Differentiable NAS based on gradient optimization constructs the entire search space as a supernet and then models the training and search process as a bi-level optimization problem; it does not need to sample a sub-network and train it from scratch to verify its performance. Since the supernet itself is composed of a set of sub-networks, the accuracy of the current supernet is used to approximate the performance of the currently most probable sub-network, so this approach has extremely high search efficiency and performance, and has become the mainstream neural network structure search method.
  • In a differentiable NAS, a supernet is a set containing all possible sub-networks. Developers can design a large search space, and this search space forms a supernet containing multiple sub-networks; after training, the performance indicators of each sub-network can be evaluated. All that neural network structure search needs to do is to find the sub-network with the best performance indicator among these sub-networks.
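  • As an illustration of the differentiable-NAS idea described above, the following is a minimal sketch (not the implementation of the present application): each supernet edge holds all candidate operations, and a softmax over learnable architecture weights mixes their outputs. The candidate set shown here is a hypothetical example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One supernet edge: a weighted mixture of all candidate operations.

    The candidate set (3x3 conv, 5x5 conv, identity) is a hypothetical
    example; in practice the operation space is chosen by the developer.
    """
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),
            nn.Identity(),
        ])
        # One learnable architecture weight per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

x = torch.randn(2, 16, 100)   # (batch, channels, frames)
edge = MixedOp(channels=16)
print(edge(x).shape)          # torch.Size([2, 16, 100])
```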
  • Key technologies of speech include automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition. Making computers able to hear, see, speak, and feel is the development direction of human-computer interaction in the future, and voice will become one of the most promising human-computer interaction methods.
  • Fig. 1 is a framework diagram of a model search and speech recognition according to an exemplary embodiment.
  • The model training device 110 performs a neural network structure search on a preset initial network through preset voice training samples, and builds a voice recognition model with high accuracy based on the search results.
  • the speech recognition device 120 recognizes the speech recognition text in the streaming speech data according to the constructed speech recognition model and the input streaming speech data.
  • the above-mentioned initial network may refer to a search space or a supernet in a neural network structure search.
  • the above searched speech recognition model may be a subnet in the supernet.
  • the above-mentioned model training device 110 and speech recognition device 120 may be computer devices with machine learning capabilities.
  • The computer devices may be stationary computer devices such as personal computers and servers, or may be mobile computer devices such as tablet computers and e-book readers.
  • The model training device 110 and the speech recognition device 120 may be the same device, or they may be different devices. When they are different devices, the model training device 110 and the speech recognition device 120 may be devices of the same type, for example, both may be personal computers; alternatively, they may also be devices of different types.
  • The model training device 110 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the voice recognition device 120 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • The above-mentioned model training device performs a neural network structure search on the initial network and builds a speech recognition model based on the search results; the application scenarios of the model may include, but are not limited to, the following:
  • For example, in a network conference scenario, the application of speech recognition is usually involved.
  • The speech in the conference is recognized as speech recognition text through the speech recognition model, and the speech recognition text is displayed on the display screen of the network conference.
  • the recognized speech recognition text may also be translated and displayed (for example, displayed by text or voice).
  • With the speech recognition model involved in the present application, low-latency speech recognition can be achieved, so as to satisfy the requirement of instant speech recognition in the network conference scenario.
  • In a live broadcast scenario, the application of speech recognition will also be involved; for example, the live broadcast scene usually needs subtitles added to the live broadcast picture.
  • the speech recognition model involved in this application can realize low-latency recognition of the speech in the live stream, so that subtitles can be generated as soon as possible and added to the live stream data stream, which is of great significance for reducing the delay of the live stream.
  • In a simultaneous interpretation scenario, the speech recognition model involved in the present application can realize low-latency recognition of the speech of the participants, so that the recognized text or its translation can be quickly presented through the display screen or played as translated speech, thereby realizing automatic real-time translation.
  • Fig. 2 is a schematic flowchart of a speech recognition method according to an exemplary embodiment. The method may be performed by the speech recognition device in the embodiment shown in FIG. 1 above. As shown in Figure 2, the speech recognition method may include the following steps:
  • Step 21: Receive streaming voice data.
  • In the embodiment of the present application, the streaming voice data is audio stream data generated by encoding real-time voice. The streaming voice data has a relatively high latency requirement for voice recognition; that is, it is necessary to ensure that the delay between inputting the streaming voice data and outputting the speech recognition result is short.
  • Step 22: Process the streaming voice data through a speech recognition model to obtain speech recognition text corresponding to the streaming voice data.
  • The speech recognition model is obtained by performing a neural network structure search on an initial network; the initial network contains a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed to be independent of future data.
  • In the embodiment of the present application, the speech recognition model is a streaming speech recognition model (Streaming ASR Model). Different from a non-streaming speech recognition model, which must process the complete sentence audio before returning the speech recognition result, the streaming speech recognition model supports real-time return of speech recognition results while processing streaming speech data.
  • the above-mentioned future data refers to other voice data located after the currently recognized voice data in the time domain.
  • For a specified operation that relies on future data, when the current voice data is recognized through that operation, it is necessary to wait for the arrival of the future data to complete the recognition of the current voice data, which causes a certain delay; as the number of such operations increases, the delay in completing the recognition of the current speech data also increases accordingly.
  • For a specified operation that does not rely on future data, the current voice data can be recognized without waiting for the arrival of future data, and no delay from waiting for future data is introduced in this process.
  • Here, the above-mentioned specified operation that does not depend on future data refers to an operation whose processing, during feature processing of the voice data, can be completed based only on the current voice data and the historical data before the current voice data.
  • Step 23: Output the speech recognition text.
  • To sum up, in the solution shown in the embodiment of the present application, the specified operations that need to depend on context information are set not to depend on future data, and a neural network structure search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with higher accuracy can be found through the neural network structure search, the above solution can reduce the recognition delay in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
  • Fig. 3 is a schematic flowchart of a speech recognition method according to an exemplary embodiment.
  • the method may be performed by the model training device in the embodiment shown in FIG. 1 above, and the speech recognition method may be a method performed based on a neural network structure search.
  • the speech recognition method may include the following steps:
  • Step 31: Acquire a voice training sample, where the voice training sample includes a voice sample and a voice recognition label corresponding to the voice sample.
  • Step 32: Based on the voice training sample, perform a neural network structure search on the initial network to obtain a network search model.
  • The initial network includes a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed to be independent of future data.
  • The embodiment of the present application improves on the traditional NAS solution by designing the specified operations (neural network operations) in the operation space that originally relied on both historical data and future data to rely only on historical data; that is, the specified operations are designed in a delay-free manner, so that a low-latency neural network structure can be found in the subsequent neural network structure search process.
  • The first type of operator is obtained by combining at least one operation in the first operation space.
  • Step 33: Construct a speech recognition model based on the network search model; the speech recognition model is used to process the input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
  • To sum up, in the solution shown in the embodiment of the present application, the specified operations that need to depend on context information are set not to depend on future data, and a neural network structure search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with higher accuracy can be found through the neural network structure search, the above solution can reduce the recognition delay in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
  • Fig. 4 is a schematic flowchart of a speech recognition method according to an exemplary embodiment.
  • the method may be performed by a model training device and a speech recognition device, wherein the model training device and the speech recognition device may be implemented as a single computer device, or may belong to different computer devices.
  • the method may include the following steps:
  • Step 401: The model training device obtains a voice training sample, where the voice training sample includes a voice sample and a voice recognition label corresponding to the voice sample.
  • The voice training sample is a set of samples collected in advance by developers; it includes each voice sample and the voice recognition label corresponding to each voice sample, and the voice recognition labels are used for training and evaluation in the subsequent network structure search process.
  • In a possible implementation, the speech recognition label includes acoustic identification information of the speech sample, and the acoustic identification information includes phonemes, syllables or semi-syllables. That is, the speech recognition label may be information corresponding to the output result of an acoustic model, for example, phonemes, syllables, or semi-syllables.
  • The above-mentioned speech sample may be pre-segmented into several overlapping short-time speech segments (also called speech frames), and each speech frame corresponds to its own phoneme, syllable or semi-syllable. For example, the speech length of each frame after segmentation is 25 ms, and the overlap between adjacent frames is 15 ms. This process is also called "framing".
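  • As a hedged illustration of the framing described above (25 ms frames with a 15 ms overlap between adjacent frames), a minimal NumPy sketch follows; the frame parameters are the example values from the text, while the 16 kHz sample rate is an assumption:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, overlap_ms=15):
    """Split a 1-D waveform into overlapping short-time frames.

    With 25 ms frames and a 15 ms overlap, consecutive frames start
    10 ms apart (the hop size).
    """
    frame_len = int(sample_rate * frame_ms / 1000)               # e.g. 400 samples
    hop_len = frame_len - int(sample_rate * overlap_ms / 1000)   # e.g. 160 samples
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))  # one second of audio
print(frames.shape)  # (98, 400)
```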
  • Step 402: The model training device performs a neural network structure search on the initial network based on the voice training sample to obtain a network search model.
  • In the embodiment of the present application, the initial network includes multiple feature aggregation nodes connected by operators; the operators between the multiple feature aggregation nodes include a first type of operator, and the specified operations contained in the first operation space corresponding to the first type of operator that depend on context information are designed not to depend on future data. A combination of one or more operations in the first operation space is used to implement the first type of operator, and the specified operations are context-dependent neural network operations.
  • In addition to the specified operations, the above-mentioned first operation space may also include operations that do not depend on context information, such as residual connection operations; the embodiment of the present application does not limit the types of operations contained in the first operation space.
  • In a possible implementation, the initial network includes n unit networks, the n unit networks include at least one first unit network, and the first unit network includes an input node, an output node, and at least one feature aggregation node connected by the first type of operator.
  • That is, the above-mentioned initial network can be divided into unit networks; each unit network includes an input node, an output node, and one or more feature aggregation nodes between the input node and the output node.
  • the search space of each unit network in the initial network may be the same or different.
  • In a possible implementation, the n unit networks are connected by at least one of the following connection manners: a dual-link mode, a single-link mode, and a dense-link mode.
  • That is, the unit networks in the above-mentioned initial network are connected through a preset link mode, and the link modes between different unit networks may be the same or different.
  • The embodiment of the present application does not limit the connection modes between the unit networks in the initial network.
  • In a possible implementation, the n unit networks include at least one second unit network, and the second unit network includes an input node, an output node, and at least one feature aggregation node connected by a second type of operator; the second operation space corresponding to the second type of operator contains specified operations that depend on future data, and a combination of one or more operations in the second operation space is used to realize the second type of operator.
  • That is, the search space of the initial network may also include some specified operations that depend on future information (high-latency/uncontrollable-delay operations), that is, the above-mentioned specified operations relying on future data, to ensure that the future information of the current speech data can be utilized while the delay of speech recognition is reduced, thereby ensuring the accuracy of speech recognition.
  • In a possible implementation, a topology structure, or a topology structure and network parameters, are shared among at least one of the first unit networks; and a topology structure, or a topology structure and network parameters, are shared among at least one of the second unit networks.
  • That is, the initial network is divided into unit networks of two or more different types, and the topology structure and network parameters are shared among unit networks of the same type.
  • the topology structure or network parameters may be shared among the unit networks of the same type.
  • Alternatively, the topology structure and network parameters may also be shared only among some of the unit networks of the same type.
  • For example, the initial network includes 4 first unit networks, of which 2 first unit networks share one set of topology structure and network parameters, and the other 2 first unit networks share another set of topology structure and network parameters.
  • Alternatively, the unit networks in the initial network may not share network parameters at all.
  • In a possible implementation, the specified operations that are designed to be independent of future data are causality-based specified operations; alternatively, the specified operations that are designed to be independent of future data are mask-based specified operations.
  • That is, making the specified operations independent of future data can be implemented in a causal manner, or in a mask-based manner.
  • other possible methods may also be used, which are not limited in the embodiments of the present application.
  • the feature aggregation node is configured to perform at least one of a summation operation, a concatenation operation, and a product operation on the input data.
  • the operation corresponding to each feature aggregation node in the initial network may be fixedly set as one operation, for example, fixedly set as a summation operation.
  • the above feature aggregation nodes may also be set to different operations, for example, some feature aggregation nodes are set to a summation operation, and some feature aggregation nodes are set to a splicing operation.
  • Alternatively, the above feature aggregation nodes may not be fixed to specific operations; in this case, the operation corresponding to each feature aggregation node is determined during the neural network structure search process.
  • In a possible implementation, the specified operation includes at least one of a convolution operation, a pooling operation, an operation based on a long short-term memory (Long Short-Term Memory, LSTM) artificial neural network, and an operation based on a gated recurrent unit (Gated Recurrent Unit, GRU).
  • In addition, the above specified operation may also include other neural network operations that depend on context information; the embodiment of the present application does not limit the operation type of the specified operation.
  • In the embodiment of the present application, the model training device performs a neural network structure search based on the initial network to determine a network search model with higher accuracy. In this process, the model training device uses the voice training samples to conduct machine learning training and evaluation on the sub-networks, so as to determine whether each feature aggregation node in the initial network is retained, whether each operator between the retained feature aggregation nodes is retained, the operation type corresponding to each retained operator, the source of each operation, the parameters of the feature aggregation nodes, and other information, and thereby determine, from the initial network, a subnet whose topology and accuracy meet the requirements as the network search model obtained by the search.
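  • A minimal sketch of the final discretization step implied above: after the search, the candidate operation with the largest architecture weight is kept on each edge. The operation names and weight values here are hypothetical placeholders, not values from the present application:

```python
import torch

def derive_architecture(alphas, op_names):
    """Keep, for each edge, the candidate operation with the largest
    architecture weight -- a simplified stand-in for how a subnet
    is read off the trained supernet."""
    return [op_names[int(torch.argmax(edge_alpha))] for edge_alpha in alphas]

# Hypothetical weights for 3 edges over a 3-operation space:
op_names = ["conv3x3", "pool3x3", "skip"]
alphas = torch.tensor([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1],
                       [0.0, 0.1, 3.0]])
print(derive_architecture(alphas, op_names))  # ['pool3x3', 'conv3x3', 'skip']
```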
  • FIG. 5 shows a schematic diagram of a network structure involved in an embodiment of the present application.
  • Figure 5 shows a schematic diagram of a NasNet-based neural architecture search (NAS, Neural Architecture Search) search space, in which the connection between the cells (unit networks) of the macro part 51 is bi-chain-styled, and the node structure of the micro part 52 is op_type (operation type) + connection (connection point).
  • The link mode of the macro structure part is bi-chain-styled: the input of each cell is the output of the previous two cells. This link mode is a fixed, manually designed topology and does not participate in the search. The number of cell layers is variable; it can differ between the search phase and the evaluation phase (which is based on the searched structure), and can also differ for different tasks.
  • Optionally, the linking method of the macro structure can also participate in the search, that is, a non-fixed bi-chain-styled linking method may be used, which is not limited in the embodiments of this application.
  • The micro structure is the topology structure inside the cell, as shown in Figure 5, and can be regarded as a directed acyclic graph.
  • Among them, the nodes IN(1) and IN(2) are the input nodes of the cell; node1, node2, node3, and node4 are intermediate nodes, corresponding to the above feature aggregation nodes (the number is variable). The input of each intermediate node is the output of all preceding nodes; that is, the input of node1 is IN(1) and IN(2), the input of node2 is IN(1), IN(2) and node1, and so on. The node OUT is the output node, and its input is the output of all intermediate nodes.
  • Each edge in the directed acyclic graph has a fixed set of candidate operations (i.e., the operation space), and each node has a set of summarization functions (i.e., various feature aggregation operations). When performing the neural network structure search based on the training samples, the NAS algorithm searches for an optimal link relationship (i.e., topology) and keeps the best candidate operation/function among all candidate operations/functions.
  • The following descriptions of the search algorithm all take this kind of search space as an example. Optionally, the above-mentioned summarization function can also be fixedly set to another function, or the summarization function may not be fixedly set at all.
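  • A minimal sketch of the cell topology just described, in which each intermediate node aggregates (here, by summation) the transformed outputs of all preceding nodes, and the output node concatenates the intermediate nodes; the linear layers are placeholders for whatever candidate operations the search keeps:

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Directed-acyclic-graph cell: node_k = sum of op(node_j) for all j < k."""
    def __init__(self, channels, n_nodes=4):
        super().__init__()
        self.n_nodes = n_nodes
        # One placeholder op per (source, target) edge; nodes 0 and 1 are the
        # two cell inputs, nodes 2..(n_nodes+1) are the intermediate nodes.
        self.edges = nn.ModuleDict({
            f"{j}->{k}": nn.Linear(channels, channels)
            for k in range(2, 2 + n_nodes) for j in range(k)
        })

    def forward(self, in1, in2):
        states = [in1, in2]
        for k in range(2, 2 + self.n_nodes):
            states.append(sum(self.edges[f"{j}->{k}"](states[j])
                              for j in range(k)))
        # Output node: concatenation of all intermediate nodes.
        return torch.cat(states[2:], dim=-1)

cell = Cell(channels=8)
out = cell(torch.randn(1, 8), torch.randn(1, 8))
print(out.shape)  # torch.Size([1, 32])
```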
  • The macro structure is designed with two cell structures: the normal cell and the reduction cell. The reduction cells are fixed at 2 layers, located at 1/3 and 2/3 of the depth of the entire network respectively, and the remaining cells are normal cells.
  • The application examples in the embodiments of the present application are introduced by taking the same macro structure as the DARTS method as an example; the macro structures described below all use the above-mentioned topology, which will not be repeated.
  • Based on the above search space, the search algorithm generates the final micro structure, where all normal cells share the same topology and corresponding operations, and all reduction cells share the same topology and corresponding operations.
  • Since both the convolution operation and the pooling operation depend on future information (relative to the current moment), the normal cells and reduction cells in the network structure generated by the NAS algorithm each introduce delay. For different tasks, the number of cell layers changes, and the delay changes accordingly; based on the above principle, the delay of the generated network structure increases as the number of network layers increases.
  • In the related NAS solution, the search space is mainly based on convolutional neural networks, and the input speech feature is a feature map (which can be understood as a picture); that is, the speech feature is the FBank second-order difference feature (a 40-dimensional feature map), whose width (resolution) corresponds to the 40 feature dimensions and whose height corresponds to the length of speech (the number of frames).
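  • As a hedged illustration, FBank features with difference (delta) coefficients can be extracted with torchaudio as sketched below; torchaudio is one possible toolchain and is not prescribed by the present application:

```python
import torch
import torchaudio

# One second of dummy 16 kHz audio, shape (1, samples).
waveform = torch.randn(1, 16000)

# 40-dimensional log mel filterbank (FBank) features,
# 25 ms frames with a 10 ms shift (i.e., a 15 ms overlap).
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=40, frame_length=25.0, frame_shift=10.0)

# First- and second-order differences (deltas), stacked with the
# static features to form the input "feature map".
static = fbank.t().unsqueeze(0)                        # (1, 40, n_frames)
delta1 = torchaudio.functional.compute_deltas(static)
delta2 = torchaudio.functional.compute_deltas(delta1)
features = torch.cat([static, delta1, delta2], dim=0)
print(features.shape)  # torch.Size([3, 40, 98]): width 40, height = frames
```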
  • FIG. 6 shows a schematic diagram of a convolution operation involved in an embodiment of the present application.
  • As shown in Figure 6, the bottom row is the input (one frame per column), the middle rows are the hidden layers (each layer applies a 3*3 convolution operation), and the top row is the output.
  • the dots with pattern filling on the left are the padding frames.
  • Figure 6 shows a schematic diagram of applying 3 layers of 3*3 convolution operations.
  • The unfilled dot in the Output layer is the output of the first frame, and the coverage of the solid arrows in the Input layer is all of the information it depends on; that is, the next three frames of input information are required to compute it.
  • The logic of other candidate operations is similar, and the dependence on future information increases as the number of hidden layers increases.
  • FIG. 7 shows a schematic diagram of another convolution operation involved in the embodiment of the present application.
  • As shown in Figure 7, the input speech data goes through two hidden layers: the first hidden layer contains a 3*3 convolution operation, and the second hidden layer contains a 5*5 convolution operation. The first 3*3 convolution operation needs to use the information of one historical frame and one future frame to calculate the output of the current frame; for the second 5*5 convolution operation, whose input is the output of the first hidden layer, the information of two historical frames and two future frames is needed to calculate the output of the current frame.
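  • The future-frame dependency described above can be computed per layer: a symmetric convolution with kernel size k looks (k - 1) / 2 frames ahead, and the lookahead of stacked layers accumulates. A minimal sketch reproducing the figures' numbers:

```python
def total_lookahead(kernel_sizes):
    """Future frames needed by a stack of symmetric (non-causal) convolutions.

    Each k x k convolution centered on the current frame needs (k - 1) // 2
    future frames, and the lookahead of stacked layers accumulates.
    """
    return sum((k - 1) // 2 for k in kernel_sizes)

print(total_lookahead([3, 5]))     # Figure 7's two layers: 3 future frames
print(total_lookahead([3, 3, 3]))  # Figure 6's three 3*3 layers: 3 future frames
```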
  • the embodiment of the present application proposes a latency-controlled NAS algorithm.
  • The algorithm shown in the embodiment of the present application proposes a delay-controlled cell structure, the latency-free cell, which replaces the normal cell; that is, the macro structure of the new algorithm consists of both latency-free cells and reduction cells.
  • The latency-free cell structure is designed as a delay-free structure; that is, the cell itself does not cause any delay.
  • The advantage of this structure design is that when the searched network structure is migrated to various tasks, increasing or decreasing the number of latency-free cells does not change the delay of the entire network; the delay is completely determined by the fixed number of reduction cells, so the delay can be controlled while being reduced.
  • The implementation scheme of the latency-free cell structure is to design the candidate operations in the cell (that is, the operation space, such as the convolution operation, the pooling operation, etc.) in a delay-free operation mode.
  • For example, a delay-free design scheme can change the convolution operation from a traditional convolution to a causal convolution. For the operation of traditional convolution, refer to the above-mentioned Figure 6 and Figure 7 and the corresponding descriptions of reliance on future information.
  • FIG. 8 shows a schematic diagram of a causal convolution involved in an embodiment of the present application.
  • As shown in Figure 8, the difference between causal convolution and ordinary convolution is that, for the output of the unfilled dot in the Output layer, the coverage of the solid arrows in the Input layer shows that the calculation at the current moment depends only on past information and does not rely on future information.
  • Figure 9 shows a schematic diagram of another causal convolution involved in the embodiment of the present application. As shown in Figure 9, compared with the traditional operation, the input of the causal convolution also goes through two hidden layers: the first hidden layer contains a 3*3 convolution operation, and the second hidden layer contains a 5*5 convolution operation. The first 3*3 convolution operation needs to use the information of two historical frames to calculate the output of the current frame; for the second 5*5 convolution operation, whose input is the output of the first hidden layer, the information of four historical frames is needed to calculate the output of the current frame.
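  • A minimal sketch of a causal convolution, assuming a PyTorch implementation (illustrative only, not the code of the present application): padding is applied only on the left (past) side, so each output frame depends only on the current and past frames:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t depends only on inputs <= t."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1          # pad the past side only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                        # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))         # no right (future) padding
        return self.conv(x)

# Changing a future frame must not change earlier outputs:
conv = CausalConv1d(1, 1, kernel_size=3).eval()
x = torch.randn(1, 1, 10)
y1 = conv(x)
x[0, 0, 9] += 1.0                                # perturb the last frame
y2 = conv(x)
print(torch.allclose(y1[..., :9], y2[..., :9]))  # True
```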
  • In the embodiment of the present application, the macro structure consists of latency-free cells and reduction cells, and the micro structure of the latency-free cell consists of delay-free candidate operations that form the search space. In this way, the delay of the model is determined only by the fixed number of reduction cells, so a low-latency streaming speech recognition model network structure can be generated.
  • The application example in the embodiment of the present application uses the bi-chain-styled cell structure as the implementation scheme; optionally, it can also be extended to more structures in the following ways: at the macro structure level, based on the cell structure design, the links between cells can also be chain-styled, densely-connected, and so on.
  • At the micro structure level, for the design of delay-free candidate operations, the application example of the embodiment of this application uses the causal method; optionally, the delay-free candidate operations can also be realized in a mask-based manner. For example, the above convolution operation can be implemented as a convolution operation based on a pixel convolutional neural network (Pixel CNN).
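  • A hedged sketch of the mask-based alternative mentioned above: keep a standard symmetric convolution but zero out the kernel taps that would read future frames, in the spirit of PixelCNN-style masking, so that the effective operation becomes delay-free:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv1d(nn.Conv1d):
    """Symmetric convolution with the future-facing kernel taps masked to zero."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        mask = torch.ones(1, 1, kernel_size)
        mask[..., kernel_size // 2 + 1:] = 0.0   # zero the taps right of center
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv1d(x, self.weight * self.mask, self.bias,
                        padding=self.padding[0])

conv = MaskedConv1d(1, 1, kernel_size=5).eval()
x = torch.randn(1, 1, 10)
y1 = conv(x)
x[0, 0, 9] += 1.0                                # perturb the last frame
print(torch.allclose(y1[..., :9], conv(x)[..., :9]))  # True
```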
  • Step 403: The model training device constructs a speech recognition model based on the network search model.
  • the speech recognition model is used to process the input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
  • When the purpose of performing the model search on the initial network is to construct an acoustic model with high accuracy, the model training device can construct an acoustic model based on the network search model, where the acoustic model is used to process the streaming speech data to obtain acoustic recognition information of the streaming speech data; a speech recognition model is then constructed based on the acoustic model and the decoding graph.
  • A speech recognition model usually includes an acoustic model and a decoding graph, where the acoustic model is used to identify acoustic recognition information, such as phonemes or syllables, from the input speech data, and the decoding graph is used to obtain the corresponding recognition text according to the acoustic recognition information identified by the acoustic model.
  • The decoding graph usually includes, but is not limited to, a phoneme/syllable dictionary and a language model. The phoneme/syllable dictionary usually contains a mapping from characters or words to phoneme/syllable sequences; based on an input phoneme/syllable sequence, the phoneme/syllable dictionary can output the corresponding characters or words. The phoneme/syllable dictionary has nothing to do with the domain of the text and is a common part of different recognition tasks.
  • The language model is usually converted from an n-gram language model, which is used to calculate the probability of a sentence appearing and is trained using training data and statistical methods.
  • Texts in different domains, such as news texts and spoken dialogues, differ greatly in common words and word collocations; therefore, when performing speech recognition in different domains, the language model can be changed to achieve adaptation.
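  • For illustration, the statistical idea behind an n-gram language model can be sketched with a toy bigram model over a tiny hypothetical corpus (real systems additionally apply smoothing):

```python
from collections import Counter

corpus = [["turn", "on", "the", "light"],
          ["turn", "off", "the", "light"],
          ["turn", "on", "the", "music"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_prob(sentence):
    """P(sentence) ~= product of P(w_i | w_{i-1}) estimated from counts."""
    p = 1.0
    for a, b in zip(sentence, sentence[1:]):
        p *= bigrams[(a, b)] / unigrams[a]
    return p

print(bigram_prob(["turn", "on", "the", "light"]))  # (2/3)*(2/2)*(2/3) ~= 0.444
```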
  • In the embodiment of the present application, the delay of the neural network structure obtained by the search is determined only by the fixed number of reduction cells, so the model delay after migration does not change as the number of cell layers in the model structure changes. This matters especially for large-scale speech recognition tasks, where the migrated model structure is very complex (with a large number of cell layers) and it is difficult for traditional NAS algorithms to effectively control the delay.
  • The design of the new algorithm ensures that the delay of the migrated model structure is fixed, making it suitable for various speech recognition tasks, including large-scale speech recognition tasks; that is, the application example of this application can generate a low-latency streaming recognition model network structure for large-scale speech recognition tasks.
  • Step 404: The speech recognition device receives streaming speech data.
  • After the above speech recognition model is constructed, it can be deployed to a speech recognition device to perform the task of recognizing streaming speech.
  • the speech acquisition device in the streaming speech recognition scenario can continuously collect the streaming speech and input it to the speech recognition device.
  • Step 405: The speech recognition device processes the streaming speech data through the speech recognition model to obtain speech recognition text corresponding to the streaming speech data.
  • In a possible implementation, the speech recognition model includes an acoustic model and a decoding graph, and the acoustic model is constructed based on the network search model. The speech recognition device can process the streaming speech data through the acoustic model to obtain acoustic identification information of the streaming speech data, where the acoustic identification information includes phonemes, syllables or semi-syllables; the acoustic recognition information of the streaming speech data is then processed through the decoding graph to obtain the speech recognition text.
  • That is, the speech recognition device can process the streaming speech data through the acoustic model in the speech recognition model to obtain the acoustic recognition information, and then input the acoustic recognition information into the decoding graph composed of the speech dictionary, the language model, and the like for decoding, to obtain the corresponding speech recognition text.
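  • A hedged sketch of this two-stage flow, where the acoustic model and the decoding graph are placeholder callables standing in for the searched acoustic model and the dictionary-plus-language-model decoding graph; the pinyin-to-character mapping is a toy example:

```python
def recognize_stream(frames, acoustic_model, decode_graph):
    """Streaming flow: frames -> acoustic units (e.g., syllables) -> text.

    acoustic_model and decode_graph are placeholder callables; results
    are emitted incrementally as frames arrive.
    """
    for frame in frames:                 # frames arrive incrementally
        units = acoustic_model(frame)    # e.g., phonemes/syllables
        text = decode_graph(units)
        if text:                         # emit partial results in real time
            yield text

# Toy stand-ins: identity "acoustic model" and a dictionary "decoder".
for partial in recognize_stream(["ni3", "hao3"],
                                lambda f: [f],
                                lambda u: {"ni3": "你", "hao3": "好"}.get(u[0])):
    print(partial)  # prints 你 then 好
```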
  • Step 406: The speech recognition device outputs the speech recognition text.
  • The speech recognition text can be used in subsequent processing; for example, the speech recognition text or its translated text is displayed as subtitles, or the translation of the speech recognition text is played back via text-to-speech, and so on.
  • To sum up, in the solution shown in the embodiment of the present application, the specified operations that need to depend on context information are set to specified operations that do not depend on future data, and a neural network structure search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with higher accuracy can be found through the neural network structure search, the above solution can reduce the recognition delay in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
  • FIG. 10 is a schematic diagram of a model construction and speech recognition framework according to an exemplary embodiment.
  • As shown in FIG. 10, in the model training stage, the model training device first reads the preset operation space 1012 from the operation space storage 1011 (in which the specified operations are designed not to depend on future data) and reads the preset speech training samples (including speech samples and corresponding syllable information), and then, according to the preset voice training samples and the preset operation space 1012, performs a neural network structure search on the preset initial network 1013 (such as the network shown in FIG. 5 above) to obtain a network search model 1014.
  • the model training device builds an acoustic model 1015 based on the network search model 1014.
  • The input of the acoustic model 1015 can be the speech data and the historical recognition results (e.g., syllables) of the speech data, and the output is the predicted syllables of the current speech data.
  • The model training device constructs a speech recognition model 1017 based on the above-mentioned acoustic model 1015 and the preset decoding graph 1016, and deploys the speech recognition model 1017 into the speech recognition device.
  • In the model application stage, the speech recognition device obtains the streaming speech data 1018 collected by the speech acquisition device and, after segmenting the streaming speech data 1018, inputs the segmented speech frames into the speech recognition model 1017, which performs recognition to obtain the speech recognition text 1019 and outputs it, so that operations such as presentation, translation, or natural language processing can be performed on the speech recognition text 1019.
  • Fig. 11 is a block diagram showing the structure of a speech recognition apparatus according to an exemplary embodiment.
  • the speech recognition apparatus can implement all or part of the steps in the method provided by the embodiment shown in FIG. 2 or FIG. 4 , and the speech recognition apparatus includes:
  • the voice data receiving module 1101 is used for receiving streaming voice data.
  • The speech data processing module 1102 is configured to process the streaming speech data through a speech recognition model to obtain speech recognition text corresponding to the streaming speech data; the speech recognition model is obtained by performing a neural network structure search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed to be independent of future data.
  • In a possible implementation, the initial network includes n unit networks, the n unit networks include at least one first unit network, and the first unit network includes an input node, an output node, and at least one feature aggregation node connected by the first type of operator.
  • In a possible implementation, the n unit networks are connected by at least one of the following connection manners: a dual-link mode, a single-link mode, and a dense-link mode.
  • In a possible implementation, the n unit networks include at least one second unit network, and the second unit network includes an input node, an output node, and at least one feature aggregation node connected by a second type of operator; the second operation space corresponding to the second type of operator contains specified operations that depend on future data, and a combination of one or more operations in the second operation space is used to realize the second type of operator.
  • the topology structure and network parameters are shared among at least one of the first unit networks, and the topology structure and network parameters are shared among at least one of the second unit networks.
  • In a possible implementation, the specified operations that are designed to be independent of future data are causality-based specified operations; alternatively, the specified operations that are designed to be independent of future data are mask-based specified operations.
  • the feature aggregation node is configured to perform at least one of a summation operation, a concatenation operation, and a product operation on the input data.
  • the specified operation includes at least one of a convolution operation, a pooling operation, an operation based on a long short-term memory artificial neural network LSTM, and an operation based on a gated recurrent unit GRU.
  • In a possible implementation, the speech recognition model includes an acoustic model and a decoding graph, where the acoustic model is constructed based on a network search model, and the network search model is obtained by performing a neural network structure search on the initial network through speech training samples;
  • the voice data processing module 1102 is used to:
  • process the streaming voice data through the acoustic model to obtain acoustic identification information of the streaming voice data, where the acoustic identification information includes phonemes, syllables or semi-syllables; and
  • process the acoustic recognition information of the streaming speech data through the decoding graph to obtain the speech recognition text.
  • To sum up, in the solution shown in the embodiment of the present application, the specified operations that need to depend on context information are set not to depend on future data, and a neural network structure search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with higher accuracy can be found through the neural network structure search, the above solution can reduce the recognition delay in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
  • Fig. 12 is a block diagram showing the structure of a speech recognition apparatus according to an exemplary embodiment.
  • the speech recognition apparatus can implement all or part of the steps in the method provided by the embodiment shown in FIG. 3 or FIG. 4 , and the speech recognition apparatus includes:
  • a sample acquisition module 1201 configured to acquire a voice training sample, where the voice training sample includes a voice sample and a voice recognition label corresponding to the voice sample;
  • the network search module 1202 is configured to perform a neural network structure search on the initial network based on the voice training samples to obtain a network search model;
  • the initial network includes a plurality of feature aggregation nodes connected by a first type of operator, the operation space corresponding to the first type of operator is a first operation space, and specified operations in the first operation space that depend on context information are designed not to depend on future data;
  • the model building module 1203 is used for building a speech recognition model based on the network search model; the speech recognition model is used for processing the input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
  • the speech recognition tag includes acoustic identification information of the speech sample;
  • the acoustic identification information includes phonemes, syllables or semi-syllables;
  • The model building module 1203 is configured to: construct an acoustic model based on the network search model, where the acoustic model is used to process the streaming voice data to obtain the acoustic identification information of the streaming voice data; and construct the speech recognition model based on the acoustic model and the decoding graph.
  • in the above solution, the specified operations that need to depend on context information are set not to depend on future data, and a neural architecture search is then performed on the initial network to construct the speech recognition model; since specified operations that do not depend on future data are introduced into the model, and a model structure with high accuracy can be found through the neural architecture search, the above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
  • Fig. 13 is a schematic structural diagram of a computer device according to an exemplary embodiment.
  • the computer device may be implemented as the model training device and/or the speech recognition device in each of the above method embodiments.
  • the computer device 1300 includes a central processing unit 1301, a system memory 1304 including a random access memory (RAM) 1302 and a read-only memory (ROM) 1303, and a system bus 1305 connecting the system memory 1304 to the central processing unit 1301.
  • the computer device 1300 also includes a basic input/output system 1306 that facilitates the transfer of information between various components within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
  • the mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305 .
  • the mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300 . That is, the mass storage device 1307 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
  • the computer-readable media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, flash memory, or other solid state storage technology, CD-ROM, or other optical storage, magnetic tape cartridges, magnetic tape, magnetic disk storage, or other magnetic storage devices.
  • the computer device 1300 may be connected to the Internet or other network devices through a network interface unit 1311 connected to the system bus 1305 .
  • the memory also includes at least one computer instruction stored in the memory, and the processor implements all or part of the steps of the method shown in FIG. 2, FIG. 3, or FIG. 4 by loading and executing the at least one computer instruction.
  • in an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as a memory including a computer program (instructions) executable by a processor of a computer device to complete the methods shown in the embodiments of the present application;
  • the non-transitory computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • a computer program product or computer program is also provided, comprising computer instructions stored in a computer-readable storage medium;
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods shown in the foregoing embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech recognition method and apparatus, a computer device, and a storage medium. The method includes: receiving streaming speech data (21); processing the streaming speech data by a speech recognition model to obtain speech recognition text, where the speech recognition model is obtained by performing a neural architecture search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by first-type operators, the operation space corresponding to the first-type operators is a first operation space, and the specified operations in the first operation space that depend on context information are designed not to depend on future data (22); and outputting the speech recognition text (23). The above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.

Description

Speech recognition method and apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 202110036471.8, entitled "Speech recognition method and apparatus, computer device, and storage medium" and filed on January 12, 2021, the entire contents of which are incorporated by reference in the embodiments of this application.
Technical Field
This application relates to the technical field of speech recognition, and in particular to a speech recognition method and apparatus, a computer device, and a storage medium.
Background
Speech recognition is a technology that recognizes speech as text, and it is widely applied in various artificial intelligence (AI) scenarios.
In the related art, to ensure the accuracy of speech recognition, a speech recognition model needs to refer to the context information of the speech while recognizing the input speech; that is, when recognizing speech data, both the historical information and the future information of the speech data must be combined for recognition.
In the above technical solution, because the speech recognition model introduces future information into the speech recognition process, a certain delay is incurred, which limits the application of the speech recognition model in streaming speech recognition.
Summary
The embodiments of this application provide a speech recognition method and apparatus, a computer device, and a storage medium, which can reduce the recognition latency in streaming speech recognition scenarios and improve the effect of streaming speech recognition. The technical solution is as follows:
In one aspect, an embodiment of this application provides a speech recognition method for a computer device, the method including:
receiving streaming speech data;
processing the streaming speech data by a speech recognition model to obtain speech recognition text corresponding to the streaming speech data, where the speech recognition model is obtained by performing a neural architecture search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by first-type operators; the operation space corresponding to the first-type operators is a first operation space; and the specified operations in the first operation space that depend on context information are designed not to depend on future data; and
outputting the speech recognition text.
In another aspect, an embodiment of this application provides a speech recognition method for a computer device, the method including:
acquiring speech training samples, where the speech training samples include speech samples and speech recognition labels corresponding to the speech samples;
performing a neural architecture search on an initial network based on the speech training samples to obtain a network search model, where the initial network includes a plurality of feature aggregation nodes connected by first-type operators; the operation space corresponding to the first-type operators is a first operation space; and the specified operations in the first operation space that depend on context information are designed not to depend on future data; and
constructing a speech recognition model based on the network search model, where the speech recognition model is configured to process input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
In yet another aspect, an embodiment of this application provides a speech recognition apparatus, including:
a speech data receiving module, configured to receive streaming speech data;
a speech data processing module, configured to process the streaming speech data by a speech recognition model to obtain speech recognition text corresponding to the streaming speech data, where the speech recognition model is obtained by performing a neural architecture search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by first-type operators; the operation space corresponding to the first-type operators is a first operation space; and the specified operations in the first operation space that depend on context information are designed not to depend on future data; and
a text output module, configured to output the speech recognition text.
In yet another aspect, an embodiment of this application provides a speech recognition apparatus, including:
a sample acquisition module, configured to acquire speech training samples, where the speech training samples include speech samples and speech recognition labels corresponding to the speech samples;
a network search module, configured to perform a neural architecture search on an initial network based on the speech training samples to obtain a network search model, where the initial network includes a plurality of feature aggregation nodes connected by first-type operators; the operation space corresponding to the first-type operators is a first operation space; and the specified operations in the first operation space that depend on context information are designed not to depend on future data; and
a model construction module, configured to construct a speech recognition model based on the network search model, where the speech recognition model is configured to process input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
In still another aspect, an embodiment of this application provides a computer device including a processor and a memory, where the memory stores at least one computer instruction, and the at least one computer instruction is loaded and executed by the processor to implement the above speech recognition method.
In another aspect, an embodiment of this application provides a computer-readable storage medium storing at least one computer instruction, where the at least one computer instruction is loaded and executed by a processor to implement the above speech recognition method.
In another aspect, an embodiment of this application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, so that the computer device performs the above speech recognition method.
In the operation space corresponding to the first-type operators of the initial network, the specified operations that need to depend on context information are set not to depend on future data, and a neural architecture search is then performed on the initial network to construct a speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with high accuracy can be found through the neural architecture search, the above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
Brief Description of the Drawings
Fig. 1 is a framework diagram of model search and speech recognition according to an exemplary embodiment;
Fig. 2 is a schematic flowchart of a speech recognition method according to an exemplary embodiment;
Fig. 3 is a schematic flowchart of a speech recognition method according to an exemplary embodiment;
Fig. 4 is a schematic flowchart of a speech recognition method according to an exemplary embodiment;
Fig. 5 is a schematic diagram of a network structure involved in the embodiment shown in Fig. 4;
Fig. 6 is a schematic diagram of a convolution operation involved in the embodiment shown in Fig. 4;
Fig. 7 is a schematic diagram of another convolution operation involved in the embodiment shown in Fig. 4;
Fig. 8 is a schematic diagram of a causal convolution involved in the embodiment shown in Fig. 4;
Fig. 9 is a schematic diagram of another causal convolution involved in the embodiment shown in Fig. 4;
Fig. 10 is a schematic diagram of a model construction and speech recognition framework according to an exemplary embodiment;
Fig. 11 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment;
Fig. 12 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment;
Fig. 13 is a schematic structural diagram of a computer device according to an exemplary embodiment.
Detailed Description
Before describing the embodiments shown in this application, several concepts involved in this application are first introduced:
1) Artificial Intelligence (AI)
Artificial intelligence is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
2) Neural Architecture Search (NAS)
Neural architecture search is a strategy of designing neural networks with algorithms; that is, when the length and structure of a network are undetermined, a search space is set manually, and a network structure that performs best on a validation set is found from the search space according to a designed search strategy.
In terms of composition, neural architecture search includes three parts: the search space, the search strategy, and performance estimation. In terms of implementation, it is divided into reinforcement-learning-based NAS, genetic-algorithm-based NAS (also called evolution-based NAS), and differentiable NAS (also called gradient-based NAS).
Reinforcement-learning-based NAS uses a recurrent neural network as a controller to generate sub-networks, then trains and evaluates the sub-networks to obtain their performance (such as accuracy), and finally updates the parameters of the controller. However, the performance of a sub-network is non-differentiable, so the controller cannot be optimized directly; it can only be updated with reinforcement learning, using policy-gradient methods. Limited by the discrete nature of this optimization, such methods are computationally expensive: to fully exploit the "potential" of each sub-network, every time the controller samples a sub-network, its weights must be initialized and trained from scratch before its performance can be verified. In contrast, gradient-based differentiable NAS shows a great efficiency advantage. It constructs the entire search space as a super-net and models the training and search process as a bi-level optimization problem. It does not sample an individual sub-network and train it from scratch to verify its performance; since the super-net itself is composed of the set of sub-networks, the accuracy of the current super-net is used to approximate the performance of the currently most probable sub-network. It therefore has extremely high search efficiency and performance, and has gradually become the mainstream neural architecture search method.
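For illustration only (not part of the original disclosure), the following is a minimal sketch of the gradient-based continuous relaxation described above, in the style of differentiable NAS, assuming PyTorch; the candidate operation set, channel count, and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One super-net edge: every candidate operation is applied and the
    outputs are blended by a softmax over architecture parameters alpha."""
    def __init__(self, channels: int):
        super().__init__()
        # Illustrative candidate operation set (the "operation space").
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.AvgPool1d(kernel_size=3, stride=1, padding=1),
            nn.Identity(),  # e.g., a skip/residual connection
        ])
        # One architecture weight per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)
        # Super-net output: softmax-weighted sum of all candidate outputs.
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

During the search, the architecture parameters alpha and the ordinary network weights are optimized alternately, which is the bi-level optimization mentioned above; after convergence, each edge keeps only the candidate operation with the largest alpha.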
3) Super-network
A super-network is the set containing all possible sub-networks in differentiable NAS. Developers can design a large search space, which constitutes a super-net containing multiple sub-networks. Each sub-network can be trained and its performance evaluated; what neural architecture search needs to do is to find the sub-network with the best performance among these sub-networks.
4) Speech Technology (ST)
The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech has become one of the most promising modes of human-computer interaction.
The solutions of the embodiments of this application include a model search stage and a speech recognition stage. Fig. 1 is a framework diagram of model search and speech recognition according to an exemplary embodiment. As shown in Fig. 1, in the model search stage, the model training device 110 performs a neural architecture search on a preset initial network using preset speech training samples, and constructs a speech recognition model with high accuracy based on the search result. In the speech recognition stage, the speech recognition device 120 recognizes the speech recognition text in input streaming speech data according to the constructed speech recognition model.
The initial network may refer to the search space or the super-net in the neural architecture search, and the searched speech recognition model may be a sub-network of the super-net.
The model training device 110 and the speech recognition device 120 may be computer devices with machine learning capabilities. For example, the computer devices may be stationary computer devices such as personal computers and servers, or mobile computer devices such as tablet computers and e-book readers.
Optionally, the model training device 110 and the speech recognition device 120 may be the same device, or they may be different devices. When they are different devices, they may be of the same type (for example, both may be personal computers) or of different types. For example, the model training device 110 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and AI platforms. The speech recognition device 120 may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, or a smart watch. The terminal and the server may be connected directly or indirectly by wired or wireless communication, which is not limited in this application.
In the solutions shown in the embodiments of this application, the application scenarios of the speech recognition model constructed by the model training device through a neural architecture search on the initial network may include, but are not limited to, the following:
1. Online conference scenarios.
Multinational online conferences usually involve speech recognition. For example, for streaming conference speech, the speech recognition model recognizes the speech recognition text and displays it on the screen of the online conference; if necessary, the recognized text can also be translated and then presented (for example, as text or as speech). The speech recognition model of this application enables low-latency speech recognition and thus satisfies real-time speech recognition in online conference scenarios.
2. Video/audio live-streaming scenarios.
Speech recognition is also involved in live streaming; for example, subtitles usually need to be added to the live picture. The speech recognition model of this application can recognize the speech in a live stream with low latency, so subtitles can be generated and added to the live data stream as soon as possible, which is of great significance for reducing live-streaming delay.
3. Real-time translation scenarios.
In many conferences, when the participants speak different languages, dedicated interpreters are often required. The speech recognition model of this application can recognize the participants' speech with low latency, quickly present the recognized text on a display or as translated speech, and thus realize automated real-time translation.
Fig. 2 is a schematic flowchart of a speech recognition method according to an exemplary embodiment. The method may be performed by the speech recognition device in the embodiment shown in Fig. 1. As shown in Fig. 2, the speech recognition method may include the following steps:
Step 21: Receive streaming speech data.
Optionally, the streaming speech data is audio stream data generated by encoding real-time speech, and streaming speech data places a high demand on recognition latency; that is, the delay between inputting the streaming speech data and outputting the speech recognition result must be kept short.
Step 22: Process the streaming speech data by a speech recognition model to obtain speech recognition text corresponding to the streaming speech data. The speech recognition model is obtained by performing a neural architecture search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by first-type operators; the operation space corresponding to the first-type operators is a first operation space; and the specified operations in the first operation space that depend on context information are designed not to depend on future data.
The speech recognition model is a streaming ASR model. Unlike a non-streaming speech recognition model, which must process a complete sentence of audio before feeding back the recognition result, a streaming speech recognition model supports returning recognition results in real time while processing streaming speech data.
The above future data refers to other speech data located, in the time domain, after the speech data currently being recognized. For a specified operation that depends on future data, recognizing the current speech data through that operation requires waiting for the future data to arrive before the recognition of the current speech data can be completed, which causes a certain delay; as such operations increase, the delay in completing the recognition of the current speech data also increases.
For a specified operation that does not depend on future data, the recognition of the current speech data can be completed without waiting for future data to arrive, so no delay caused by waiting for future data is introduced.
In one possible implementation, the specified operation that does not depend on future data refers to an operation that, during feature processing of the speech data, can be completed based on the current speech data and the historical data preceding it.
Step 23: Output the speech recognition text.
In summary, in the solution shown in this embodiment of this application, in the operation space corresponding to the first-type operators of the initial network, the specified operations that need to depend on context information are set not to depend on future data, and a neural architecture search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with high accuracy can be found through the neural architecture search, the above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
Fig. 3 is a schematic flowchart of a speech recognition method according to an exemplary embodiment. The method may be performed by the model training device in the embodiment shown in Fig. 1, and may be a method based on neural architecture search. As shown in Fig. 3, the speech recognition method may include the following steps:
Step 31: Acquire speech training samples, where the speech training samples include speech samples and speech recognition labels corresponding to the speech samples.
Step 32: Perform a neural architecture search on an initial network based on the speech training samples to obtain a network search model. The initial network includes a plurality of feature aggregation nodes connected by first-type operators; the operation space corresponding to the first-type operators is a first operation space; and the specified operations in the first operation space that depend on context information are designed not to depend on future data.
To reduce speech recognition latency, the embodiments of this application improve the traditional NAS scheme: the specified operations (neural network operations) in the operation space that originally depend on both historical data and future data are designed to depend only on historical data; that is, the specified operations are designed in a latency-free manner, so that the subsequent neural architecture search finds a low-latency neural network structure.
Optionally, the first-type operator is obtained by combining at least one operation in the first operation space.
Step 33: Construct a speech recognition model based on the network search model. The speech recognition model is configured to process input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
In summary, in the solution shown in this embodiment of this application, in the operation space corresponding to the first-type operators of the initial network, the specified operations that need to depend on context information are set not to depend on future data, and a neural architecture search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with high accuracy can be found through the neural architecture search, the above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
Fig. 4 is a schematic flowchart of a speech recognition method according to an exemplary embodiment. The method may be performed by a model training device and a speech recognition device, which may be implemented as a single computer device or belong to different computer devices. The method may include the following steps:
Step 401: The model training device acquires speech training samples, where the speech training samples include speech samples and speech recognition labels corresponding to the speech samples.
The speech training samples are a sample set collected in advance by developers. The speech training samples include speech samples and their corresponding speech recognition labels; the speech recognition labels are used for model training and evaluation in the subsequent network architecture search process.
In one possible implementation, the speech recognition label includes acoustic identification information of the speech sample; the acoustic identification information includes phonemes, syllables, or semi-syllables.
In the solution shown in this application, when the purpose of the model search on the initial network is to construct a high-accuracy acoustic model, the speech recognition label may be information corresponding to the output of the acoustic model, such as phonemes, syllables, or semi-syllables.
In one possible implementation, the speech samples may be pre-segmented into several overlapping short-time speech segments (also called speech frames), and each speech frame corresponds to its own phoneme, syllable, or semi-syllable. For example, for speech with a sampling rate of 16 kHz, each frame after segmentation is generally 25 ms long with an overlap of 15 ms between frames; this process is also called "framing".
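For illustration only (not part of the original disclosure), the framing arithmetic above can be sketched as follows, assuming NumPy; the function name and defaults are illustrative assumptions:

```python
import numpy as np

def frame_speech(samples: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 25, overlap_ms: int = 15) -> np.ndarray:
    """Split a waveform into overlapping short-time frames ("framing")."""
    frame_len = sample_rate * frame_ms // 1000            # 25 ms -> 400 samples
    hop = sample_rate * (frame_ms - overlap_ms) // 1000   # 10 ms -> 160 samples
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])

# One second of 16 kHz audio yields 1 + (16000 - 400) // 160 = 98 frames.
```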
Step 402: The model training device performs a neural architecture search on the initial network based on the speech training samples to obtain a network search model.
The initial network includes a plurality of feature aggregation nodes connected by operators. The operators between the feature aggregation nodes include first-type operators, and the specified operations that depend on context information contained in the first operation space corresponding to the first-type operators are designed not to depend on future data. A combination of one or more operations in the first operation space is used to implement the first-type operator; the specified operation is a neural network operation that depends on context information.
In this embodiment of this application, in addition to the specified operations that depend on context information, the first operation space may also contain operations that do not depend on context information, such as a residual connection operation; this embodiment does not limit the types of operations contained in the first operation space.
In one possible implementation, the initial network includes n unit networks, the n unit networks include at least one first unit network, and the first unit network includes an input node, an output node, and at least one feature aggregation node connected by the first-type operators.
In an exemplary solution, the initial network may be divided into unit networks, and each unit network includes an input node, an output node, and one or more feature aggregation nodes between the input node and the output node.
The search spaces of the unit networks in the initial network may be the same or different.
In one possible implementation, the n unit networks are connected by at least one of the following connection modes:
bi-chain-styled, chain-styled, and densely-connected.
In an exemplary solution, the unit networks in the initial network are connected by a preset connection mode, and the connection modes between different unit networks may be the same or different.
In the solution shown in this embodiment of this application, the connection mode between the unit networks in the initial network is not limited.
In one possible implementation, the n unit networks include at least one second unit network, and the second unit network includes an input node, an output node, and at least one feature aggregation node connected by second-type operators; the second operation space corresponding to the second-type operators contains the specified operations that depend on future data; a combination of one or more operations in the second operation space is used to implement the second-type operator.
Optionally, in addition to the above specified operations that do not depend on future information (low-latency / latency-controllable), the search space of the initial network may also contain some specified operations that depend on future information (high-latency / latency-uncontrollable), namely the above specified operations that depend on future data, so that the future information of the current speech data can still be exploited while the speech recognition latency is reduced, thereby ensuring recognition accuracy.
In one possible implementation, the at least one first unit network shares a topology, or the at least one first unit network shares a topology and network parameters; the at least one second unit network shares a topology, or the at least one second unit network shares a topology and network parameters.
In an exemplary solution, when the initial network is divided into unit networks of two or more different types, in order to reduce the complexity of the network search, the topology and the network parameters may be shared among unit networks of the same type during the search.
In other possible implementations, during the search, the topology, or the network parameters, may be shared among unit networks of the same type.
In other possible implementations, the topology and network parameters may also be shared among only some of the unit networks of the same type. For example, assuming the initial network contains four first unit networks, two of them may share one set of topology and network parameters while the other two share another set.
In other possible implementations, the unit networks in the initial network may also not share network parameters.
In one possible implementation, the specified operation designed not to depend on future data is a causal specified operation;
or,
the specified operation designed not to depend on future data is a mask-based specified operation.
Making a specified operation independent of future data can be achieved in a causal manner or in a mask-based manner. Of course, besides the causal and mask-based approaches, other possible approaches may be adopted, which is not limited in this embodiment of this application.
In one possible implementation, the feature aggregation node is configured to perform at least one of a summation operation, a concatenation operation, and a product operation on the input data.
In an exemplary solution, the operation corresponding to each feature aggregation node in the initial network may be fixed to one operation, for example, fixed to the summation operation.
Alternatively, in other possible implementations, the feature aggregation nodes may be set to different operations; for example, some feature aggregation nodes perform summation while others perform concatenation.
Alternatively, in other possible implementations, the feature aggregation nodes may not be fixed to a specific operation; the operation corresponding to each feature aggregation node may be determined during the neural architecture search. The three aggregation alternatives are sketched below.
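For illustration only (not part of the original disclosure), a minimal sketch of the three aggregation alternatives named above, assuming PyTorch tensors of shape (batch, channels, time) with matching sizes; the function and its signature are illustrative assumptions:

```python
import torch

def aggregate(inputs: list, mode: str = "sum") -> torch.Tensor:
    """Feature aggregation node: combine the outputs of incoming operators."""
    if mode == "sum":
        return torch.stack(inputs).sum(dim=0)
    if mode == "concat":
        return torch.cat(inputs, dim=1)          # concatenate along channels
    if mode == "product":
        out = inputs[0]
        for t in inputs[1:]:
            out = out * t                        # element-wise product
        return out
    raise ValueError(f"unknown aggregation mode: {mode}")
```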
In one possible implementation, the specified operation includes at least one of a convolution operation, a pooling operation, an operation based on a long short-term memory (LSTM) network, and an operation based on a gated recurrent unit (GRU). Alternatively, the specified operation may also include other convolutional neural network operations that depend on context information; this embodiment of this application does not limit the operation type of the specified operation.
In this embodiment of this application, the model training device performs a neural architecture search based on the initial network to determine a network search model with high accuracy. During the search, the model training device uses the speech training samples to train and evaluate the sub-networks in the initial network by machine learning, so as to determine information such as whether each feature aggregation node in the initial network is retained, whether each operator between the retained feature aggregation nodes is retained, the operation type corresponding to each retained operator, and the parameters of each operator and feature aggregation node, so that a sub-network with a suitable topology and satisfactory accuracy is determined from the initial network as the network search model obtained by the search.
Referring to Fig. 5, which shows a schematic diagram of a network structure involved in this embodiment of this application. As shown in Fig. 5, taking the traditional cell-based neural architecture search (NAS) method as an example, Fig. 5 shows a schematic diagram of a NasNet-based search space, in which the cells (unit networks) of the macro part 51 are connected in the bi-chain-styled manner, and the node structure of the micro part 52 is op_type (operation type) + connection (connection point).
The solution shown in this embodiment of this application is based on the topology shown in Fig. 5, and the following description of the search space takes this topology as an example. As shown in Fig. 5, the construction of the search space is usually divided into two steps: the macro architecture and the micro architecture.
In the macro structure, the connection mode is bi-chain-styled: the input of each cell is the outputs of the previous two cells. The connection mode is a fixed, manually designed topology and does not participate in the search. The number of cell layers is variable; it may differ between the search stage and the evaluation stage (which is based on the found structure), and may also differ for different tasks.
Note that in some NAS algorithms, the connection mode of the macro structure may also participate in the search, that is, a non-fixed bi-chain-styled connection; this embodiment of this application does not limit this.
The micro structure is the topology within a cell. As shown in Fig. 5, it can be regarded as a directed acyclic graph, in which nodes IN(1) and IN(2) are the input nodes of the cell, and node1, node2, node3, and node4 are intermediate nodes corresponding to the above feature aggregation nodes (their number is variable). The input of each node is the outputs of all preceding nodes; that is, the input of node1 is IN(1) and IN(2), the input of node2 is IN(1), IN(2), and node1, and so on. Node OUT is the output node, whose input is the outputs of all intermediate nodes.
Based on the connection relations in the above initial model, the NAS algorithm searches out an optimal connection relation (i.e., topology). A fixed candidate operation set (i.e., operation space) is predefined between every two nodes, for example operations such as 3x3 convolution and 3x3 average pooling, each used to process the input of a node. After the candidate operations process the input, a predefined set of summarization functions (i.e., the various feature aggregation operations) is applied, such as sum, concat, and product. When performing the neural architecture search based on the training samples, the NAS algorithm retains one optimal candidate operation/function out of all the candidate operations/functions. Note that the application example in this solution may fix summarization function = sum and search only the topology within the cell and the candidate operations; the following description of the search algorithm takes this search space as an example. Optionally, the summarization function may also be fixed to another function, or not fixed at all.
In streaming speech recognition tasks, it is difficult for traditional NAS methods to generate a low-latency streaming speech recognition model network structure. Taking the DARTS-based search space as an example, the macro structure is designed with two kinds of cell structures:
the normal cell, whose output keeps the same time-frequency-domain resolution as the input; and the reduction cell, whose output time-frequency-domain resolution is half that of the input.
The reduction cells are fixed at two layers, located at 1/3 and 2/3 of the entire network, and all other positions are normal cells. The application example shown in this embodiment of this application takes the case where the macro structure is the same as in the DARTS method; the following descriptions of the macro structure all refer to this topology and are not repeated. Based on the above search space, the search algorithm generates the final micro structure, in which the normal cells share the same topology and corresponding operations, and the reduction cells share the same topology and corresponding operations. In the DARTS-based search space, since both convolution and pooling operations depend on future information (relative to the current moment), the normal cells and reduction cells in the network structure generated by the NAS algorithm each produce latency. For different tasks, the number of normal cell layers changes, and the latency changes accordingly; by this principle, the latency of the generated network structure increases as the number of network layers increases. To describe the latency concept more clearly, suppose that in the generated network structure the latency of a normal cell is 4 frames and the latency of a reduction cell is 6 frames. Then the network latency of 5 layers of cells = 4 + 6 + 2*(4 + 6 + 2*(4)) = 46 frames, where the factor 2 in the formula is the multiplicative factor introduced by the halving of the time-frequency-domain resolution in the reduction cell. Further, the network latency of 8 layers of cells = (4+4) + 6 + 2*((4+4) + 6 + 2*(4+4)) = 74 frames, and so on. Obviously, as the number of cell layers increases, the latency of the entire network grows rapidly.
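To make the latency arithmetic explicit, the following short helper (illustration only, plain Python; the per-cell latencies of 4 and 6 frames are the example values from the preceding paragraph) reproduces both figures:

```python
def network_latency(cells, normal=4, reduction=6):
    """Total latency in input frames for a stack of cells, ordered from
    input to output; each reduction cell halves the time resolution, so
    all latency incurred after it counts double in input frames."""
    latency, factor = 0, 1
    for cell in cells:
        latency += factor * (normal if cell == "N" else reduction)
        if cell == "R":
            factor *= 2
    return latency

print(network_latency(["N", "R", "N", "R", "N"]))                 # 46
print(network_latency(["N", "N", "R", "N", "N", "R", "N", "N"]))  # 74
```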
To clearly understand the concept of speech latency in NAS algorithms, the implementation of a specified operation is introduced below, taking the convolution operation in a convolutional neural network as an example. In the application example involved in this embodiment of this application, the search space consists mainly of convolutional neural networks, and the input speech feature is a feature map (which can be understood as a picture): the speech features are FBank second-order difference features (40-dimensional log Mel-filterbank features with the first-order and the second-order derivatives), where the first-order and second-order difference features correspond to additional channels (the channel concept in pictures). In the feature map of the speech features, the width corresponds to the frequency-domain resolution (40 dimensions) and the height corresponds to the length of the speech (the number of frames).
When a speech feature map is processed by traditional candidate operations, it generally depends on future information. Referring to Fig. 6, which shows a schematic diagram of a convolution operation involved in this embodiment of this application. As shown in Fig. 6, taking the 3x3 convolution operation as an example, the first row at the bottom is the input (each column is a frame), the middle rows are hidden layers (each layer undergoes one 3x3 convolution), the top is the output, and the patterned dots on the left are padding frames. Fig. 6 shows three layers of 3x3 convolution operations: the unfilled dot in the output layer is the output of the first frame, and the coverage of the solid arrows in the input layer is all the information it depends on, that is, the input information of three future frames is required. The logic of the other candidate operations is similar, and the dependence on future information increases as the number of hidden layers increases.
More intuitively, referring to Fig. 7, which shows a schematic diagram of another convolution operation involved in this embodiment of this application. As shown in Fig. 7, the input speech data passes through two hidden layers: the first hidden layer contains a 3x3 convolution operation, and the second contains a 5x5 convolution operation. The first 3x3 convolution needs one frame of historical information and one frame of future information to compute the output of the current frame; the second 5x5 convolution, whose input is the output of the first hidden layer, needs two frames of historical information and two frames of future information to compute the output of the current frame.
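The accumulation of the look-ahead across stacked layers, as in Fig. 6 and Fig. 7, can be computed directly (illustration only; assumes 'same'-padded symmetric convolutions with odd kernel sizes):

```python
def future_frames(kernel_sizes):
    """Future frames a stack of symmetric convolutions needs: each k x k
    layer looks (k - 1) // 2 frames ahead, and look-aheads accumulate."""
    return sum((k - 1) // 2 for k in kernel_sizes)

print(future_frames([3, 3, 3]))  # 3 future frames, as in Fig. 6
print(future_frames([3, 5]))     # 1 + 2 = 3 future frames, as in Fig. 7
```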
Based on the above, it is difficult for traditional NAS methods to effectively control the latency of the searched network structure, especially in large-scale speech recognition tasks, where the network structure has more cell layers and the corresponding latency increases linearly. For streaming speech recognition tasks, and to address the problems of traditional NAS algorithms, this embodiment of this application proposes a latency-controlled NAS algorithm. Different from the normal cell and reduction cell design in traditional algorithms, the algorithm shown in this embodiment proposes a latency-controlled cell structure that replaces the normal cell; that is, the macro structure of the new algorithm consists of latency-free cells and reduction cells. The latency-free cell is a latency-free structural design: no matter what topology and candidate operations the NAS algorithm finally finds for the micro structure, the cell itself produces no latency. The advantage of this design is that, when the searched network structure is transferred to various tasks, increasing or decreasing the number of latency-free cells does not change the latency of the entire network; the latency is determined entirely by the fixed number of reduction cells, which achieves controllable latency while reducing latency.
In the application example of this embodiment of this application, the latency-free cell structure is implemented by designing the candidate operations within the cell (i.e., the operation space, such as convolution operations and pooling operations) as latency-free operations.
Taking the convolution operation as an example, the latency-free design may change the convolution from a traditional convolution to a causal convolution. For the traditional convolution, refer to Fig. 6 and Fig. 7 above and the corresponding description of the dependence on future information. Referring to Fig. 8, which shows a schematic diagram of a causal convolution involved in this embodiment of this application. As shown in Fig. 8, the difference between causal convolution and ordinary convolution is that the output of the white-filled dot in the output layer corresponds to the coverage of the solid arrows in the input layer; that is, the computation at the current moment depends only on past information and never on future information. Besides convolution, the other candidate operations that depend on future information (such as pooling) can adopt a similar causal treatment, that is, the computation at the current moment depends only on past information. As another example, referring to Fig. 9, which shows a schematic diagram of another causal convolution involved in this embodiment of this application. As shown in Fig. 9, in contrast to the traditional operations, the input of the causal convolution passes through two hidden layers: the first hidden layer contains a 3x3 convolution operation and the second a 5x5 convolution operation. The first 3x3 convolution needs two frames of historical information to compute the output of the current frame; the second 5x5 convolution, whose input is the output of the first hidden layer, needs four frames of historical information to compute the output of the current frame.
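For illustration only (not part of the original disclosure), a minimal sketch of such a causal convolution, assuming PyTorch; the class name and sizes are illustrative assumptions rather than the exact operations of this application:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left (the past), so the output
    at frame t depends on frames t, t-1, ..., t-(kernel_size-1) and never
    on future frames: a latency-free candidate operation."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad with past context only.
        return self.conv(F.pad(x, (self.left_pad, 0)))

x = torch.randn(1, 40, 100)    # 100 frames of 40-dimensional features
y = CausalConv1d(40, 3)(x)
assert y.shape == x.shape      # same length, zero look-ahead
```

Stacking such layers increases only the dependence on history, never the look-ahead, which is what makes the latency-free cell insensitive to the searched topology.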
In the latency-controlled NAS algorithm proposed in this embodiment of this application, the macro structure consists of latency-free cells and reduction cells, and the micro structure of the latency-free cell forms its search space from latency-free candidate operations. For the neural network structure found by the new algorithm, the latency of the model is determined only by the fixed number of reduction cells, so a low-latency streaming speech recognition model network structure can be generated.
As mentioned above, the application example in this embodiment of this application is implemented with the bi-chain-styled cell structure. Optionally, it can also be extended to more structures in the following ways:
1) For the cell-based design at the macro structure level, the connection modes between cells may also include chain-styled, densely-connected, and so on.
2) At the macro structure level, the structure design is similar to the cell structure.
3) In the micro structure design direction, for the latency-free candidate operation design, the application example of this embodiment uses the causal approach; optionally, the latency-free candidate operation design can also be implemented in a mask-based manner; for example, the above convolution operation can be implemented as a convolution operation based on the Pixel convolutional neural network (Pixel CNN).
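A rough sketch of the mask-based alternative follows (illustration only, assuming PyTorch and an odd kernel size; the class is an illustrative assumption, not the Pixel CNN implementation itself). Zeroing the kernel taps that would read future frames yields the same zero look-ahead as the causal padding above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv1d(nn.Conv1d):
    """'Same'-padded convolution whose future-facing kernel taps are
    zeroed by a fixed mask before every forward pass."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__(channels, channels, kernel_size,
                         padding=kernel_size // 2)
        mask = torch.ones_like(self.weight)
        mask[..., kernel_size // 2 + 1:] = 0.0   # zero the future taps
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv1d(x, self.weight * self.mask, self.bias,
                        padding=self.padding)
```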
Step 403: The model training device constructs a speech recognition model based on the network search model.
The speech recognition model is configured to process input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
In the solution shown in this application, when the purpose of the model search on the initial network is to construct a high-accuracy acoustic model, the model training device may construct an acoustic model based on the network search model, where the acoustic model is configured to process the streaming speech data to obtain acoustic identification information of the streaming speech data, and then construct the speech recognition model based on the acoustic model and a decoding graph.
A speech recognition model usually contains an acoustic model and a decoding graph, where the acoustic model is used to recognize acoustic identification information, such as phonemes and syllables, from the input speech data, and the decoding graph is used to obtain the corresponding recognized text from the acoustic identification information recognized by the acoustic model.
The decoding graph usually includes, but is not limited to, a phoneme/syllable lexicon and a language model. The phoneme/syllable lexicon usually contains mappings from characters or words to phoneme/syllable sequences; for example, given an input syllable sequence, the syllable lexicon can output the corresponding characters or words. Generally, the phoneme/syllable lexicon is independent of the text domain and is a common part across different recognition tasks. The language model is usually converted from an n-gram language model; it is used to compute the probability of a sentence occurring, and is trained with training data and statistical methods. Generally, texts of different domains, such as news texts and spoken dialogue texts, differ considerably in common words and collocations; therefore, when performing speech recognition in different domains, adaptation can be achieved by changing the language model.
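As a toy illustration of how the lexicon and the language model cooperate in decoding (all entries and probabilities below are invented for the example and are not from this application):

```python
# Toy syllable lexicon: syllable sequences -> candidate words.
lexicon = {("ni",): ["你", "泥"], ("hao",): ["好", "号"], ("ni", "hao"): ["你好"]}

# Toy bigram language model: P(word | previous word), with <s> as start.
bigram = {("<s>", "你好"): 0.6, ("<s>", "你"): 0.3, ("你", "好"): 0.5, ("你", "号"): 0.01}

def score(words):
    """Chain-rule probability of a word sequence under the bigram model."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram.get((prev, w), 1e-6)   # crude back-off for unseen pairs
        prev = w
    return p

# Lexicon lookups propose two decodings of the syllables ("ni", "hao");
# the language model prefers 你好 (0.6) over 你 + 好 (0.3 * 0.5 = 0.15).
best = max([("你好",), ("你", "好")], key=score)
```

In practice the lexicon and the language model are typically compiled together into a single search graph; the toy above only shows the scoring principle.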
With the latency-controlled NAS algorithm proposed in this embodiment of this application, the latency of the searched neural network structure is determined only by the fixed number of reduction cells. When the model structure is transferred to various speech recognition application directions, the latency of the transferred model does not change with the number of cell layers in the model structure. Especially for large-scale speech recognition tasks, the transferred model structure is very complex (with many cell layers), and it is difficult for traditional NAS algorithms to control the latency effectively. The design of the new algorithm ensures that the latency of the transferred model structure is fixed and adapts to various speech recognition tasks, including large-scale speech recognition tasks. The application example of this application can generate a low-latency streaming recognition model network structure for large-scale speech recognition tasks.
Step 404: The speech recognition device receives streaming speech data.
After the above speech recognition model is constructed, it can be deployed to the speech recognition device to perform the task of recognizing streaming speech. In a streaming speech recognition task, the speech acquisition device in the streaming speech recognition scenario can continuously collect streaming speech and input it to the speech recognition device.
Step 405: The speech recognition device processes the streaming speech data by the speech recognition model to obtain speech recognition text corresponding to the streaming speech data.
In one possible implementation, the speech recognition model contains an acoustic model and a decoding graph, and the acoustic model is constructed based on the network search model;
the speech recognition device may process the streaming speech data by the acoustic model to obtain acoustic identification information of the streaming speech data, where the acoustic identification information includes phonemes, syllables, or semi-syllables, and then process the acoustic identification information of the streaming speech data by the decoding graph to obtain the speech recognition text.
In this embodiment of this application, when the acoustic model in the above speech recognition model is the model constructed through the neural architecture search in the above steps, during speech recognition, the speech recognition device may process the streaming speech data by the acoustic model in the speech recognition model to obtain corresponding acoustic identification information such as syllables or phonemes, and then input the acoustic identification information into the decoding graph composed of the speech lexicon, the language model, and so on, for decoding, to obtain the corresponding speech recognition text.
Step 406: The speech recognition device outputs the speech recognition text.
In this embodiment of this application, after the speech recognition device outputs the speech recognition text, the text can be applied to subsequent processing, for example, presenting the speech recognition text or its translation as subtitles, or converting the translation of the speech recognition text into speech for playback, and so on.
In summary, in the solution shown in this embodiment of this application, the specified operations in the operation space of the first-type operators of the initial network that need to depend on context information are set as specified operations that do not depend on future data, and a neural architecture search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with high accuracy can be found through the neural architecture search, the above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
Taking the application of the solution shown in Fig. 4 to a streaming speech recognition task as an example, refer to Fig. 10, which is a schematic diagram of a model construction and speech recognition framework according to an exemplary embodiment.
In the model training device, the preset operation space 1012 (in which the specified operations are designed not to depend on future data) is first read from the operation space storage 1011, and preset speech training samples (including speech samples and the corresponding syllable information) are read from the sample set storage. According to the preset speech training samples and the preset operation space 1012, a neural architecture search is performed on the preset initial network 1013 (such as the network shown in Fig. 5 above) to obtain a network search model 1014.
Then, the model training device constructs an acoustic model 1015 based on the network search model 1014. The input of the acoustic model 1015 may be the speech data and the syllables corresponding to the historical recognition results of the speech data, and the output is the predicted syllable of the current speech data.
The model training device constructs a speech recognition model 1017 based on the above acoustic model 1015 and a preset decoding graph 1016, and deploys the speech recognition model 1017 to the speech recognition device.
In the speech recognition device, the speech recognition device acquires the streaming speech data 1018 collected by the speech acquisition device, segments the streaming speech data 1018, inputs the resulting speech frames into the speech recognition model 1017, obtains the speech recognition text 1019 through recognition by the speech recognition model 1017, and outputs the speech recognition text 1019, so that operations such as display, translation, or natural language processing can be performed on the speech recognition text 1019.
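For illustration only, the deployment-side flow just described can be summarized as a schematic loop (plain Python; frame_speech is the framing helper sketched earlier, and acoustic_model and decode_graph stand in for the components of the deployed speech recognition model 1017):

```python
def streaming_recognize(audio_stream, acoustic_model, decode_graph):
    """Schematic streaming loop: frame each arriving chunk, feed the frames
    to the causal acoustic model, and emit partial text after every frame;
    no future audio is needed before a result is returned."""
    syllables = []
    for chunk in audio_stream:               # e.g., 100 ms of samples
        for frame in frame_speech(chunk):    # framing helper from above
            syllables.append(acoustic_model(frame, syllables))
            yield decode_graph(syllables)    # partial recognition text
```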
Fig. 11 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment. The speech recognition apparatus can implement all or part of the steps of the method provided by the embodiment shown in Fig. 2 or Fig. 4, and the speech recognition apparatus includes:
a speech data receiving module 1101, configured to receive streaming speech data;
a speech data processing module 1102, configured to process the streaming speech data by a speech recognition model to obtain speech recognition text corresponding to the streaming speech data, where the speech recognition model is obtained by performing a neural architecture search on an initial network; the initial network includes a plurality of feature aggregation nodes connected by first-type operators; the operation space corresponding to the first-type operators is a first operation space; and the specified operations in the first operation space that depend on context information are designed not to depend on future data; and
a text output module 1103, configured to output the speech recognition text.
In one possible implementation, the initial network includes n unit networks, the n unit networks include at least one first unit network, and the first unit network includes an input node, an output node, and at least one feature aggregation node connected by the first-type operators.
In one possible implementation, the n unit networks are connected by at least one of the following connection modes:
bi-chain-styled, chain-styled, and densely-connected.
In one possible implementation, the n unit networks include at least one second unit network, and the second unit network includes an input node, an output node, and at least one feature aggregation node connected by second-type operators; the second operation space corresponding to the second-type operators contains the specified operations that depend on future data; and a combination of one or more operations in the second operation space is used to implement the second-type operator.
In one possible implementation, the at least one first unit network shares a topology and network parameters, and the at least one second unit network shares a topology and network parameters.
In one possible implementation, the specified operation designed not to depend on future data is a causal specified operation;
or,
the specified operation designed not to depend on future data is a mask-based specified operation.
In one possible implementation, the feature aggregation node is configured to perform at least one of a summation operation, a concatenation operation, and a product operation on the input data.
In one possible implementation, the specified operation includes at least one of a convolution operation, a pooling operation, an operation based on a long short-term memory (LSTM) network, and an operation based on a gated recurrent unit (GRU).
In one possible implementation, the speech recognition model contains an acoustic model and a decoding graph; the acoustic model is constructed based on a network search model; and the network search model is obtained by performing a neural architecture search on the initial network using speech training samples;
the speech data processing module 1102 is configured to:
process the streaming speech data by the acoustic model to obtain acoustic identification information of the streaming speech data, where the acoustic identification information includes phonemes, syllables, or semi-syllables; and
process the acoustic identification information of the streaming speech data by the decoding graph to obtain the speech recognition text.
In summary, in the solution shown in this embodiment of this application, in the operation space corresponding to the first-type operators of the initial network, the specified operations that need to depend on context information are set not to depend on future data, and a neural architecture search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with high accuracy can be found through the neural architecture search, the above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
Fig. 12 is a structural block diagram of a speech recognition apparatus according to an exemplary embodiment. The speech recognition apparatus can implement all or part of the steps of the method provided by the embodiment shown in Fig. 3 or Fig. 4, and the speech recognition apparatus includes:
a sample acquisition module 1201, configured to acquire speech training samples, where the speech training samples include speech samples and speech recognition labels corresponding to the speech samples;
a network search module 1202, configured to perform a neural architecture search on an initial network based on the speech training samples to obtain a network search model, where the initial network includes a plurality of feature aggregation nodes connected by first-type operators, the operation space corresponding to the first-type operators is a first operation space, and the specified operations in the first operation space that depend on context information are designed not to depend on future data; and
a model construction module 1203, configured to construct a speech recognition model based on the network search model, where the speech recognition model is configured to process input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
In one possible implementation, the speech recognition label includes acoustic identification information of the speech sample; the acoustic identification information includes phonemes, syllables, or semi-syllables;
the model construction module 1203 is configured to:
construct an acoustic model based on the network search model, where the acoustic model is configured to process the streaming speech data to obtain acoustic identification information of the streaming speech data; and
construct the speech recognition model based on the acoustic model and the decoding graph.
In summary, in the solution shown in this embodiment of this application, in the operation space corresponding to the first-type operators of the initial network, the specified operations that need to depend on context information are set not to depend on future data, and a neural architecture search is then performed on the initial network to construct the speech recognition model. Since specified operations that do not depend on future data are introduced into the model, and a model structure with high accuracy can be found through the neural architecture search, the above solution can reduce the recognition latency in streaming speech recognition scenarios while ensuring the accuracy of speech recognition, thereby improving the effect of streaming speech recognition.
Fig. 13 is a schematic structural diagram of a computer device according to an exemplary embodiment. The computer device may be implemented as the model training device and/or the speech recognition device in each of the above method embodiments. The computer device 1300 includes a central processing unit 1301, a system memory 1304 including a random access memory (RAM) 1302 and a read-only memory (ROM) 1303, and a system bus 1305 connecting the system memory 1304 to the central processing unit 1301. The computer device 1300 also includes a basic input/output system 1306 that facilitates the transfer of information between components within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown) such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, flash memory or other solid-state storage technology, CD-ROM or other optical storage, magnetic tape cartridges, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the above. The above system memory 1304 and mass storage device 1307 may be collectively referred to as memory.
The computer device 1300 may be connected to the Internet or other network devices through a network interface unit 1311 connected to the system bus 1305.
The memory also includes at least one computer instruction stored in the memory; the processor implements all or part of the steps of the method shown in Fig. 2, Fig. 3, or Fig. 4 by loading and executing the at least one computer instruction.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as a memory including a computer program (instructions) executable by a processor of a computer device to complete the methods shown in the embodiments of this application. For example, the non-transitory computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product or computer program is also provided, including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, so that the computer device performs the methods shown in the above embodiments.

Claims (16)

  1. A speech recognition method for a computer device, the method comprising:
    receiving streaming speech data;
    processing the streaming speech data by a speech recognition model to obtain speech recognition text corresponding to the streaming speech data, wherein the speech recognition model is obtained by performing a neural architecture search on an initial network; the initial network comprises a plurality of feature aggregation nodes connected by first-type operators; an operation space corresponding to the first-type operators is a first operation space; and specified operations in the first operation space that depend on context information are designed not to depend on future data; and
    outputting the speech recognition text.
  2. The method according to claim 1, wherein the initial network comprises n unit networks, the n unit networks comprise at least one first unit network, and the first unit network comprises an input node, an output node, and at least one feature aggregation node connected by the first-type operators.
  3. The method according to claim 2, wherein the n unit networks are connected by at least one of the following connection modes:
    bi-chain-styled, chain-styled, and densely-connected.
  4. The method according to claim 2, wherein the n unit networks comprise at least one second unit network, and the second unit network comprises an input node, an output node, and at least one feature aggregation node connected by second-type operators; a second operation space corresponding to the second-type operators contains the specified operation that depends on future data; and a combination of one or more operations in the second operation space is used to implement the second-type operator.
  5. The method according to claim 4, wherein
    the at least one first unit network shares a topology, or the at least one first unit network shares a topology and network parameters; and
    the at least one second unit network shares a topology, or the at least one second unit network shares a topology and network parameters.
  6. The method according to claim 1, wherein
    the specified operation designed not to depend on future data is a causal specified operation;
    or,
    the specified operation designed not to depend on future data is a mask-based specified operation.
  7. The method according to claim 1, wherein the feature aggregation node is configured to perform at least one of a summation operation, a concatenation operation, and a product operation on input data.
  8. The method according to any one of claims 1 to 7, wherein the specified operation comprises at least one of a convolution operation, a pooling operation, an operation based on a long short-term memory (LSTM) network, and an operation based on a gated recurrent unit (GRU).
  9. The method according to any one of claims 1 to 7, wherein the speech recognition model contains an acoustic model and a decoding graph; the acoustic model is constructed based on a network search model; and the network search model is obtained by performing a neural architecture search on the initial network using speech training samples; and
    the processing the streaming speech data by the speech recognition model to obtain the speech recognition text corresponding to the streaming speech data comprises:
    processing the streaming speech data by the acoustic model to obtain acoustic identification information of the streaming speech data, wherein the acoustic identification information comprises phonemes, syllables, or semi-syllables; and
    processing the acoustic identification information of the streaming speech data by the decoding graph to obtain the speech recognition text.
  10. A speech recognition method for a computer device, the method comprising:
    acquiring speech training samples, wherein the speech training samples comprise speech samples and speech recognition labels corresponding to the speech samples;
    performing a neural architecture search on an initial network based on the speech training samples to obtain a network search model, wherein the initial network comprises a plurality of feature aggregation nodes connected by first-type operators; an operation space corresponding to the first-type operators is a first operation space; and specified operations in the first operation space that depend on context information are designed not to depend on future data; and
    constructing a speech recognition model based on the network search model, wherein the speech recognition model is configured to process input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
  11. The method according to claim 10, wherein the speech recognition label comprises acoustic identification information of the speech sample; the acoustic identification information comprises phonemes, syllables, or semi-syllables; and
    the constructing a speech recognition model based on the network search model comprises:
    constructing an acoustic model based on the network search model, wherein the acoustic model is configured to process the streaming speech data to obtain acoustic identification information of the streaming speech data; and
    constructing the speech recognition model based on the acoustic model and the decoding graph.
  12. A speech recognition apparatus, comprising:
    a speech data receiving module, configured to receive streaming speech data;
    a speech data processing module, configured to process the streaming speech data by a speech recognition model to obtain speech recognition text corresponding to the streaming speech data, wherein the speech recognition model is obtained by performing a neural architecture search on an initial network; the initial network comprises a plurality of feature aggregation nodes connected by first-type operators; an operation space corresponding to the first-type operators is a first operation space; and specified operations in the first operation space that depend on context information are designed not to depend on future data; and
    a text output module, configured to output the speech recognition text.
  13. A speech recognition apparatus, comprising:
    a sample acquisition module, configured to acquire speech training samples, wherein the speech training samples comprise speech samples and speech recognition labels corresponding to the speech samples;
    a network search module, configured to perform a neural architecture search on an initial network based on the speech training samples to obtain a network search model, wherein the initial network comprises a plurality of feature aggregation nodes connected by first-type operators; an operation space corresponding to the first-type operators is a first operation space; and specified operations in the first operation space that depend on context information are designed not to depend on future data; and
    a model construction module, configured to construct a speech recognition model based on the network search model, wherein the speech recognition model is configured to process input streaming speech data to obtain speech recognition text corresponding to the streaming speech data.
  14. A computer device comprising a processor and a memory, wherein the memory stores at least one computer instruction, and the at least one computer instruction is loaded and executed by the processor to implement the speech recognition method according to any one of claims 1 to 11.
  15. A computer-readable storage medium storing at least one computer instruction, wherein the at least one computer instruction is loaded and executed by a processor to implement the speech recognition method according to any one of claims 1 to 11.
  16. A computer program product or computer program comprising computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, so that the computer device implements the speech recognition method according to any one of claims 1 to 11.
PCT/CN2022/070388 2021-01-12 2022-01-05 语音识别方法、装置、计算机设备及存储介质 WO2022152029A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023524506A JP2023549048A (ja) 2021-01-12 2022-01-05 音声認識方法と装置並びにコンピュータデバイス及びコンピュータプログラム
US17/987,287 US20230075893A1 (en) 2021-01-12 2022-11-15 Speech recognition model structure including context-dependent operations independent of future data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110036471.8A CN113516972B (zh) 2021-01-12 2021-01-12 语音识别方法、装置、计算机设备及存储介质
CN202110036471.8 2021-01-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/987,287 Continuation US20230075893A1 (en) 2021-01-12 2022-11-15 Speech recognition model structure including context-dependent operations independent of future data

Publications (1)

Publication Number Publication Date
WO2022152029A1 true WO2022152029A1 (zh) 2022-07-21

Family

ID=78060908

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/070388 WO2022152029A1 (zh) 2021-01-12 2022-01-05 语音识别方法、装置、计算机设备及存储介质

Country Status (4)

Country Link
US (1) US20230075893A1 (zh)
JP (1) JP2023549048A (zh)
CN (1) CN113516972B (zh)
WO (1) WO2022152029A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516972B (zh) * 2021-01-12 2024-02-13 腾讯科技(深圳)有限公司 语音识别方法、装置、计算机设备及存储介质
CN115937526B (zh) * 2023-03-10 2023-06-09 鲁东大学 基于搜索识别网络的双壳贝类性腺区域分割方法

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010204274A (ja) * 2009-03-02 2010-09-16 Toshiba Corp 音声認識装置、その方法及びそのプログラム
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN109036391A (zh) * 2018-06-26 2018-12-18 华为技术有限公司 语音识别方法、装置及系统
CN109448707A (zh) * 2018-12-18 2019-03-08 北京嘉楠捷思信息技术有限公司 一种语音识别方法及装置、设备、介质
CN110930980A (zh) * 2019-12-12 2020-03-27 苏州思必驰信息科技有限公司 一种中英文混合语音的声学识别模型、方法及系统
CN112185352A (zh) * 2020-08-31 2021-01-05 华为技术有限公司 语音识别方法、装置及电子设备
CN113516972A (zh) * 2021-01-12 2021-10-19 腾讯科技(深圳)有限公司 语音识别方法、装置、计算机设备及存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5914119B2 (ja) * 2012-04-04 2016-05-11 日本電信電話株式会社 音響モデル性能評価装置とその方法とプログラム
US20190043496A1 (en) * 2017-09-28 2019-02-07 Intel Corporation Distributed speech processing
CN110288084A (zh) * 2019-06-06 2019-09-27 北京小米智能科技有限公司 超网络训练方法和装置
CN110599999A (zh) * 2019-09-17 2019-12-20 寇晓宇 数据交互方法、装置和机器人
CN111582453B (zh) * 2020-05-09 2023-10-27 北京百度网讯科技有限公司 生成神经网络模型的方法和装置
CN111968635B (zh) * 2020-08-07 2024-03-05 北京小米松果电子有限公司 语音识别的方法、装置及存储介质


Also Published As

Publication number Publication date
CN113516972B (zh) 2024-02-13
CN113516972A (zh) 2021-10-19
JP2023549048A (ja) 2023-11-22
US20230075893A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
US20220139393A1 (en) Driver interface with voice and gesture control
WO2022152029A1 (zh) 语音识别方法、装置、计算机设备及存储介质
US10956480B2 (en) System and method for generating dialogue graphs
CN112528637B (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
US11562735B1 (en) Multi-modal spoken language understanding systems
US12008038B2 (en) Summarization of video artificial intelligence method, system, and apparatus
CN116250038A (zh) 变换器换能器:一种统一流式和非流式语音识别的模型
US20230154172A1 (en) Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
KR20240053639A (ko) 제한된 스펙트럼 클러스터링을 사용한 화자-턴 기반 온라인 화자 구분
KR20240068704A (ko) 준지도 스피치 인식을 위한 대조 샴 네트워크
WO2023084348A1 (en) Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
KR20190074508A (ko) 챗봇을 위한 대화 모델의 데이터 크라우드소싱 방법
US20240046921A1 (en) Method, apparatus, electronic device, and medium for speech processing
EP4154190B1 (en) Artificial intelligence system for sequence-to-sequence processing with dual causal and non-causal restricted self-attention adapted for streaming applications
US20230290345A1 (en) Code-Mixed Speech Recognition Using Attention and Language-Specific Joint Analysis
US11984125B2 (en) Speech recognition using on-the-fly-constrained language model per utterance
CN112530416B (zh) 语音识别方法、装置、设备和计算机可读介质
KR20220133064A (ko) 대화 요약 모델 학습 장치 및 방법
US10559298B2 (en) Discussion model generation system and method
US20220310061A1 (en) Regularizing Word Segmentation
CN113516996B (zh) 语音分离方法、装置、计算机设备及存储介质
KR20240093516A (ko) 멀티모달 융합 기반 딥 뉴럴 네트워크를 사용하는 멀티미디어 비디오들에서의 감정 인식
KR20220102934A (ko) 자연어 이해를 위한 그래프 변환 시스템 및 방법
JP2024525255A (ja) ストリーミングアプリケーション向けに適合された、デュアル因果的および非因果的な制限された自己注意を用いたSequence-to-Sequence処理のための人工知能システム
WO2024124133A1 (en) Video-text modeling with zero-shot transfer from contrastive captioners

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22738910

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023524506

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.11.2023)