CN113838466B - Speech recognition method, device, equipment and storage medium - Google Patents

Speech recognition method, device, equipment and storage medium

Info

Publication number
CN113838466B
CN113838466B
Authority
CN
China
Prior art keywords
feature extraction
network
voice
speech recognition
recognition model
Prior art date
Legal status
Active
Application number
CN202110668257.4A
Other languages
Chinese (zh)
Other versions
CN113838466A (en)
Inventor
苏丹 (Su Dan)
贺利强 (He Liqiang)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110668257.4A
Publication of CN113838466A
Application granted
Publication of CN113838466B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/26 — Speech to text systems
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/08 — Speech classification or search
    • G10L15/18 — Speech classification or search using natural language modelling
    • G10L15/183 — using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a speech recognition method, apparatus, device, and storage medium, and belongs to the field of computer technologies. The method includes: acquiring a first speech recognition model, where the first speech recognition model includes an input network, a first feature extraction unit, and an output network, and the first feature extraction unit includes an attention network; adding at least one feature extraction network to the first feature extraction unit at least once, and connecting it with the attention network, to obtain alternative speech recognition models; and selecting, according to the recognition performance of at least two alternative speech recognition models, a second speech recognition model for speech recognition from the at least two alternative speech recognition models. The structure of the second speech recognition model obtained in this way is freed from the limits of human experience, and a second speech recognition model with the required recognition performance can be obtained. Moreover, the second speech recognition model can use the attention mechanism to improve speech recognition performance.

Description

Speech recognition method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition technology converts speech into text through recognition and parsing. Related speech recognition technologies generally perform recognition based on a speech recognition model, which must first be constructed.
When constructing a speech recognition model, a technician usually determines the structure of the model manually and then trains the corresponding model according to the determined structure. However, a manually determined structure is limited by human experience, which may result in poor recognition performance of the speech recognition model.
Disclosure of Invention
The embodiments of the application provide a speech recognition method, apparatus, device, and storage medium, which can improve the recognition performance of a speech recognition model. The technical solution is as follows:
in one aspect, a method for speech recognition is provided, the method comprising:
acquiring a first speech recognition model, where the first speech recognition model includes an input network, a first feature extraction unit, and an output network, the connection modes among the input network, the first feature extraction unit, and the output network are determined, and the first feature extraction unit includes an attention network;
adding at least one feature extraction network to the first feature extraction unit at least twice, and connecting it with the attention network, to obtain an alternative speech recognition model each time;
and in response to obtaining at least two alternative speech recognition models, selecting, according to the recognition performance of the at least two alternative speech recognition models, a second speech recognition model for performing speech recognition from the at least two alternative speech recognition models.
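Viewed as pseudocode, the search procedure above amounts to: repeatedly sample a set of feature extraction networks, wire them into the unit that holds the attention network, score each resulting candidate, and keep the best. A minimal Python sketch follows; the network names, the size range, and the scoring stub are illustrative assumptions, not taken from the patent:

```python
import random

# Hypothetical first network set; the patent does not name concrete networks.
FIRST_NETWORK_SET = ["conv2d", "lstm", "fsmn", "tdnn"]

def build_candidate(rng):
    """Add at least one feature extraction network to the first feature
    extraction unit, which always contains the attention network."""
    count = rng.randint(1, 3)                      # any number from the first number range
    chosen = rng.sample(FIRST_NETWORK_SET, count)  # one second network set of that size
    return ["input"] + chosen + ["attention", "output"]

def recognition_performance(model):
    """Stub: in practice, train the candidate and score it on a test set
    (e.g. as 1 - word error rate)."""
    return -len(model)  # dummy score preferring smaller models, for illustration only

def search(num_candidates=4, seed=0):
    rng = random.Random(seed)
    candidates = [build_candidate(rng) for _ in range(num_candidates)]  # at least two
    # The second speech recognition model: the best candidate by recognition performance.
    return max(candidates, key=recognition_performance)
```

In a real system `build_candidate` would instantiate neural-network layers and `recognition_performance` would involve training and evaluation; the control flow, however, is the same sample-build-score-select loop.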
In a possible implementation manner, the adding at least one feature extraction network to the first feature extraction unit at least twice, and connecting with the attention network, to obtain an alternative speech recognition model includes:
selecting, at least twice, at least one feature extraction network from a first network set, adding the at least one feature extraction network to the first feature extraction unit, and connecting it with the attention network, to obtain an alternative speech recognition model;
wherein the first network set includes a plurality of alternative feature extraction networks.
In a possible implementation manner, the selecting at least one feature extraction network from the first network set includes:
selecting any number from a first number range;
and selecting that number of feature extraction networks from the first network set.
In one possible implementation manner, the first network set includes a plurality of different second network sets, and the selecting the number of feature extraction networks from the first network set includes:
determining a plurality of second network sets corresponding to the number from the first network sets, wherein each second network set corresponding to the number comprises the number of feature extraction networks;
and selecting each feature extraction network in one second network set corresponding to the number.
In a possible implementation manner, the first speech recognition model further includes a second feature extraction unit, the second feature extraction unit does not include the attention network, and the connection modes among the input network, the first feature extraction unit, the second feature extraction unit, and the output network are determined;
the method further includes: adding at least one feature extraction network to the second feature extraction unit at least once, to obtain an alternative speech recognition model.
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
acquiring a test set, where the test set includes first sample speech and first sample text corresponding to the first sample speech;
and recognizing the first sample speech based on each alternative speech recognition model respectively, and determining the recognition performance of each alternative speech recognition model according to the recognized text and the first sample text.
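The patent does not fix a concrete recognition-performance metric; a common choice for comparing recognized text against the reference sample text is word error rate (WER), computed from edit distance. A sketch under that assumption, where `model_decode` is a hypothetical callable that decodes one sample speech into text:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(row[j - 1] + 1,               # insertion
                           prev_row[j] + 1,              # deletion
                           prev_row[j - 1] + (r != h)))  # substitution
        prev_row = row
    return prev_row[-1]

def word_error_rate(ref_text, hyp_text):
    ref = ref_text.split()
    return edit_distance(ref, hyp_text.split()) / max(len(ref), 1)

def recognition_performance(model_decode, test_set):
    """Average WER of one candidate model over (first sample speech,
    first sample text) pairs; lower is better."""
    errors = [word_error_rate(text, model_decode(speech)) for speech, text in test_set]
    return sum(errors) / len(errors)
```

The model with the lowest average WER over the test set would then be selected as the second speech recognition model.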
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
acquiring a first training set, where the first training set includes second sample speech and second sample text corresponding to the second sample speech;
and recognizing the second sample speech based on each alternative speech recognition model respectively, and training each alternative speech recognition model according to the error between the recognized text and the second sample text.
In a possible implementation manner, after the selecting the second speech recognition model from the at least two candidate speech recognition models, the method further includes:
performing speech recognition based on the second speech recognition model.
In a possible implementation manner, before the performing speech recognition based on the second speech recognition model, the method further includes:
acquiring a second training set, where the second training set includes third sample speech and third sample text corresponding to the third sample speech;
and recognizing the third sample speech based on the second speech recognition model, and training the second speech recognition model according to the error between the recognized text and the third sample text.
In one aspect, a method for speech recognition is provided, the method comprising:
acquiring a first speech recognition model, where the first speech recognition model includes a plurality of networks, the connection mode among the plurality of networks is not determined, and the plurality of networks include an input network, an attention network, and an output network;
connecting at least one feature extraction unit with the plurality of networks in the first speech recognition model at least once, in at least two connection modes, to obtain at least two alternative speech recognition models;
and selecting, according to the recognition performance of the at least two alternative speech recognition models, a second speech recognition model for speech recognition from the at least two alternative speech recognition models.
In one possible implementation manner, the connecting at least one feature extraction unit with the multiple networks in the first speech recognition model at least once according to at least two connection modes to obtain at least two alternative speech recognition models includes:
at least one feature extraction unit is selected from a plurality of feature extraction units at least once, and a plurality of networks in the first voice recognition model are connected with the selected at least one feature extraction unit according to at least two connection modes, so that at least one alternative voice recognition model is obtained.
In a possible implementation manner, the selecting at least one feature extraction unit from the plurality of feature extraction units includes:
selecting any number from a second number range;
and selecting that number of feature extraction units from the plurality of feature extraction units.
In a possible implementation manner, the selecting the number of feature extraction units from the plurality of feature extraction units includes:
determining a plurality of unit sets corresponding to the number, wherein each unit set comprises the feature extraction units of the number;
each feature extraction unit in any unit set is selected.
In a possible implementation manner, the acquiring at least one feature extraction unit based on the plurality of feature extraction networks includes:
selecting a feature extraction network from a first network set, and determining the feature extraction network as the feature extraction unit; or,
selecting at least two feature extraction networks from the first network set, and connecting the at least two feature extraction networks to obtain the feature extraction unit;
wherein the first network set includes a plurality of alternative feature extraction networks.
In a possible implementation manner, the selecting at least two feature extraction networks from the first network set includes:
selecting any number from a first number range, where each number in the first number range is not less than 2;
and selecting that number of feature extraction networks from the first network set.
In one possible implementation manner, the first network set includes a plurality of different second network sets, and the selecting the number of feature extraction networks from the first network set includes:
determining a plurality of second network sets corresponding to the number from the first network sets, wherein each second network set corresponding to the number comprises the number of feature extraction networks;
and selecting each feature extraction network in one second network set corresponding to that number.
In a possible implementation manner, the connecting the at least two feature extraction networks to obtain the feature extraction unit includes:
and connecting the at least two feature extraction networks in at least two connection modes to obtain at least two feature extraction units.
In one possible implementation, the connection mode between the at least two feature extraction networks includes a bi-chain (bi-chain-styled) connection, a chain (chain-styled) connection, or a dense (densely-connected) connection.
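The three connection modes can be illustrated as edge sets over an ordered list of networks. The chain and densely-connected topologies below are standard; the bi-chain wiring (each network also receives its second predecessor's output) is our assumption, since the patent does not define it further:

```python
def connect(num_networks, mode):
    """Return directed edges (i, j) meaning network i feeds network j."""
    n = num_networks
    if mode == "chain":       # chain-styled: each network feeds only the next
        return [(i, i + 1) for i in range(n - 1)]
    if mode == "bi-chain":    # assumed bi-chain-styled: edges from the two predecessors
        return [(i, j) for j in range(n) for i in (j - 1, j - 2) if i >= 0]
    if mode == "dense":       # densely-connected: every earlier network feeds every later one
        return [(i, j) for i in range(n) for j in range(i + 1, n)]
    raise ValueError(f"unknown connection mode: {mode}")
```

Enumerating connection modes this way lets the search produce structurally different candidates from the same set of chosen feature extraction networks.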
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
acquiring a test set, where the test set includes first sample speech and first sample text corresponding to the first sample speech;
and recognizing the first sample speech based on each alternative speech recognition model respectively, and determining the recognition performance of each alternative speech recognition model according to the recognized text and the first sample text.
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
acquiring a first training set, where the first training set includes second sample speech and second sample text corresponding to the second sample speech;
and recognizing the second sample speech based on each alternative speech recognition model respectively, and training each alternative speech recognition model according to the error between the recognized text and the second sample text.
In a possible implementation manner, after the selecting the second speech recognition model from the at least two candidate speech recognition models, the method further includes:
and performing speech recognition based on the second speech recognition model.
In a possible implementation manner, before the performing speech recognition based on the second speech recognition model, the method further includes:
acquiring a second training set, where the second training set includes third sample speech and third sample text corresponding to the third sample speech;
and recognizing the third sample speech based on the second speech recognition model, and training the second speech recognition model according to the error between the recognized text and the third sample text.
In a possible implementation manner, after the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
creating a second feature extraction unit identical to the first feature extraction unit in response to a selection operation of the first feature extraction unit in the second speech recognition model;
and adding the second feature extraction unit into the second voice recognition model, and connecting the second feature extraction unit with the first feature extraction unit to obtain the updated second voice recognition model.
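The update step above, duplicating a selected feature extraction unit and connecting the copy next to the original, can be sketched with the model represented as an ordered list of units (a simplifying representation of ours, not the patent's):

```python
import copy

def duplicate_unit(model, index):
    """Create a unit identical to model[index] and connect it
    immediately after the original, returning the updated model."""
    clone = copy.deepcopy(model[index])  # identical second feature extraction unit
    return model[:index + 1] + [clone] + model[index + 1:]
```

The deep copy matters: the duplicated unit has the same structure but independent state, so the two units can diverge during subsequent training.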
In a possible implementation manner, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, where the speech feature includes C channel dimensions, T time dimensions, and F frequency dimensions, and C, T, and F are all positive integers;
The process of speech recognition based on the attention network comprises:
transforming the shape of the speech feature to T×Z, so that the transformed speech feature no longer includes the channel and frequency dimensions and the feature size in each time dimension is Z, where Z is the product of C and F;
and determining attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature based on the attention weights, restoring the shape of the weighted speech feature to C×T×F, and outputting the restored speech feature.
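The reshape-attend-restore procedure can be written directly in NumPy. The reshaping to T×Z (Z = C·F) and the restoration to C×T×F follow the description above; the particular weight computation (unparameterised scaled dot-product self-attention with a softmax over time) is an illustrative assumption, since the patent leaves it open:

```python
import numpy as np

def attention_over_time(x):
    """x: speech feature of shape (C, T, F). Returns the same shape."""
    C, T, F = x.shape
    z = x.transpose(1, 0, 2).reshape(T, C * F)   # C x T x F -> T x Z, Z = C * F
    scores = z @ z.T / np.sqrt(z.shape[1])       # (T, T) time-to-time scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # attention weights (softmax over time)
    weighted = w @ z                             # weight the transformed feature
    return weighted.reshape(T, C, F).transpose(1, 0, 2)  # restore to C x T x F
```

Flattening channel and frequency into one axis lets a single attention operation relate whole time steps to each other, rather than attending separately per channel.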
In another aspect, there is provided a speech recognition apparatus, the apparatus comprising:
a model acquisition module, configured to acquire a first speech recognition model, where the first speech recognition model includes an input network, a first feature extraction unit, and an output network, the connection modes among the input network, the first feature extraction unit, and the output network are determined, and the first feature extraction unit includes an attention network;
the network adding module is used for adding at least one feature extraction network into the first feature extraction unit at least twice and connecting with the attention network to obtain an alternative voice recognition model;
The model selection module is used for responding to obtaining at least two alternative voice recognition models, and selecting a second voice recognition model for carrying out voice recognition from the at least two alternative voice recognition models according to the recognition performance of the at least two alternative voice recognition models.
In one possible implementation manner, the model acquisition module is configured to connect the plurality of first feature extraction units in a bi-chain (bi-chain-styled), chain (chain-styled), or dense (densely-connected) connection manner, to obtain a unit chain; and to connect the input network and the output network at the two ends of the unit chain respectively, to obtain the first speech recognition model.
In a possible implementation manner, the network adding module is configured to add the at least one feature extraction network to the first feature extraction unit in a different manner, and connect with the attention network to obtain different alternative speech recognition models.
In one possible implementation manner, the network adding module is configured to add the at least one feature extraction network to the first feature extraction unit, and connect it with the attention network in a bi-chain (bi-chain-styled), chain (chain-styled), or dense (densely-connected) connection manner, to obtain the alternative speech recognition model.
In a possible implementation manner, the first voice recognition model includes a plurality of first feature extraction units, and a connection manner between the plurality of first feature extraction units is determined; the connection mode of a plurality of networks in the first feature extraction unit is different from the connection mode of a plurality of first feature extraction units.
In one possible implementation manner, the first speech recognition model includes N-1 first feature extraction units and N unit groups, each unit group includes M second feature extraction units, where N is an integer greater than 1 and M is a positive integer, and the second feature extraction units do not include the attention network; the networks in the first speech recognition model are connected as follows: the two ends of the first speech recognition model are the input network and the output network, one unit group is connected after the input network, one unit group is connected before the output network, and one first feature extraction unit is connected between every two unit groups.
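The layout in this paragraph can be checked with a small builder: the input network, then N unit groups of M second feature extraction units each, with one first feature extraction unit (the one containing the attention network) between each pair of adjacent groups, then the output network. The layer names below are illustrative:

```python
def build_layout(n_groups, m_units):
    """N unit groups separated by N-1 first feature extraction units."""
    layers = ["input"]
    for g in range(n_groups):
        layers += [f"group{g}_unit{j}" for j in range(m_units)]  # M second units
        if g < n_groups - 1:
            layers.append(f"first_unit{g}")  # contains the attention network
    layers.append("output")
    return layers
```

For N groups the builder always emits exactly N-1 attention-bearing first units, matching the claim.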
In one possible implementation, the apparatus further includes:
a model updating module for creating a fourth feature extraction unit identical to the third feature extraction unit in response to a selection operation of the third feature extraction unit in the second speech recognition model; and adding the fourth feature extraction unit into the second voice recognition model, and connecting the fourth feature extraction unit with the third feature extraction unit to obtain the updated second voice recognition model.
In a possible implementation manner, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, where the speech feature includes C channel dimensions, T time dimensions, and F frequency dimensions, and C, T, and F are all positive integers;
the process of speech recognition based on the attention network comprises:
transforming the shape of the speech feature to T×Z, so that the transformed speech feature no longer includes the channel and frequency dimensions and the feature size in each time dimension is Z, where Z is the product of C and F;
and determining attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature based on the attention weights, restoring the shape of the weighted speech feature to C×T×F, and outputting the restored speech feature.
In one possible implementation manner, the network adding module includes:
the network selection sub-module is used for selecting at least one feature extraction network from the first network set at least twice;
A network adding sub-module, configured to add the at least one feature extraction network to the first feature extraction unit, and connect with the attention network to obtain an alternative speech recognition model;
wherein the first network set includes a plurality of alternative feature extraction networks.
In one possible implementation manner, the network selection submodule includes:
a number selecting unit for selecting any number from the first number range;
and the network selection unit is used for selecting the number of feature extraction networks from the first network set.
In a possible implementation manner, the first network set includes a plurality of different second network sets, and the network selection unit is configured to determine, from the first network sets, a plurality of second network sets corresponding to the number, where each second network set corresponding to the number includes the number of feature extraction networks; and selecting each feature extraction network in one second network set corresponding to the number.
In a possible implementation manner, the first speech recognition model further comprises a second feature extraction unit, the second feature extraction unit does not comprise the attention network, and the connection modes among the input network, the first feature extraction unit, the second feature extraction unit and the output network are determined; the network adding module is further configured to add at least one feature extraction network to the second feature extraction unit at least once to obtain an alternative speech recognition model.
In one possible implementation, the apparatus further includes:
the performance determining module is configured to acquire a test set, where the test set includes first sample speech and first sample text corresponding to the first sample speech; and to recognize the first sample speech based on each alternative speech recognition model respectively, and determine the recognition performance of each alternative speech recognition model according to the recognized text and the first sample text.
In one possible implementation, the apparatus further includes:
the first training module is configured to acquire a first training set, where the first training set includes second sample speech and second sample text corresponding to the second sample speech; and to recognize the second sample speech based on each alternative speech recognition model respectively, and train each alternative speech recognition model according to the error between the recognized text and the second sample text.
In one possible implementation, the apparatus further includes:
and the speech recognition module is configured to perform speech recognition based on the second speech recognition model.
In one possible implementation, the apparatus further includes:
the second training module is configured to acquire a second training set, where the second training set includes third sample speech and third sample text corresponding to the third sample speech; and to recognize the third sample speech based on the second speech recognition model, and train the second speech recognition model according to the error between the recognized text and the third sample text.
In another aspect, there is provided a speech recognition apparatus, the apparatus comprising:
a model acquisition module, configured to acquire a first speech recognition model, where the first speech recognition model includes a plurality of networks, the connection mode among the plurality of networks is not determined, and the plurality of networks include an input network, an attention network, and an output network;
the network connection module is used for connecting at least one feature extraction unit with a plurality of networks in the first voice recognition model at least once in at least two connection modes to obtain at least two alternative voice recognition models;
the model selection module is used for selecting a second voice recognition model for voice recognition from the at least two alternative voice recognition models according to the recognition performance of the at least two alternative voice recognition models.
In one possible implementation, the connection mode includes a bi-chain (bi-chain-styled) connection, a chain (chain-styled) connection, or a dense (densely-connected) connection.
In one possible implementation, the apparatus includes:
and the unit acquisition module is used for acquiring at least one feature extraction unit based on a plurality of feature extraction networks, and each acquired feature extraction unit comprises at least one feature extraction network.
In one possible implementation manner, the network connection module includes:
the unit selection sub-module is used for selecting at least one feature extraction unit from the plurality of feature extraction units at least once;
and the unit connection sub-module is used for connecting a plurality of networks in the first voice recognition model with at least one selected feature extraction unit according to at least two connection modes to obtain at least one alternative voice recognition model.
In one possible implementation, the unit selection sub-module includes:
a first number selecting unit for selecting any number from the second number range;
and a unit selecting unit configured to select the number of feature extraction units from the plurality of feature extraction units.
In a possible implementation manner, the unit selecting unit is configured to determine a plurality of unit sets corresponding to the number, where each unit set includes the number of feature extraction units; each feature extraction unit in any unit set is selected.
In one possible implementation manner, the unit acquisition module includes:
the first unit acquisition submodule is used for selecting a feature extraction network from the first network set and determining the feature extraction network as the feature extraction unit; or,
the second unit obtaining submodule is used for selecting at least two feature extraction networks from the first network set, and connecting the at least two feature extraction networks to obtain the feature extraction unit;
wherein the first network set includes a plurality of alternative feature extraction networks.
In one possible implementation manner, the second unit obtaining sub-module includes:
a second number selecting unit configured to select any number from a first number range, the number in the first number range being not less than 2;
and the network selection unit is used for selecting the number of feature extraction networks from the first network set.
In a possible implementation manner, the first network set includes a plurality of different second network sets, and the network selection unit is configured to determine, from the first network sets, a plurality of second network sets corresponding to the number, where each second network set corresponding to the number includes the number of feature extraction networks; and selecting each feature extraction network in one second network set corresponding to the number.
In one possible implementation, the second unit obtaining sub-module is configured to connect the at least two feature extraction networks in at least two connection manners, so as to obtain at least two feature extraction units.
In one possible implementation, the connection manner between the at least two feature extraction networks includes a bi-chain-style (double-chain) connection, a chain-style connection, or a dense-connected connection.
In one possible implementation, the apparatus further includes:
the performance determining module is configured to obtain a test set, where the test set includes a first sample voice and a first sample text corresponding to the first sample voice; recognize the first sample voice based on each alternative voice recognition model; and determine the recognition performance of each alternative voice recognition model according to the recognized text and the first sample text.
In one possible implementation, the apparatus further includes:
the first training module is configured to obtain a first training set, where the first training set includes a second sample voice and a second sample text corresponding to the second sample voice; recognize the second sample voice based on each alternative voice recognition model; and train each alternative voice recognition model according to the error between the recognized text and the second sample text.
In one possible implementation, the apparatus further includes:
and the voice recognition module is used for carrying out voice recognition based on the second voice recognition model.
In one possible implementation, the apparatus further includes:
the second training module is configured to obtain a second training set, where the second training set includes a third sample voice and a third sample text corresponding to the third sample voice; recognize the third sample voice based on the second voice recognition model; and train the second voice recognition model according to the error between the recognized text and the third sample text.
In one possible implementation, the apparatus further includes:
a model updating module, configured to create a second feature extraction unit identical to the first feature extraction unit in response to a selection operation on the first feature extraction unit in the second voice recognition model; add the second feature extraction unit to the second voice recognition model; and connect the second feature extraction unit with the first feature extraction unit to obtain an updated second voice recognition model.
In one possible implementation, during voice recognition based on the second voice recognition model, the shape of the voice feature input to the attention network is C×T×F, where C is the size of the channel dimension of the voice feature, T is the size of the time dimension, F is the size of the frequency dimension, and C, T, and F are all positive integers;
the process of speech recognition based on the attention network comprises:
transforming the shape of the voice feature to T×Z, so that the transformed voice feature no longer includes the channel dimension and the frequency dimension, and the feature size at each time step is Z, where Z is the product of C and F;
and determining attention weights corresponding to the voice feature based on the transformed voice feature, weighting the transformed voice feature based on the attention weights, restoring the shape of the weighted voice feature to C×T×F, and outputting the shape-restored voice feature.
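The reshape-attend-restore procedure described above can be sketched as follows. This is a minimal NumPy sketch that assumes a single-head scaled dot-product attention computed from the transformed features themselves; the patent does not fix the exact attention variant:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_time(feat):
    """feat: a voice feature of shape (C, T, F).

    Flattens the channel and frequency dimensions so attention operates
    over the time dimension only, then restores the original shape.
    """
    C, T, F = feat.shape
    Z = C * F
    # (C, T, F) -> (T, Z): each time step becomes one Z-dimensional vector
    x = feat.transpose(1, 0, 2).reshape(T, Z)
    # attention weights derived from the transformed features (assumption:
    # scaled dot-product self-attention)
    scores = x @ x.T / np.sqrt(Z)      # (T, T)
    weights = softmax(scores, axis=-1)
    weighted = weights @ x             # (T, Z): weighted voice features
    # restore the shape to (C, T, F) and output
    return weighted.reshape(T, C, F).transpose(1, 0, 2)
```

The output shape equals the input shape C×T×F, so the attention network can be dropped into the feature extraction unit without changing the surrounding wiring.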
In another aspect, an electronic device is provided that includes a processor and a memory having stored therein at least one computer program that is loaded and executed by the processor to implement the operations performed in the speech recognition method in any of the possible implementations described above.
In another aspect, a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement operations performed in a speech recognition method in any of the possible implementations described above is provided.
In yet another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising a computer program stored in a computer readable storage medium. A processor of an electronic device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the electronic device performs the operations performed in the speech recognition method in the above-described various alternative implementations.
The technical solutions provided in the embodiments of the present application have at least the following beneficial effects:
In the embodiments of the present application, the structure of the voice recognition model is not entirely designed by hand. Instead, a plurality of alternative voice recognition models are created automatically by adding feature extraction networks to the first voice recognition model, and the required second voice recognition model is then selected from the alternative voice recognition models according to recognition performance, so that the structure of the obtained second voice recognition model is not limited by human experience. Moreover, the second voice recognition model includes an attention network, so that the attention mechanism can be used during voice recognition to improve the recognition performance of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment provided by embodiments of the present application;
FIG. 2 is a flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a first speech recognition model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative speech recognition model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an attention network according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 8 is a flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 9 is a block diagram of a speech recognition device provided in an embodiment of the present application;
FIG. 10 is a block diagram of a speech recognition device provided in an embodiment of the present application;
FIG. 11 is a block diagram of a speech recognition device provided in an embodiment of the present application;
FIG. 12 is a block diagram of a speech recognition device provided in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," "fourth," and the like used herein may be used to describe various concepts, but the concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, without departing from the scope of the present application, a first training sample may be referred to as a second training sample, and similarly, a second training sample may be referred to as a first training sample.
As used herein, "at least one" includes one, two, or more; "a plurality" includes two or more; "each" refers to every one of a corresponding plurality; and "any one" refers to any one of the plurality. For example, if a plurality of feature extraction networks includes 3 feature extraction networks, "each" refers to every one of the 3 feature extraction networks, and "any one" refers to the first, the second, or the third of the 3 feature extraction networks.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
Key technologies of speech technology (Speech Technology) include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and speech is expected to become one of the best modes of human-computer interaction.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use daily, and is therefore closely related to the study of linguistics. Natural language processing technologies include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine learning (Machine Learning, ML) is a multi-field interdiscipline involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
According to the solutions provided in the embodiments of the present application, a voice recognition model can be obtained based on artificial intelligence technologies such as speech technology, natural language processing, and machine learning, and voice recognition can be performed through the voice recognition model.
FIG. 1 is a schematic diagram of an implementation environment provided by embodiments of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 are connected by a wireless or wired network. Optionally, the terminal 101 is a smart phone, tablet, notebook, desktop, smart speaker, smart watch, vehicle-mounted terminal, video camera, smart hardware/home, medical device, or other terminal. Optionally, the server 102 is a stand-alone physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms.
Optionally, the terminal 101 has installed thereon a target application served by the server 102, through which the terminal 101 can realize functions such as data transmission and message interaction. Optionally, the target application is a target application in the operating system of the terminal 101 or a target application provided by a third party. The target application has a voice recognition function; of course, the target application can also have other functions, which are not limited in the embodiments of the present application. Optionally, the target application is a short-video application, a music application, a game application, a shopping application, a chat application, or another application, which is not limited in the present disclosure.
In this embodiment, the terminal 101 or the server 102 is configured to obtain a first speech recognition model, adjust a structure of the first speech recognition model to obtain a second speech recognition model, and perform speech recognition based on the second speech recognition model. Alternatively, the server 102 is configured to adjust the first speech recognition model to obtain a second speech recognition model, send the second speech recognition model to the terminal 101, and then the terminal 101 receives the second speech recognition model and performs speech recognition based on the second speech recognition model.
The speech recognition method in the present application can be applied to various scenes of speech recognition. For example, after the server acquires the second voice recognition model through the voice recognition method provided by the application, a calling interface of the second voice recognition model is provided for the terminal, and after the terminal receives the voice input by the user, the terminal calls the second voice recognition model to recognize the voice input by the user based on the calling interface of the second voice recognition model, and the corresponding text is output. Or after the server acquires the second voice recognition model, the terminal acquires the second voice recognition model from the server, stores the second voice recognition model, and subsequently, after receiving the voice input by the user, invokes the stored second voice recognition model to recognize the voice and output the corresponding text.
The voice recognition method provided in the embodiments of the present application can also be applied to intelligent question-answering scenarios. For example, after the terminal obtains the second voice recognition model through the method provided in the present application, it performs voice recognition on the input voice to obtain the corresponding text, then obtains a reply text corresponding to that text, and outputs the reply text, or converts the reply text into voice and outputs the converted voice. For example, after the user inputs the voice "how is the weather today", the terminal recognizes the voice to obtain the corresponding text "how is the weather today" and searches for the reply text corresponding to that text; if the found reply text is "sunny", the terminal outputs the text, or converts it into the voice "sunny" and outputs the voice.
In fact, the voice recognition method provided by the application can also be applied to other voice recognition scenes, and the embodiment of the application is not limited to this.
Fig. 2 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 2, this embodiment includes:
201. the electronic device acquires a first voice recognition model, wherein the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection modes among the input network, the first feature extraction unit and the output network are determined, and the first feature extraction unit comprises an attention network.
The function of the first speech recognition model is to perform speech recognition, that is, to convert the voice input into the first speech recognition model into the corresponding text. The first speech recognition model includes an input network, a first feature extraction unit, and an output network. The input network is used for extracting features of the input voice and outputting voice features. Optionally, the voice features include MFCC (Mel-Frequency Cepstral Coefficients), Fbank (Filter bank based features), and the like, and the voice features are in the form of a spectrogram. The first feature extraction unit is used for further feature extraction of the input voice features and outputting voice features. The output network is used for converting the input voice features into the corresponding text and outputting the text. The first feature extraction unit includes an attention network, which performs further feature extraction on the input voice features and uses the attention mechanism to ensure the accuracy of the extracted features. Optionally, the first feature extraction unit further includes other networks besides the attention network, such as a convolutional network or a pooling network.
Optionally, the number of the first feature extraction units is a plurality. Optionally, a connection manner among the input network, the first feature extraction unit and the output network is an arbitrary connection manner, which is not limited in the embodiment of the present application.
It should be noted that the voice features in the embodiments of the present application may also be referred to as feature maps (Feature Mapping) between the hidden layers of a neural network.
202. The electronic device adds at least one feature extraction network to the first feature extraction unit at least once and connects with the attention network to obtain an alternative speech recognition model.
The feature extraction network is used for further extracting features from the input voice features and outputting voice features. Feature extraction networks include convolutional networks, pooling networks, and the like. Moreover, for the same type of feature extraction network, networks of various structures are included. For example, convolutional networks include a 1×1 convolutional network, a 3×3 convolutional network, and the like.
Optionally, the connection manner of the at least one feature extraction network and the attention network is any connection manner, which is not limited in the embodiment of the present application.
In the case where at least one feature extraction network is added to the first feature extraction unit a plurality of times, the number of feature extraction networks added each time by the electronic device is the same or different. For example, 1 feature extraction network is added the first time, and either 1 or 2 feature extraction networks are added the second time. In the case where the numbers of feature extraction networks added each time are the same, the feature extraction networks added each time are the same or different. For example, one 1×1 convolutional network is added the first time, and either one 3×3 convolutional network or another 1×1 convolutional network is added the second time. In the case where the same feature extraction networks are added multiple times, the manner in which they are added to the first feature extraction unit differs. For example, one 1×1 convolutional network is added both the first and the second time, but the two additions differ in position: the first time, the convolutional network is added as an upper-layer network of the attention network in the first feature extraction unit; the second time, it is added as a lower-layer network of the attention network.
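The single-addition case described above can be sketched as an enumeration over candidate networks and insertion positions relative to the attention network. The network list and the two positions are illustrative assumptions; the patent's actual search space is larger:

```python
from itertools import product

# Illustrative candidates and positions (assumptions, not from the patent)
NETWORKS = ["conv1x1", "conv3x3", "pool"]
POSITIONS = ["above_attention", "below_attention"]

def candidate_units():
    """Enumerate single-addition alternatives for the first feature
    extraction unit: one feature extraction network inserted either
    above or below the attention network."""
    units = []
    for net, pos in product(NETWORKS, POSITIONS):
        if pos == "above_attention":
            units.append([net, "attention"])   # network feeds attention
        else:
            units.append(["attention", net])   # attention feeds network
    return units
```

Each returned list is the layer order of one alternative feature extraction unit; each alternative unit yields one alternative speech recognition model.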
203. The electronic device, in response to obtaining the at least two alternative speech recognition models, selects a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to recognition performance of the at least two alternative speech recognition models.
The recognition performance represents the speech recognition effect of the speech recognition model, and the better the recognition performance is, the better the speech recognition effect is. Alternatively, the recognition performance of the speech recognition model is represented by the recognition accuracy. Of course, the identification performance can also be represented by other parameters, such as identification efficiency, which is not limited by the embodiments of the present application.
Optionally, the second speech recognition model is the candidate speech recognition model with the highest recognition accuracy among the at least two candidate speech recognition models. Or the second voice recognition model is an alternative voice recognition model with the simplest structure, wherein the recognition accuracy reaches an accuracy threshold. Or, the second speech recognition model is any alternative speech recognition model with the recognition accuracy reaching the accuracy threshold. Alternatively, the second speech recognition model is the most efficient candidate speech recognition model. Alternatively, the second speech recognition model is any alternative speech recognition model for which the recognition efficiency reaches an efficiency threshold. The above-mentioned second speech recognition model selected from the alternative speech recognition models is merely an exemplary illustration, and the embodiments of the present application are not limited thereto.
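Two of the selection criteria listed above can be sketched directly. The candidate records and their fields (accuracy, layer count as a structural-simplicity proxy) are illustrative assumptions, not values from the patent:

```python
# Hypothetical candidate records: (model_name, recognition_accuracy, num_layers)
candidates = [
    ("cand_a", 0.91, 12),
    ("cand_b", 0.94, 20),
    ("cand_c", 0.92, 8),
]

# Criterion 1: the alternative model with the highest recognition accuracy.
best_by_accuracy = max(candidates, key=lambda c: c[1])

# Criterion 2: among alternatives whose accuracy reaches a threshold,
# the one with the simplest structure (fewest layers here).
threshold = 0.92
simplest_above = min(
    (c for c in candidates if c[1] >= threshold),
    key=lambda c: c[2],
)
```

With these illustrative values, criterion 1 picks `cand_b` and criterion 2 picks `cand_c`, showing that the two criteria can disagree; which one applies is a deployment choice.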
It should be noted that, steps 202-203 are actually a process of searching the structure of the second speech recognition model, that is, searching to obtain a plurality of candidate speech recognition models based on the first speech recognition model, where the structures of the plurality of candidate speech recognition models are different, and then the second speech recognition model can be selected from the candidate speech recognition models.
After the second speech recognition model is selected, speech recognition can be performed based on the second speech recognition model, for example, the target speech is input into the second speech recognition model, and the second speech recognition model recognizes the target speech and outputs the text corresponding to the target speech.
It should be noted that, the speech recognition model in the present application is any neural network, for example, CNN (Convolutional Neural Networks, convolutional neural network), which is not limited in this embodiment of the present application.
In the embodiments of the present application, the structure of the voice recognition model is not entirely designed by hand. Instead, a plurality of alternative voice recognition models are created automatically by adding feature extraction networks to the first voice recognition model, and the required second voice recognition model is then selected from the alternative voice recognition models according to recognition performance, so that the structure of the obtained second voice recognition model is not limited by human experience. Moreover, the second voice recognition model includes an attention network, so that the attention mechanism can be used during voice recognition to improve the recognition performance of the model.
Fig. 3 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 3, this embodiment includes:
301. the electronic device acquires a first voice recognition model, wherein the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection modes among the input network, the first feature extraction unit and the output network are determined, and the first feature extraction unit comprises an attention network.
In one possible implementation, the electronic device obtains the first voice recognition model by: connecting a plurality of first feature extraction units in a bi-chain-style, chain-style, or dense-connected manner to obtain a unit chain; and connecting an input network and an output network at the two ends of the unit chain, respectively, to obtain the first voice recognition model. Of course, the plurality of first feature extraction units can also be connected in other connection manners, which is not limited in the embodiments of the present application.
In the embodiment of the application, the plurality of feature extraction units are connected according to the determined connection mode, so that in the process of searching the structure based on the first voice recognition model to obtain the alternative voice recognition model, the connection mode among the plurality of feature extraction units does not need to participate in searching, and the efficiency of searching the model structure can be improved.
In one possible implementation, the first speech recognition model further comprises a second feature extraction unit, the second feature extraction unit does not comprise an attention network, and the connection between the input network, the first feature extraction unit, the second feature extraction unit and the output network is determined.
Optionally, the connection modes among the input network, the first feature extraction unit, the second feature extraction unit and the output network are arbitrary connection modes. Optionally, the number of second feature extraction units in the first speech recognition model is any number.
Optionally, in the case where the first voice recognition model includes the second feature extraction unit, the electronic device obtains the first voice recognition model by: connecting the plurality of first feature extraction units and the plurality of second feature extraction units according to a target connection manner to obtain a unit chain; and connecting an input network and an output network at the two ends of the unit chain, respectively, to obtain the first voice recognition model. Optionally, the target connection manner includes bi-chain-style, chain-style, or dense-connected, which is not limited in the embodiments of the present application.
In one possible implementation, the first speech recognition model includes N-1 first feature extraction units and N unit groupings, where each unit grouping includes M second feature extraction units and the second feature extraction units do not include an attention network. The networks in the first voice recognition model are connected as follows: the two ends of the first voice recognition model are the input network and the output network; a unit grouping is connected after the input network, a unit grouping is connected before the output network, and a first feature extraction unit is connected between every two unit groupings. N is an integer greater than 1, and M is a positive integer; for example, N is 3 and M is 5, which is not limited in the embodiments of the present application. Taking N as 3 as an example, the connection order of the networks in the first speech recognition model is: input network, unit grouping, first feature extraction unit, unit grouping, first feature extraction unit, unit grouping, output network.
It should be noted that experiments show that the first speech recognition model connected in the above manner has better speech recognition performance than models connected in other manners.
FIG. 4 is a schematic structural diagram of a first speech recognition model. Referring to FIG. 4, the number of first feature extraction units in the first speech recognition model is 2, and the number of unit groupings is 3, that is, the number of second feature extraction units is 3×M. The structure of the first speech recognition model is: the input network is followed by M second feature extraction units, which are followed by a first feature extraction unit, which is followed by M second feature extraction units, which are followed by another first feature extraction unit, which is followed by M second feature extraction units, which are followed by the output network, where M is any positive integer. Optionally, the input network includes two convolutional layers. Optionally, the output network includes a fully connected layer and a normalization layer; of course, the input network and the output network can also include other layers, which is not limited in the embodiments of the present application.
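The layout just described can be sketched as a small builder that emits the network sequence for any N and M. The labels are placeholders for the actual networks; this is a structural sketch only, not actual layers:

```python
def build_first_model_layout(n, m):
    """Return the network order of the first speech recognition model:
    N unit groupings of M second feature extraction units each, with a
    first feature extraction unit (containing the attention network)
    between consecutive groupings -- N-1 first units in total."""
    layout = ["input_network"]
    for i in range(n):
        layout += ["second_unit"] * m
        if i < n - 1:  # first units sit only between groupings
            layout.append("first_unit")
    layout.append("output_network")
    return layout
```

For N = 3 and M = 5 this reproduces the FIG. 4 arrangement: three groupings of five second units separated by two first units, bracketed by the input and output networks.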
302. The electronic device selects at least one feature extraction network from a first network set at least once, adds the at least one feature extraction network to the first feature extraction unit, and connects it with the attention network to obtain an alternative voice recognition model.
Wherein the first network set includes a plurality of alternative feature extraction networks. Optionally, there are various types of alternative feature extraction networks, such as convolutional networks and pooling networks. Optionally, for the same type of feature extraction network, networks of a plurality of structures are included. For example, convolutional networks include a 1×1 convolutional network, a 3×3 convolutional network, and the like.
In one possible implementation, the electronic device selects at least one feature extraction network from the first network set by: selecting any number from a first number range; and selecting that number of feature extraction networks from the first network set. Optionally, the first number range is any number range, for example, 1-10, which is not limited in the embodiments of the present application.
The number of feature extraction networks selected by the electronic device determines the number of network layers in the first feature extraction unit. For example, if the first feature extraction unit contains one layer of attention network and the electronic device selects 3 feature extraction networks and adds them to the first feature extraction unit, the number of network layers in the first feature extraction unit is 4.
FIG. 5 is a schematic diagram of an alternative speech recognition model. Referring to fig. 5, the plurality of feature extraction units in the alternative speech recognition model are connected in a bi-chain manner: the input features of every feature extraction unit except the first one are the output features of the two preceding feature extraction units. Each feature extraction unit internally comprises 4 feature extraction networks connected in a dense manner, i.e., every two feature extraction networks are connected with each other, so that the input features of each feature extraction network in the unit are the output features of all feature extraction networks preceding it within the unit. Optionally, the 4 feature extraction networks within the feature extraction unit are arbitrary networks.
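The dense connectivity inside a unit can be illustrated with toy networks operating on lists, where each network consumes the concatenation of the unit input and all earlier networks' outputs (everything here is illustrative, not the patent's implementation):

```python
# Sketch of dense connectivity inside a feature extraction unit: every
# network receives the concatenation of the unit input and all earlier
# networks' outputs. The "networks" here are toy functions over lists.
def dense_unit(x, networks):
    outputs = [x]                         # unit input counts as the first feature
    for net in networks:
        concatenated = sum(outputs, [])   # concat all preceding features
        outputs.append(net(concatenated))
    return outputs[-1]

double = lambda xs: [2 * v for v in xs]
result = dense_unit([1, 2], [double, double])
```

The second toy network sees four values (the unit input plus the first network's output), mirroring how later layers in a dense block receive ever-wider inputs.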
In one possible implementation, the first network set includes a plurality of different second network sets. Correspondingly, the electronic device selects the number of feature extraction networks from the first network set as follows: the electronic device determines, from the first network set, the plurality of second network sets corresponding to the number, and selects each feature extraction network in one of those second network sets. Each second network set corresponding to the number contains exactly that number of feature extraction networks.
Since the first network set contains a plurality of different feature extraction networks, there may be a plurality of combinations for a given number of selected networks. For example, when 2 feature extraction networks are selected, the combination may be: one convolutional network and one pooling network, or two convolutional networks of different structures, or two convolutional networks of the same structure, etc. Thus, the first network set includes a plurality of second network sets, each corresponding to one combination of feature extraction networks.
In the embodiment of the application, since the first network set includes a plurality of second network sets, each corresponding to one combination of feature extraction networks, selecting feature extraction networks via the second network sets improves the efficiency of selecting any number of feature extraction networks from the first network set and ensures that a different combination is selected each time, so that the alternative speech recognition models constructed from the selected networks have different structures.
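One way to enumerate such combinations is with `itertools.combinations_with_replacement`; this sketch treats each combination as one second network set (the helper name is hypothetical, and repetition is allowed so that, e.g., two convolutions of the same structure form a valid set):

```python
from itertools import combinations_with_replacement

# Sketch: enumerate the "second network sets" for a given count -- each is
# one combination of feature extraction networks drawn from the first set.
def second_network_sets(first_network_set, count):
    return list(combinations_with_replacement(sorted(first_network_set), count))

sets_of_two = second_network_sets(["conv", "pool"], 2)
```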
In one possible implementation, the first speech recognition model includes a plurality of first feature extraction units, and the connection manner between them is predetermined; the connection manner of the plurality of networks within a first feature extraction unit is different from the connection manner between the first feature extraction units. For example, the first feature extraction units are connected in a bi-chain manner, while the networks within each first feature extraction unit are connected in a dense manner. In this way, the connection modes between the structural modules in the alternative speech recognition model are enriched, which in turn enriches the structural types of the alternative speech recognition models.
In one possible implementation, the electronic device adds the at least one feature extraction network to the first feature extraction unit and connects it with the attention network in a bi-chain, chain, or dense connection manner to obtain an alternative speech recognition model. Of course, the at least one feature extraction network and the attention network can also be connected in other manners, which is not limited in the embodiments of the present application.
In the embodiment of the application, after the at least one feature extraction network is obtained, it can be connected with the attention network in the first feature extraction unit in a plurality of manners, thereby obtaining a plurality of alternative speech recognition models with different structures. This increases the number of alternative speech recognition models and makes it convenient to select a second speech recognition model with higher recognition performance from among them. Moreover, since an attention network is arranged in each alternative speech recognition model, the selected second speech recognition model can use the attention mechanism to improve recognition performance when performing speech recognition.
In a possible implementation manner, in case the first speech recognition model further comprises a second feature extraction unit, the method further comprises: the electronic device adds at least one feature extraction network to the second feature extraction unit at least once, resulting in an alternative speech recognition model.
Optionally, in the case where the second feature extraction unit does not include a feature extraction network, the electronic device adds the at least one feature extraction network to the second feature extraction unit as follows: the electronic device connects the feature extraction networks to obtain a network chain, determines the input end of the network chain as the input end of the second feature extraction unit, and determines the output end of the network chain as the output end of the second feature extraction unit. In the case where the second feature extraction unit already includes a feature extraction network, the electronic device connects the at least one feature extraction network with the original feature extraction network in the second feature extraction unit to obtain a network chain, and likewise determines the input and output ends of the chain as the input and output ends of the unit.
Optionally, the electronic device adds the at least one feature extraction network to the second feature extraction unit in different adding modes to obtain different alternative speech recognition models. Optionally, in the case where the second feature extraction unit does not include a feature extraction network, the electronic device connects the feature extraction networks in different connection manners to obtain different network chains; in the case where the second feature extraction unit already includes a feature extraction network, the electronic device connects the at least one feature extraction network with the original feature extraction network in different manners to obtain different network chains. In either case, the input end of the resulting network chain is determined as the input end of the second feature extraction unit, and the output end of the chain as the output end of the unit.
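A hedged sketch of assembling the second feature extraction unit's network chain in different ways; the `order` parameter and all names are hypothetical, chosen only to illustrate that different link orders yield different chains:

```python
# Sketch of forming the second feature extraction unit's network chain:
# the new networks are linked (in some chosen order) after any original
# network; the chain's ends become the unit's input and output ends.
def build_chain(original, new_networks, order=None):
    ordered = [new_networks[i] for i in order] if order else list(new_networks)
    chain = ([original] if original else []) + ordered
    return {"input": chain[0], "output": chain[-1], "chain": chain}

unit_a = build_chain("conv0", ["conv3x3", "pool"])            # keeps original network
unit_b = build_chain(None, ["conv3x3", "pool"], order=[1, 0]) # reordered, no original
```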
The method of obtaining the at least one feature extraction network added to the second feature extraction unit is the same as the method of obtaining the at least one feature extraction network added to the first feature extraction unit, and will not be described herein.
It should be noted that step 302 is in effect a search over the structure of the first feature extraction unit; since the first speech recognition model includes the first feature extraction unit, searching the structure of the first feature extraction unit amounts to searching the structure of the alternative speech recognition model. Similarly, when the first speech recognition model includes the second feature extraction unit, the process in which the electronic device adds at least one feature extraction network to the second feature extraction unit at least once to obtain candidate speech recognition models is a search over the structure of the second feature extraction unit. Optionally, the two searches are performed separately. For example, the electronic device first searches the structure of the first feature extraction unit and then that of the second: it adds at least one feature extraction network to the first feature extraction unit multiple times to obtain multiple candidate speech recognition models, and then, for each candidate model, adds at least one feature extraction network to its second feature extraction unit at least once to obtain new candidate speech recognition models. Alternatively, the electronic device first searches the structure of the second feature extraction unit and then that of the first: it adds at least one feature extraction network to the second feature extraction unit multiple times to obtain multiple candidate speech recognition models, and then, for each candidate model, adds at least one feature extraction network to its first feature extraction unit at least once to obtain new candidate speech recognition models.
Alternatively, the structures of the first and second feature extraction units are searched simultaneously: each time at least one feature extraction network is added to the first feature extraction unit to obtain an alternative speech recognition model, at least one feature extraction network is also added to the second feature extraction unit in that model to obtain a new alternative speech recognition model.
Optionally, the first feature extraction unit and the second feature extraction unit have different network structures apart from the attention network. The time and frequency dimensions of the speech feature output by the first feature extraction unit are each halved relative to the speech feature input to it, i.e., the time- and frequency-domain resolution is halved. The feature size of the speech feature output by the second feature extraction unit remains unchanged relative to the speech feature input to it, i.e., the time- and frequency-domain resolution is unchanged.
It should be noted that step 302, in which at least one feature extraction network is added to the first feature extraction unit at least once and connected with the attention network, is one implementation of obtaining the alternative speech recognition models. Of course, the alternative speech recognition models can also be obtained in other ways; for example, the at least one feature extraction network is added to the first feature extraction unit in different manners and connected with the attention network, yielding different alternative speech recognition models.
In the embodiment of the application, after at least one feature extraction network is obtained, adding it to the first feature extraction unit in different adding modes yields a plurality of alternative speech recognition models with different structures. This expands the number of alternative speech recognition models and makes it convenient to select a second speech recognition model with higher recognition performance from among them.
Optionally, the electronic device adds the at least one feature extraction network to the first feature extraction unit as follows: the electronic device connects the at least one feature extraction network and the attention network in the first feature extraction unit to obtain a network chain, determines the input end of the network chain as the input end of the first feature extraction unit, and determines the output end of the network chain as the output end of the first feature extraction unit. Correspondingly, to obtain different alternative speech recognition models, the electronic device connects the at least one feature extraction network and the attention network in different connection manners to obtain different network chains, determining the ends of each chain as the ends of the unit in the same way. Optionally, this is done by first connecting the feature extraction networks in different connection manners to obtain a network chain, and then connecting the attention network to the output end of that chain to obtain the final network chain.
303. The electronic device determines the recognition performance of each alternative speech recognition model in response to obtaining at least two alternative speech recognition models.
In one possible implementation, the electronic device determines the recognition performance of each alternative speech recognition model as follows: the electronic device acquires a test set, which includes first sample speech and first sample text corresponding to the first sample speech; the electronic device recognizes the first sample speech based on each alternative speech recognition model, and determines the recognition performance of each model according to the recognized text and the first sample text.
In the embodiment of the application, the recognition performance of each alternative voice recognition model is determined by using the test set, so that the second voice recognition model with good voice recognition performance can be conveniently selected from the alternative voice recognition models, and the recognition performance of voice recognition based on the second voice recognition model is ensured.
Optionally, the electronic device determines the recognition performance of each alternative speech recognition model according to the recognized text and the first sample text as follows: the electronic device determines a loss value for each alternative speech recognition model according to the recognized text and the first sample text. The loss value represents the recognition performance of the alternative speech recognition model and is negatively correlated with it, i.e., the smaller the loss value, the better the recognition performance. Optionally, the electronic device determines the loss value of each alternative speech recognition model through any loss function, which is not limited by the embodiments of the present application.
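As one hypothetical instance of such a loss value, an edit-distance-based error rate between the recognized text and the sample text can stand in for the loss (lower is better); the patent does not prescribe this particular function:

```python
# Sketch of one way to score recognition performance on the test set: a
# simple edit-distance-based error rate, standing in for the loss value
# described above. Purely illustrative.
def edit_distance(a, b):
    # classic one-row dynamic-programming Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def error_rate(recognized, reference):
    return edit_distance(recognized, reference) / max(len(reference), 1)

score = error_rate("helo world", "hello world")  # one missing character
```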
In one possible implementation, before the electronic device determines the recognition performance of each alternative speech recognition model, each model is trained as follows: the electronic device acquires a first training set, which includes second sample speech and second sample text corresponding to the second sample speech; the electronic device recognizes the second sample speech based on each alternative speech recognition model, and trains each model according to the error between the recognized text and the second sample text.
Optionally, the electronic device trains each alternative speech recognition model according to the error between the recognized text and the second sample text by adjusting the model parameters of each alternative speech recognition model so that the error between the text recognized by the adjusted model and the second sample text decreases. Optionally, the number of second sample speech items in the first training set is any number, which is not limited in the embodiments of the present application.
In the embodiment of the application, before the recognition performance of each candidate speech recognition model is determined, the first training set is utilized to train each candidate speech recognition model, and when the second speech recognition model is selected from the candidate speech recognition models based on the recognition performance, the second speech recognition model with strong learning ability and generalization ability can be selected.
304. The electronic device selects a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to their recognition performance.
Optionally, the electronic device selects the second speech recognition model from the at least two alternative speech recognition models according to their recognition performance as follows: the electronic device selects the alternative speech recognition model with the best recognition performance as the second speech recognition model. For example, the electronic device selects the model with the highest recognition accuracy, which ensures the highest accuracy for subsequent speech recognition based on the second speech recognition model.
Optionally, the electronic device selects the second speech recognition model according to the recognition accuracy of the at least two alternative speech recognition models as follows: from the alternative speech recognition models whose recognition accuracy exceeds an accuracy threshold, the electronic device selects the one with the simplest model structure as the second speech recognition model. This guarantees the accuracy of speech recognition with the second speech recognition model while improving its efficiency. Of course, the second speech recognition model can also be selected in other manners; for example, any one of the alternative speech recognition models whose recognition accuracy exceeds the accuracy threshold is selected as the second speech recognition model, which is not limited in the embodiments of the present application.
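Both selection strategies can be sketched over toy `(name, accuracy, parameter_count)` records; the tuple layout, values, and threshold are illustrative only:

```python
# Sketch of the two selection strategies described above: best accuracy,
# or simplest structure among models above an accuracy threshold. Models
# are (name, accuracy, parameter_count) tuples; all values illustrative.
def select_best_accuracy(models):
    return max(models, key=lambda m: m[1])

def select_simplest_above_threshold(models, threshold):
    eligible = [m for m in models if m[1] > threshold]
    return min(eligible, key=lambda m: m[2])  # fewest parameters = simplest

candidates = [("A", 0.91, 5_000_000), ("B", 0.93, 9_000_000),
              ("C", 0.92, 3_000_000)]
best = select_best_accuracy(candidates)
simplest = select_simplest_above_threshold(candidates, 0.90)
```

Here `best` is model B (highest accuracy) while `simplest` is model C (fewest parameters among those above the threshold), illustrating how the two criteria can pick different models.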
In one possible implementation manner, after the electronic device selects the second speech recognition model from the at least two alternative speech recognition models according to their recognition performance, the method further includes: in response to a selection operation on a third feature extraction unit in the second speech recognition model, the electronic device creates a fourth feature extraction unit identical to the third feature extraction unit; the electronic device then adds the fourth feature extraction unit to the second speech recognition model and connects it with the third feature extraction unit to obtain an updated second speech recognition model. Optionally, the third feature extraction unit is any feature extraction unit in the second speech recognition model.
Optionally, the electronic device adds the fourth feature extraction unit to the second speech recognition model and connects it with the third feature extraction unit as follows: the electronic device inserts the fourth feature extraction unit between the third feature extraction unit and an adjacent feature extraction unit, and the inserted fourth feature extraction unit is connected to the units in the layers above and below it in the same manner as the third feature extraction unit was connected before the insertion. Of course, the electronic device can also add the fourth feature extraction unit to the second speech recognition model in other ways, which is not limited in the embodiments of the present application. Optionally, the number of fourth feature extraction units is any number, and each is added to the second speech recognition model in the same manner.
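Representing the model as a list of units, the deepening step reduces to cloning the selected unit and inserting the copy next to it; the dict-based units below are purely illustrative:

```python
import copy

# Sketch of deepening the model: clone the selected (third) unit and
# insert the copy right after it, preserving the chain order. Units here
# are plain dicts standing in for feature extraction units.
def deepen(units, index):
    clone = copy.deepcopy(units[index])   # fourth unit identical to third unit
    return units[:index + 1] + [clone] + units[index + 1:]

model = [{"name": "u1"}, {"name": "u2"}, {"name": "u3"}]
deeper = deepen(model, 1)   # duplicates u2
```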
In the embodiment of the application, the depth of the second voice recognition model can be increased by adding the same feature extraction unit as the existing feature extraction unit into the obtained second voice recognition model, so that the recognition performance of the second voice recognition model is further improved.
305. The electronic device performs speech recognition based on the second speech recognition model.
In one possible implementation manner, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, where C is the number of channel dimensions, T the number of time dimensions, and F the number of frequency dimensions, with C, T, and F all positive integers. The process of speech recognition based on the attention network includes: the electronic device transforms the shape of the speech feature into T×Z, where Z is the product of C and F, so that the transformed feature no longer has separate channel and frequency dimensions and the feature size in each time dimension is Z; the electronic device then determines the attention weights corresponding to the speech feature based on the transformed feature, weights the transformed feature by those attention weights, restores the shape of the weighted feature to C×T×F, and outputs the restored feature.
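The reshaping around the attention network can be sketched with NumPy; identity weights (each time step attending only to itself) stand in for the learned attention weights, which the patent leaves to the attention network itself:

```python
import numpy as np

# Sketch of the shape handling around the attention network: flatten the
# channel and frequency dimensions into Z = C * F, apply T x T attention
# weights over time, then restore the C x T x F shape.
def attention_reshape(feature, weights):
    C, T, F = feature.shape
    flat = feature.transpose(1, 0, 2).reshape(T, C * F)    # -> T x Z
    weighted = weights @ flat                              # (T x T) @ (T x Z)
    return weighted.reshape(T, C, F).transpose(1, 0, 2)    # -> C x T x F

C, T, F = 2, 4, 3
x = np.arange(C * T * F, dtype=float).reshape(C, T, F)
identity_weights = np.eye(T)   # placeholder for learned attention weights
y = attention_reshape(x, identity_weights)
```

With identity weights the round trip leaves the feature unchanged, confirming that the two reshapes are exact inverses.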
In the embodiment of the application, the shape of the voice feature is firstly transformed based on the attention network, so that the transformed voice feature does not contain channel dimension and frequency dimension, when the attention weight is generated based on the transformed voice feature, the attention weight is not limited to the voice feature in the channel, but can be generated by combining the inter-channel correlation of the voice feature, so that the generated attention weight is more accurate, the accuracy of the voice feature output by the attention network is improved, and the voice recognition performance is further improved.
Alternatively, the attention network is any type of attention network, such as a Self-Attention network, a Multi-Head Self-Attention network, or the like, to which the embodiments of the present application are not limited.
Fig. 6 is a schematic structural diagram of an attention network. Referring to fig. 6, the part in the dashed box represents the attention network, and the networks above and below it are convolutional networks. The speech feature input to the attention network has shape C×T×F; it is first reshaped into a feature of shape T×Z, where Z is the product of C and F. This T×Z feature is then mapped through three fully connected layers to obtain the Q (queries), K (keys), and V (values) of the attention mechanism, each a matrix formed from the input speech features at different time steps; Q has shape T×Z and K is transposed to shape Z×T. Multiplying Q by the transposed K yields a T×T matrix that represents, for each time step, the correlation between its input feature and the input features at all other time steps. The T×T matrix is normalized by a softmax layer to obtain the attention weights, which are multiplied with V (of shape T×Z) to produce the weighted speech feature of shape T×Z. Finally, the weighted feature is reshaped back to C×T×F and output.
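The Q/K/V computation described for fig. 6 can be sketched with NumPy, with random matrices standing in for the three fully connected layers (all shapes, seeds, and names are illustrative assumptions):

```python
import numpy as np

# Sketch of the attention computation in fig. 6: project the flattened
# T x Z feature to Q, K, V, form T x T correlations, softmax-normalize
# them row-wise, and apply the weights to V.
def self_attention(flat, wq, wk, wv):
    q, k, v = flat @ wq, flat @ wk, flat @ wv              # each T x Z
    scores = q @ k.T                                       # T x T correlations
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = scores / scores.sum(axis=1, keepdims=True)   # softmax over rows
    return weights @ v                                     # weighted T x Z output

rng = np.random.default_rng(0)
T, Z = 4, 6
flat = rng.normal(size=(T, Z))                   # reshaped speech feature
wq, wk, wv = (rng.normal(size=(Z, Z)) for _ in range(3))
out = self_attention(flat, wq, wk, wv)
```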
It should be noted that, in the embodiment of the present application, the attention mechanism is fully utilized to better model long-range temporal correlations, so that a model structure with better performance can be searched out for the neural network that uses the attention mechanism.
In one possible implementation, before the electronic device performs speech recognition based on the second speech recognition model, the method further includes: the electronic device acquires a second training set, which includes third sample speech and third sample text corresponding to the third sample speech; the electronic device recognizes the third sample speech based on the second speech recognition model, and trains the second speech recognition model according to the error between the recognized text and the third sample text. The second training set is different from the first training set. Optionally, the number of third sample speech items in the second training set is any number, which is not limited in the embodiments of the present application. The process of training the second speech recognition model with the second training set is the same as that of training the alternative speech recognition models with the first training set, and is not described again here.
In the embodiment of the application, after the second voice recognition model is obtained by searching, the second voice recognition model is trained through the second training set, so that the generalization capability of the second voice recognition model can be improved, and the recognition performance of the second voice recognition model is improved.
In the embodiment of the application, the structure of the speech recognition model is not entirely designed by hand; instead, a plurality of alternative speech recognition models are created automatically by adding feature extraction networks to the first speech recognition model, and the desired second speech recognition model is then selected from them according to recognition performance, so that the structure of the resulting second speech recognition model is not limited by human experience. Moreover, since the second speech recognition model includes an attention network, it can use the attention mechanism to improve its speech recognition performance.
In the embodiment of the application, the plurality of feature extraction units are connected according to the determined connection mode, so that in the process of searching the structure based on the first voice recognition model to obtain the alternative voice recognition model, the connection mode among the plurality of feature extraction units does not need to participate in searching, and the efficiency of searching the model structure can be improved.
In the embodiment of the application, since the first network set includes a plurality of second network sets, each corresponding to one combination of feature extraction networks, selecting feature extraction networks via the second network sets improves the efficiency of selecting any number of feature extraction networks from the first network set and ensures that a different combination is selected each time, so that the alternative speech recognition models constructed from the selected networks have different structures.
In the embodiment of the application, after at least one feature extraction network is obtained, adding it to the first feature extraction unit in different adding modes yields a plurality of alternative speech recognition models with different structures. This expands the number of alternative speech recognition models and makes it convenient to select a second speech recognition model with higher recognition performance from among them.
In the embodiment of the application, after the at least one feature extraction network is obtained, it can be connected with the attention network in the first feature extraction unit in a plurality of manners, thereby obtaining a plurality of alternative speech recognition models with different structures. This increases the number of alternative speech recognition models and makes it convenient to select a second speech recognition model with higher recognition performance from among them. Moreover, since an attention network is arranged in each alternative speech recognition model, the selected second speech recognition model can use the attention mechanism to improve recognition performance when performing speech recognition.
In the embodiment of the application, the recognition performance of each alternative voice recognition model is determined by using the test set, so that the second voice recognition model with good voice recognition performance can be conveniently selected from the alternative voice recognition models, and the voice recognition effect of voice recognition based on the second voice recognition model is ensured.
In the embodiment of the application, before the recognition performance of each candidate speech recognition model is determined, the first training set is utilized to train each candidate speech recognition model, and when the second speech recognition model is selected from the candidate speech recognition models based on the recognition performance, the second speech recognition model with strong learning ability and generalization ability can be selected.
In the embodiment of the application, the depth of the second voice recognition model can be increased by adding the same feature extraction unit as the existing feature extraction unit into the obtained second voice recognition model, so that the recognition performance of the second voice recognition model is further improved.
In the embodiment of the application, the shape of the voice feature is firstly transformed based on the attention network, so that the transformed voice feature does not contain channel dimension and frequency dimension, when the attention weight is generated based on the transformed voice feature, the attention weight is not limited to the voice feature in the channel, but can be generated by combining the inter-channel correlation of the voice feature, so that the generated attention weight is more accurate, the accuracy of the voice feature output by the attention network is improved, and the voice recognition performance is further improved.
In the embodiment of the application, after the second voice recognition model is obtained by searching, the second voice recognition model is trained through the second training set, so that the generalization capability of the second voice recognition model can be improved, and the recognition performance of the second voice recognition model is improved.
Fig. 7 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 7, this embodiment includes:
701. the electronic device obtains a first voice recognition model, wherein the first voice recognition model comprises a plurality of networks, the connection mode among the networks is not determined, and the networks comprise an input network, an attention network and an output network.
Optionally, the first speech recognition model comprises a convolutional network, a pooled network, or other network in addition to the input network, the attention network, and the output network.
Optionally, in the first speech recognition model, the number of each type of network other than the input network and the output network is any number, which is not limited in the embodiments of the present application.
702. At least one feature extraction unit is connected with a plurality of networks in the first voice recognition model at least once by the electronic equipment according to at least two connection modes, so that at least two alternative voice recognition models are obtained.
In the case of connecting at least one feature extraction unit with the plurality of networks in the first speech recognition model multiple times, the number of feature extraction units connected by the electronic device each time is the same or different. For example, 1 feature extraction unit is connected the first time, and either 1 or 2 feature extraction units are connected the second time. In the case of multiple connections in which the number of feature extraction units connected each time is the same, the feature extraction units connected each time are the same or different. For example, the feature extraction unit connected the first time is a 1×1 convolutional network, and the feature extraction unit connected the second time is a 3×3 convolutional network, or is also a 1×1 convolutional network. When the feature extraction units connected each time are the same, the manner in which the at least one feature extraction unit is connected with the plurality of networks in the first speech recognition model differs between connections; for example, the feature extraction unit connected the first time and the second time is the same 1×1 convolutional network, but the manner in which that convolutional network is connected with the plurality of networks in the first speech recognition model is different each time.
Wherein, the at least one feature extraction unit is connected with the plurality of networks in the first voice recognition model according to at least two connection modes: each feature extraction unit in the at least one feature extraction unit and each network in the first speech recognition model are connection objects, and after the plurality of connection objects are connected, at least two alternative speech recognition models are obtained, wherein each alternative speech recognition model corresponds to one connection mode.
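As an illustration of how distinct candidate structures multiply with unit choices and connection manners, the following sketch pairs each combination of feature extraction units with each connection manner (all names are hypothetical; the embodiment does not prescribe an enumeration algorithm):

```python
import itertools

# Connection manners named elsewhere in the text, treated here as opaque labels.
CONNECTION_MODES = ["chain-style", "bi-chain-style", "dense-connected"]

def enumerate_candidates(feature_units, max_units=2):
    """Each (unit combination, connection manner) pair defines one
    candidate speech recognition model structure."""
    candidates = []
    for k in range(1, max_units + 1):
        for combo in itertools.combinations(feature_units, k):
            for mode in CONNECTION_MODES:
                candidates.append({"units": combo, "mode": mode})
    return candidates

cands = enumerate_candidates(["conv1x1", "conv3x3"], max_units=2)
# 3 unit combinations x 3 connection manners = 9 candidate structures
```

Because every pair differs in either its unit combination or its connection manner, each candidate corresponds to a structurally distinct model, matching the requirement that each alternative model corresponds to one connection manner.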
703. The electronic device selects a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models.
In the embodiment of the present application, the structure of the speech recognition model is not designed entirely by a user; instead, a plurality of candidate speech recognition models are created automatically by connecting at least one feature extraction unit with the plurality of networks existing in the first speech recognition model, and a required second speech recognition model is then selected from the candidate speech recognition models according to recognition performance, so that the structure of the obtained second speech recognition model is freed from the limitations of human experience. Moreover, since the second speech recognition model includes an attention network, the second speech recognition model can use the attention mechanism to improve speech recognition performance when performing speech recognition.
Fig. 8 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 8, this embodiment includes:
801. the electronic device obtains a first voice recognition model, wherein the first voice recognition model comprises a plurality of networks, the connection mode among the networks is not determined, and the networks comprise an input network, an attention network and an output network.
802. The electronic equipment selects at least one feature extraction unit from the plurality of feature extraction units at least once, and connects the at least one feature extraction unit with a plurality of networks in the first voice recognition model according to at least two connection modes to obtain at least two alternative voice recognition models.
Optionally, at least one feature extraction unit selected by the electronic device each time is different, so that the electronic device can obtain a plurality of different alternative voice recognition models by connecting the at least one feature extraction unit selected each time with a plurality of networks in the first voice recognition model according to at least two connection modes.
In one possible implementation manner, the electronic device selecting at least one feature extraction unit from the plurality of feature extraction units includes: the electronic device selects any number from a second number range, and then selects that number of feature extraction units from the plurality of feature extraction units. Optionally, the second number range is any number range, for example 1-5, which is not limited in the embodiments of the present application.
In one possible implementation, the electronic device selects the number of feature extraction units from a plurality of feature extraction units, including: the method comprises the steps that the electronic equipment determines a plurality of unit sets corresponding to the number, and each unit set comprises a feature extraction unit with the number; the electronic device selects each feature extraction unit in any unit set. Wherein each set of cells corresponds to a combination of feature extraction cells.
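A unit set of size k is, in effect, one k-combination of the available feature extraction units. A minimal sketch of the selection just described, assuming the 1-5 second number range given above (unit names are illustrative):

```python
import itertools
import random

def select_feature_extraction_units(units, count_range=(1, 5), seed=None):
    """Pick a count from the second number range, enumerate all unit sets
    (combinations) of that size, and choose one set at random."""
    rng = random.Random(seed)
    k = rng.randint(count_range[0], min(count_range[1], len(units)))
    unit_sets = list(itertools.combinations(units, k))  # one set per combination form
    return list(rng.choice(unit_sets))

available = ["conv1x1", "conv3x3", "pooling", "attention"]
chosen = select_feature_extraction_units(available, seed=0)
```

Because `itertools.combinations` never repeats a combination, two different unit sets always correspond to different combination forms, which is the property the surrounding text relies on.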
In the embodiment of the application, since each unit set corresponds to a combination form of feature extraction units, the feature extraction units are selected by using the unit sets, so that the combination forms of the feature extraction units selected each time can be ensured to be different, and the structure of the alternative voice recognition model constructed based on the selected feature extraction units is ensured to be different.
The connection manner between the at least one feature extraction unit and the plurality of networks in the first speech recognition model includes a chain-style connection, a bi-chain-style connection, a dense-connected connection, or the like, which is not limited in the embodiments of the present application.
In the embodiment of the application, a plurality of connection manners between the feature extraction unit and the networks in the first speech recognition model are provided, so that after at least one feature extraction unit is acquired, the plurality of networks in the first speech recognition model and the selected at least one feature extraction unit can be connected in a plurality of connection manners, yielding a plurality of alternative speech recognition models with different structures, increasing the number of alternative speech recognition models, and making it convenient to select, from among the many alternative speech recognition models, a second speech recognition model with higher recognition performance.
In one possible implementation manner, at least one feature extraction unit is connected to a plurality of networks in the first speech recognition model at least once in at least two connection manners, and before at least two alternative speech recognition models are obtained, the electronic device needs to obtain the feature extraction unit first, where the implementation manner is as follows: the electronic device obtains at least one feature extraction unit based on a plurality of feature extraction networks, each feature extraction unit obtained comprising at least one feature extraction network.
In one possible implementation, the electronic device obtains at least one feature extraction unit based on a plurality of feature extraction networks, including: the electronic equipment selects a feature extraction network from the first network set, and determines the feature extraction network as a feature extraction unit; or the electronic equipment selects at least two feature extraction networks from the first network set, and connects the at least two feature extraction networks to obtain a feature extraction unit; wherein the first network set includes a plurality of alternative feature extraction networks.
In the embodiment of the application, the feature extraction unit is obtained based on the feature extraction network, that is, the internal structure of the feature extraction unit is searched, that is, when the structure of the voice recognition model is searched, the embodiment of the application searches not only the macrostructure of the voice recognition model, but also the microstructure of the interior of the feature extraction unit, so that the alternative voice recognition model with more abundant structure types can be obtained, and the second voice recognition model with high voice recognition performance is conveniently selected.
In one possible implementation, the electronic device selecting at least two feature extraction networks from the first network set includes: the electronic device selects any number from a first number range, each number in the first number range being not less than 2, and then selects that number of feature extraction networks from the first network set.
In one possible implementation, the first network set includes a plurality of different second network sets, and selecting that number of feature extraction networks from the first network set includes: the electronic device determines, from the first network set, a plurality of second network sets corresponding to the number, each of which includes that number of feature extraction networks; the electronic device then selects each feature extraction network in any one second network set corresponding to the number. It should be noted that the implementation of selecting at least one feature extraction network from the first network set is described in step 302 and will not be repeated here.
In one possible implementation manner, after the electronic device selects at least two feature extraction networks from the first network set, the at least two feature extraction networks are connected to obtain a feature extraction unit, including: and the electronic equipment connects the at least two feature extraction networks in at least two connection modes to obtain at least two feature extraction units.
In the embodiment of the application, after at least two feature extraction networks are selected from the first network set, the at least two feature extraction networks are connected in at least two connection manners, so that a plurality of feature extraction units with different structures can be obtained, expanding the structure types of the feature extraction units; searching the structure of the speech recognition model based on the plurality of feature extraction units then expands the number of alternative speech recognition models, making it convenient to select, from among the many alternative speech recognition models, a second speech recognition model with higher recognition performance.
In one possible implementation, the connection manner between the at least two feature extraction networks includes a bi-chain-style (double-chain) connection, a chain-style connection, a dense-connected connection, or the like.
In the embodiment of the application, a plurality of connection manners between feature extraction networks are provided, so that after at least two feature extraction networks are acquired, they can be connected in a plurality of connection manners to obtain a plurality of feature extraction units with different structures, expanding the structure types of the feature extraction units; searching the structure of the speech recognition model based on the plurality of feature extraction units then expands the number of alternative speech recognition models, making it convenient to select, from among the many alternative speech recognition models, a second speech recognition model with higher recognition performance.
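To make the difference between the connection manners concrete, the toy functions below contrast a chain-style pass with a dense-connected pass. They are purely illustrative: real feature extraction networks would be convolutional or pooling layers, and summation stands in for channel concatenation so that shapes stay fixed:

```python
def chain_forward(x, layers):
    """Chain-style: each network feeds only the next one."""
    for f in layers:
        x = f(x)
    return x

def dense_forward(x, layers):
    """Dense-connected: each network sees the combined outputs of all
    earlier networks (summed here instead of concatenated)."""
    outputs = [x]
    for f in layers:
        outputs.append(f(sum(outputs)))
    return sum(outputs)

double = lambda v: 2 * v      # stand-in for one feature extraction network
add_one = lambda v: v + 1     # stand-in for another
```

With the same two stand-in networks, the two connection manners already produce different outputs, which is why each connection manner yields a structurally distinct feature extraction unit.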
803. The electronic device determines recognition performance for each of the candidate speech recognition models.
In one possible implementation, the electronic device determines recognition performance for each of the candidate speech recognition models, including: the electronic equipment acquires a test set, wherein the test set comprises first sample voice and first sample text corresponding to the first sample voice; and respectively recognizing the first sample voice based on each alternative voice recognition model, and determining the recognition performance of each alternative voice recognition model according to the text obtained by recognition and the first sample text.
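One common way to score the recognized text against the first sample text is the character error rate, i.e. edit distance normalized by reference length. The metric choice here is an assumption; the embodiment only requires that the two texts be compared:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between a reference and a hypothesis string."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def character_error_rate(refs, hyps):
    """Total edit errors over total reference characters; lower is better."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total
```

A lower character error rate on the test set would then indicate better recognition performance for the corresponding alternative model.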
In the embodiment of the application, the recognition performance of each alternative voice recognition model is determined by using the test set, so that the second voice recognition model with good voice recognition performance can be conveniently selected from the alternative voice recognition models, and the recognition performance of voice recognition based on the second voice recognition model is ensured.
In one possible implementation, before the electronic device determines the recognition performance of each candidate speech recognition model, training each candidate speech recognition model is performed by: the electronic equipment acquires a first training set, wherein the first training set comprises a second sample voice and a second sample text corresponding to the second sample voice; the electronic equipment respectively identifies the second sample voice based on each alternative voice identification model, and trains each alternative voice identification model according to errors between the identified text and the second sample text.
In the embodiment of the application, before the recognition performance of each candidate speech recognition model is determined, the first training set is utilized to train each candidate speech recognition model, and when the second speech recognition model is selected from the candidate speech recognition models based on the recognition performance, the second speech recognition model with strong learning ability and generalization ability can be selected.
It should be noted that, the implementation manner of determining the recognition performance of each candidate speech recognition model through the test set and the implementation manner of training each candidate speech recognition model through the first training set are described in step 303, and are not described herein.
804. The electronic device selects a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models.
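Selection by recognition performance then reduces to a minimum over the candidates. A trivial sketch, assuming a lower error rate means better performance (the metric is an assumption, as above):

```python
def select_best_model(candidates, error_rates):
    """Return the candidate whose error rate on the test set is lowest."""
    best_index = min(range(len(candidates)), key=lambda i: error_rates[i])
    return candidates[best_index]

# Hypothetical error rates for three alternative models.
second_model = select_best_model(["model_a", "model_b", "model_c"],
                                 [0.30, 0.12, 0.25])
```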
In one possible implementation manner, after the electronic device selects the second speech recognition model from the at least two candidate speech recognition models according to the recognition performance of the at least two candidate speech recognition models in response to obtaining the at least two candidate speech recognition models, the method further includes: the electronic device responding to the selection operation of the first feature extraction unit in the second voice recognition model, and creating a second feature extraction unit which is the same as the first feature extraction unit; and the electronic equipment adds the second characteristic extraction unit into the second voice recognition model and is connected with the first characteristic extraction unit to obtain an updated second voice recognition model. Optionally, the first feature extraction unit is any feature extraction unit in the second speech recognition model.
Optionally, the electronic device adds the second feature extraction unit to the second speech recognition model and connects it with the first feature extraction unit to obtain the updated second speech recognition model as follows: the electronic device inserts the second feature extraction unit between the first feature extraction unit and another network or unit, and the inserted second feature extraction unit is connected with its preceding and following networks or units in the same manner as the first feature extraction unit was connected with its preceding and following networks or units before the insertion. Of course, the electronic device can also add the second feature extraction unit to the second speech recognition model in other manners, which is not limited in the embodiments of the present application. Optionally, the number of second feature extraction units is any number, and each second feature extraction unit is added to the second speech recognition model in the same manner.
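Deepening the model by duplicating a selected feature extraction unit can be sketched as an insertion into the unit chain, with the copy placed directly after the original so it inherits the same neighbours. A chain-style layout is assumed here for simplicity:

```python
import copy

def deepen_model(unit_chain, index, num_copies=1):
    """Insert deep copies of unit_chain[index] directly after it,
    preserving the order of all other networks and units."""
    new_chain = list(unit_chain)
    for _ in range(num_copies):
        new_chain.insert(index + 1, copy.deepcopy(new_chain[index]))
    return new_chain
```

Each inserted copy sits between the original unit and its former successor, so the connection manner around the insertion point is unchanged, matching the behaviour described above.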
In the embodiment of the application, the depth of the second voice recognition model can be increased by adding the same feature extraction unit as the existing feature extraction unit into the obtained second voice recognition model, so that the recognition performance of the second voice recognition model is further improved.
805. The electronic device performs speech recognition based on the second speech recognition model.
In one possible implementation manner, in the process of performing speech recognition by the electronic device based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, where C is the size of the channel dimension of the speech feature, T is the size of the time dimension, F is the size of the frequency dimension, and C, T, and F are all positive integers. The process of performing speech recognition by the electronic device based on the attention network includes: the electronic device transforms the shape of the speech feature into T×Z, so that the transformed speech feature no longer includes separate channel and frequency dimensions and the feature size at each time step is Z, where Z is the product of C and F; the electronic device determines, based on the transformed speech feature, the attention weight corresponding to the speech feature, weights the transformed speech feature based on the attention weight, restores the shape of the weighted speech feature to C×T×F, and outputs the shape-restored speech feature.
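The reshape described above can be checked numerically. The sketch below uses NumPy and a constant per-time-step weight in place of a learned attention weight; the weight values and dimension sizes are purely illustrative:

```python
import numpy as np

C, T, F = 2, 4, 3
x = np.arange(C * T * F, dtype=np.float32).reshape(C, T, F)

# Merge channel and frequency into one axis of size Z = C * F,
# keeping time separate: (C, T, F) -> (T, Z).
z = x.transpose(1, 0, 2).reshape(T, C * F)

# Hypothetical attention weight, one scalar per time step.
w = np.full((T, 1), 0.5, dtype=np.float32)
weighted = z * w

# Restore the original C x T x F shape after weighting.
restored = weighted.reshape(T, C, F).transpose(1, 0, 2)
```

Because the transpose and reshape are exactly inverted after weighting, the output keeps the C×T×F layout expected by the rest of the model, while the weight itself was computed over a representation that mixes all channels and frequencies at each time step.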
In the embodiment of the application, the shape of the voice feature is firstly transformed based on the attention network, so that the transformed voice feature does not contain channel dimension and frequency dimension, when the attention weight is generated based on the transformed voice feature, the attention weight is not limited to the voice feature in the channel, but can be generated by combining the inter-channel correlation of the voice feature, so that the generated attention weight is more accurate, the accuracy of the voice feature output by the attention network is improved, and the voice recognition performance is further improved.
In one possible implementation, before the electronic device performs speech recognition based on the second speech recognition model, the method further includes: the electronic device acquires a second training set, where the second training set includes a third sample speech and a third sample text corresponding to the third sample speech; the electronic device recognizes the third sample speech based on the second speech recognition model and trains the second speech recognition model according to the error between the recognized text and the third sample text. The process of training the second speech recognition model with the second training set is the same as the process of training an alternative speech recognition model with the first training set and will not be repeated here.
In the embodiment of the application, after the second voice recognition model is obtained by searching, the second voice recognition model is trained through the second training set, so that the generalization capability of the second voice recognition model can be improved, and the recognition performance of the second voice recognition model is improved.
In the embodiment of the present application, the structure of the speech recognition model is not designed entirely by a user; instead, a plurality of candidate speech recognition models are created automatically by connecting at least one feature extraction unit with the plurality of networks existing in the first speech recognition model according to a plurality of connection manners, and a required second speech recognition model is then selected from the candidate speech recognition models according to recognition performance, so that the structure of the obtained second speech recognition model is freed from the limitations of human experience. Moreover, since the second speech recognition model includes an attention network, the second speech recognition model can use the attention mechanism to improve speech recognition performance when performing speech recognition.
In the embodiment of the application, a plurality of connection manners between the feature extraction unit and the networks in the first speech recognition model are provided, so that after at least one feature extraction unit is acquired, the plurality of networks in the first speech recognition model and the selected at least one feature extraction unit can be connected in a plurality of connection manners, yielding a plurality of alternative speech recognition models with different structures, increasing the number of alternative speech recognition models, and making it convenient to select, from among the many alternative speech recognition models, a second speech recognition model with higher recognition performance.
In the embodiment of the application, since each unit set corresponds to a combination form of feature extraction units, the feature extraction units are selected by using the unit sets, so that the combination forms of the feature extraction units selected each time can be ensured to be different, and the structure of the alternative voice recognition model constructed based on the selected feature extraction units is ensured to be different.
In the embodiment of the application, the feature extraction unit is obtained based on the feature extraction network, that is, the internal structure of the feature extraction unit is searched, that is, when the structure of the voice recognition model is searched, the embodiment of the application searches not only the macrostructure of the voice recognition model, but also the microstructure of the interior of the feature extraction unit, so that the alternative voice recognition model with more abundant structure types can be obtained, and the second voice recognition model with high voice recognition performance is conveniently selected.
In the embodiment of the application, after at least two feature extraction networks are selected from the first network set, the at least two feature extraction networks are connected in at least two connection manners, so that a plurality of feature extraction units with different structures can be obtained, expanding the structure types of the feature extraction units; searching the structure of the speech recognition model based on the plurality of feature extraction units then expands the number of alternative speech recognition models, making it convenient to select, from among the many alternative speech recognition models, a second speech recognition model with higher recognition performance.
In the embodiment of the application, the recognition performance of each alternative voice recognition model is determined by using the test set, so that the second voice recognition model with good voice recognition performance can be conveniently selected from the alternative voice recognition models, and the recognition performance of the second voice recognition model for voice recognition is ensured.
In the embodiment of the application, before the recognition performance of each candidate speech recognition model is determined, the first training set is utilized to train each candidate speech recognition model, and when the second speech recognition model is selected from the candidate speech recognition models based on the recognition performance, the second speech recognition model with strong learning ability and generalization ability can be selected.
In the embodiment of the application, the depth of the second voice recognition model can be increased by adding the same feature extraction network as the existing feature extraction network into the obtained second voice recognition model, so that the recognition performance of the second voice recognition model is further improved.
In the embodiment of the application, the shape of the voice feature is firstly transformed based on the attention network, so that the transformed voice feature does not contain channel dimension and frequency dimension, when the attention weight is generated based on the transformed voice feature, the attention weight is not limited to the voice feature in the channel, but can be generated by combining the inter-channel correlation of the voice feature, so that the generated attention weight is more accurate, the accuracy of the voice feature output by the attention network is improved, and the voice recognition performance is further improved.
In the embodiment of the application, after the second voice recognition model is obtained by searching, the second voice recognition model is trained through the second training set, so that the generalization capability of the second voice recognition model can be improved, and the recognition performance of the second voice recognition model is improved.
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
Fig. 9 is a block diagram of a voice recognition apparatus according to an embodiment of the present application. Referring to fig. 9, this embodiment includes:
a model obtaining module 91, configured to obtain a first speech recognition model, where the first speech recognition model includes an input network, a first feature extraction unit, and an output network, and connection modes among the input network, the first feature extraction unit, and the output network are determined, and the first feature extraction unit includes an attention network;
a network adding module 92, configured to add at least one feature extraction network to the first feature extraction unit at least once, and connect with an attention network to obtain an alternative speech recognition model;
the model selection module 93 is configured to select, in response to obtaining at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models according to recognition performance of the at least two candidate speech recognition models.
In a possible implementation manner, the model obtaining module 91 is configured to connect the plurality of first feature extraction units according to a bi-chain-style connection manner, a chain-style connection manner, or a dense-connected connection manner to obtain a unit chain, and to connect an input network and an output network at the two ends of the unit chain respectively to obtain the first speech recognition model.
In a possible implementation, the network adding module 92 is configured to add at least one feature extraction network to the first feature extraction unit in a different manner, and connect with the attention network to obtain different alternative speech recognition models.
In one possible implementation, the network adding module 92 is configured to add at least one feature extraction network to the first feature extraction unit, and connect with the attention network according to a bi-chain-style connection, a chain-style connection, or a dense-connected connection, to obtain an alternative speech recognition model.
In one possible implementation, the first speech recognition model includes a plurality of first feature extraction units, and a connection manner between the plurality of first feature extraction units is determined; the connection manner of the plurality of networks in the first feature extraction unit is different from the connection manner between the plurality of first feature extraction units.
In one possible implementation manner, the first speech recognition model includes N-1 first feature extraction units and N unit groups, each unit group includes M second feature extraction units, N is an integer greater than 1, M is a positive integer, and the second feature extraction units do not include an attention network. The connection manner of the networks in the first speech recognition model is as follows: the two ends of the first speech recognition model are the input network and the output network, a unit group is connected after the input network, a unit group is connected before the output network, and a first feature extraction unit is connected between every two unit groups.
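The macro layout described in this implementation (N unit groups of M units each, with N-1 attention-bearing first feature extraction units between consecutive groups) can be written out schematically; the layer labels below are hypothetical:

```python
def build_macro_layout(n, m):
    """Schematic layer order: input, then n unit groups (m second feature
    extraction units each), with one first feature extraction unit between
    each pair of consecutive groups, then output."""
    layers = ["input"]
    for i in range(n):
        layers.append(f"unit_group_{i}({m} units)")
        if i < n - 1:
            layers.append("first_feature_extraction_unit")
    layers.append("output")
    return layers
```

For any n greater than 1 this yields exactly n-1 first feature extraction units, as the implementation requires.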
In one possible implementation, referring to fig. 10, the apparatus further includes:
a model updating module 94 for creating a fourth feature extraction unit identical to the third feature extraction unit in response to a selection operation of the third feature extraction unit in the second speech recognition model; and adding the fourth feature extraction unit into the second voice recognition model, and connecting with the third feature extraction unit to obtain an updated second voice recognition model.
In one possible implementation manner, in the process of performing voice recognition based on the second voice recognition model, the shape of the voice feature input to the attention network is C×T×F, where C is the number of channel dimensions included in the voice feature, T is the number of time dimensions, F is the number of frequency dimensions, and C, T, and F are all positive integers;
the process of speech recognition based on the attention network comprises:
transforming the shape of the voice feature to T×Z, so that the transformed voice feature no longer includes a channel dimension or a frequency dimension, and the feature size in each time dimension is Z, where Z is the product of C and F;
and determining attention weights corresponding to the voice feature based on the transformed voice feature, weighting the transformed voice feature based on the attention weights, restoring the shape of the weighted voice feature to C×T×F, and outputting the voice feature with the restored shape.
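As a rough sketch of this reshape, weight, and restore flow, the following NumPy function transforms a C×T×F feature to T×Z with Z = C·F, computes one weight per time step, weights the feature, and restores the original shape. The softmax over a simple sum score is an assumed concrete choice; the embodiment only specifies that attention weights are determined from the transformed feature:

```python
import numpy as np

# Hedged sketch of the attention step described above; the scoring
# function is an assumption, not the patent's disclosed computation.

def attention_over_time(x):
    C, T, F = x.shape
    # reshape C x T x F -> T x Z, removing channel and frequency dims
    z = x.transpose(1, 0, 2).reshape(T, C * F)
    # hypothetical score per time step, turned into softmax weights
    scores = z.sum(axis=1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # weight each time step, then restore the shape to C x T x F
    weighted = z * weights[:, None]
    return weighted.reshape(T, C, F).transpose(1, 0, 2)
```

Because the transform is a pure reshape, the restored output has exactly the input shape, so the attention network can be dropped between arbitrary feature extraction networks during the architecture search.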
In one possible implementation, referring to fig. 10, the network adding module 92 includes:
a network selection sub-module 921 for selecting at least one feature extraction network from the first network set at least once;
a network adding sub-module 922, configured to add at least one feature extraction network to the first feature extraction unit, and connect with the attention network to obtain an alternative speech recognition model;
wherein the first network set includes a plurality of alternative feature extraction networks.
In one possible implementation, the network selection sub-module 921, referring to fig. 10, includes:
a number selecting unit 9211 for selecting any number from the first number range;
a network selection unit 9212, configured to select a number of feature extraction networks from the first network set.
In a possible implementation manner, the first network set includes a plurality of different second network sets, and the network selecting unit 9212 is configured to determine a plurality of second network sets corresponding to the number from the first network sets, where each second network set corresponding to the number includes a number of feature extraction networks; and selecting each feature extraction network in one second network set corresponding to the number.
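The selection procedure just described (draw any number from the first number range, then select every network in one second network set of that size) might be sketched as follows; representing the second network sets as tuples and using uniform random draws are assumptions for illustration:

```python
import random

# Hypothetical sketch of sampling feature extraction networks: the first
# network set is modeled as a list of second network sets (tuples), each
# containing some number of candidate networks.

def sample_networks(first_network_set, number_range, rng=random):
    # select any number from the first number range
    k = rng.choice(list(number_range))
    # the second network sets corresponding to that number
    candidates = [s for s in first_network_set if len(s) == k]
    # select each feature extraction network in one such set
    return list(rng.choice(candidates))
```

Each call yields one combination of networks to add to the first feature extraction unit, so repeated calls drive the creation of different alternative speech recognition models.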
In a possible implementation manner, the first speech recognition model further comprises a second feature extraction unit, the second feature extraction unit does not comprise an attention network, and connection modes among the input network, the first feature extraction unit, the second feature extraction unit and the output network are determined; the network adding module 92 is further configured to add at least one feature extraction network to the second feature extraction unit at least once, to obtain an alternative speech recognition model.
In one possible implementation, referring to fig. 10, the apparatus further includes:
a performance determining module 95, configured to obtain a test set, where the test set includes a first sample voice and a first sample text corresponding to the first sample voice; and respectively recognizing the first sample voice based on each alternative voice recognition model, and determining the recognition performance of each alternative voice recognition model according to the text obtained by recognition and the first sample text.
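One concrete way to realize the performance determining module is to score each alternative model by the word error rate between the recognized text and the first sample text; the embodiment does not fix a metric, so WER is an assumption here:

```python
# Hedged sketch: recognition performance as word error rate, computed
# via the classic dynamic-programming edit distance over word tokens.

def edit_distance(a, b):
    # Levenshtein distance between token sequences a and b
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def word_error_rate(hypothesis, reference):
    ref = reference.split()
    return edit_distance(hypothesis.split(), ref) / len(ref)
```

A lower rate means better recognition performance, which is the quantity the model selection module compares across the alternative models.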
In one possible implementation, referring to fig. 10, the apparatus further includes:
a first training module 96, configured to obtain a first training set, where the first training set includes a second sample voice and a second sample text corresponding to the second sample voice; and respectively identifying the second sample voice based on each alternative voice identification model, and training each alternative voice identification model according to the error between the text obtained by identification and the second sample text.
In one possible implementation, referring to fig. 10, the apparatus further includes:
the voice recognition module 97 is configured to perform voice recognition based on the second voice recognition model.
In one possible implementation, the apparatus further includes:
a second training module 98, configured to obtain a second training set, where the second training set includes a third sample voice and a third sample text corresponding to the third sample voice; and recognize the third sample voice based on the second voice recognition model, and train the second voice recognition model according to the error between the text obtained by recognition and the third sample text.
In the embodiment of the application, the structure of the voice recognition model is not designed entirely by the user; instead, a plurality of alternative voice recognition models are created automatically by adding feature extraction networks to the first voice recognition model, and a required second voice recognition model is then selected from the alternative voice recognition models according to recognition performance, so that the structure of the obtained second voice recognition model is free from the limitation of human experience. Moreover, because the second voice recognition model includes an attention network, it can use the attention mechanism during voice recognition to improve the recognition performance of the voice recognition model.
Fig. 11 is a block diagram of a voice recognition apparatus according to an embodiment of the present application. Referring to fig. 11, this embodiment includes:
the model obtaining module 111 is configured to obtain a first speech recognition model, where the first speech recognition model includes a plurality of networks, and a connection manner between the plurality of networks is not determined, and the plurality of networks includes an input network, an attention network, and an output network;
the unit connection module 112 is configured to connect at least one feature extraction unit with a plurality of networks in the first speech recognition model at least once according to at least two connection modes, so as to obtain at least two alternative speech recognition models;
the model selection module 113 is configured to select a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models according to recognition performance of the at least two candidate speech recognition models.
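The model selection step itself reduces to picking the alternative with the best recognition performance; a minimal sketch, assuming performance is expressed as an error rate (lower is better):

```python
# Hypothetical sketch of the model selection module: among the trained
# and evaluated alternatives, choose the one with the lowest error rate.

def select_second_model(alternatives):
    # alternatives: list of (model, error_rate) pairs
    return min(alternatives, key=lambda pair: pair[1])[0]
```

The selected model becomes the second speech recognition model used by the voice recognition module for actual recognition.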
In one possible implementation, the connection manner includes a double-chain connection manner, a chain connection manner, or a densely-connected connection manner.
In one possible implementation, referring to fig. 12, an apparatus includes:
the unit acquiring module 114 is configured to acquire at least one feature extraction unit based on a plurality of feature extraction networks, where each feature extraction unit acquired includes at least one feature extraction network.
In one possible implementation, the unit connection module 112, referring to fig. 12, includes:
a unit selection sub-module 1121 for selecting at least one feature extraction unit from the plurality of feature extraction units at least once;
the unit connection sub-module 1122 is configured to connect the plurality of networks in the first speech recognition model with the selected at least one feature extraction unit according to at least two connection modes, so as to obtain at least one alternative speech recognition model.
In one possible implementation, the unit selection sub-module 1121, referring to fig. 12, includes:
a first number selecting unit 11211 for selecting any number from the second number range;
a unit selection unit 11212 for selecting the number of feature extraction units from the plurality of feature extraction units.
In a possible implementation manner, the unit selecting unit 11212 is configured to determine a plurality of unit sets corresponding to the number, where each unit set includes a number of feature extracting units; each feature extraction unit in any unit set is selected.
In one possible implementation, the unit acquisition module 114, referring to fig. 12, includes:
a first unit obtaining submodule 1141, configured to select a feature extraction network from the first network set, and determine the feature extraction network as a feature extraction unit; or,
a second unit obtaining sub-module 1142, configured to select at least two feature extraction networks from the first network set, and connect the at least two feature extraction networks to obtain a feature extraction unit;
wherein the first network set includes a plurality of alternative feature extraction networks.
In one possible implementation, the second unit acquisition submodule 1142 includes:
a second number selecting unit 11421 for selecting any number from the first number range, the number in the first number range being not less than 2;
the network selecting unit 11422 is configured to select a number of feature extraction networks from the first network set.
In a possible implementation manner, the first network set includes a plurality of different second network sets, and the network selecting unit 11422 is configured to determine a plurality of second network sets corresponding to the number from the first network sets, where each second network set corresponding to the number includes a number of feature extraction networks; and selecting each feature extraction network in one second network set corresponding to the number.
In one possible implementation, the second unit obtaining sub-module 1142 is configured to connect at least two feature extraction networks in at least two connection manners, to obtain at least two feature extraction units.
In one possible implementation, the connection manner between the at least two feature extraction networks includes a double-chain connection manner, a chain connection manner, or a densely-connected connection manner.
In one possible implementation, referring to fig. 12, the apparatus further includes:
a performance determining module 115, configured to obtain a test set, where the test set includes a first sample voice and a first sample text corresponding to the first sample voice; and respectively recognizing the first sample voice based on each alternative voice recognition model, and determining the recognition performance of each alternative voice recognition model according to the text obtained by recognition and the first sample text.
In one possible implementation, referring to fig. 12, the apparatus further includes:
a first training module 116, configured to obtain a first training set, where the first training set includes a second sample voice and a second sample text corresponding to the second sample voice; and respectively identifying the second sample voice based on each alternative voice identification model, and training each alternative voice identification model according to the error between the text obtained by identification and the second sample text.
In one possible implementation, referring to fig. 12, the apparatus further includes:
a voice recognition module 117, configured to perform voice recognition based on the second voice recognition model.
In one possible implementation, the apparatus further includes:
a second training module 118, configured to obtain a second training set, where the second training set includes a third sample voice and a third sample text corresponding to the third sample voice; and recognize the third sample voice based on the second voice recognition model, and train the second voice recognition model according to the error between the text obtained by recognition and the third sample text.
In one possible implementation, the apparatus further includes:
a model updating module 119 for creating a second feature extraction unit identical to the first feature extraction unit in response to a selection operation of the first feature extraction unit in the second speech recognition model; and adding the second feature extraction unit into the second voice recognition model, and connecting the second feature extraction unit with the first feature extraction unit to obtain an updated second voice recognition model.
In one possible implementation manner, in the process of performing voice recognition based on the second voice recognition model, the shape of the voice feature input to the attention network is C×T×F, where C is the number of channel dimensions included in the voice feature, T is the number of time dimensions, F is the number of frequency dimensions, and C, T, and F are all positive integers;
The process of speech recognition based on the attention network comprises:
transforming the shape of the voice feature to T×Z, so that the transformed voice feature no longer includes a channel dimension or a frequency dimension, and the feature size in each time dimension is Z, where Z is the product of C and F;
and determining attention weights corresponding to the voice feature based on the transformed voice feature, weighting the transformed voice feature based on the attention weights, restoring the shape of the weighted voice feature to C×T×F, and outputting the voice feature with the restored shape.
In the embodiment of the present application, the structure of the speech recognition model is not designed entirely by the user; instead, a plurality of candidate speech recognition models are created automatically by connecting at least one feature extraction unit with the plurality of networks in the first speech recognition model according to a plurality of connection manners, and a required second speech recognition model is then selected from the candidate speech recognition models according to recognition performance, so that the structure of the obtained second speech recognition model is free from the limitation of human experience. Moreover, because the second speech recognition model includes an attention network, it can use the attention mechanism during speech recognition to improve the recognition performance of the speech recognition model.
It should be noted that, when the voice recognition device provided in the above embodiment performs voice recognition, the division into the above functional modules is merely used as an example; in practical application, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above. In addition, the voice recognition device and the voice recognition method provided in the foregoing embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not described herein again.
The embodiment of the application also provides an electronic device, which comprises a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor to realize the operations executed in the voice recognition method of the embodiment.
Optionally, the electronic device is provided as a terminal. Fig. 13 shows a block diagram of a terminal 1300 according to an exemplary embodiment of the present application. The terminal 1300 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1300 may also be referred to by other names such as user device, portable terminal, laptop terminal, or desktop terminal.
The terminal 1300 includes: a processor 1301, and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor or an 8-core processor. Processor 1301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). In some embodiments, the processor 1301 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. Memory 1302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1302 is used to store at least one computer program for execution by processor 1301 to implement the speech recognition methods provided by the method embodiments herein.
In some embodiments, the terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. The processor 1301, the memory 1302, and the peripheral interface 1303 may be connected by a bus or signal lines. The respective peripheral devices may be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board. Specifically, the peripheral device includes: at least one of an audio circuit 1304 and a power supply 1305.
A peripheral interface 1303 may be used to connect at least one I/O (Input/Output) related peripheral to the processor 1301 and the memory 1302. In some embodiments, the processor 1301, the memory 1302, and the peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The audio circuit 1304 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the audio circuit 1304 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal 1300, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is then used to convert electrical signals from the processor 1301 or the audio circuit 1304 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 1304 may also include a headphone jack.
A power supply 1305 is used to power the various components in terminal 1300. The power source 1305 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 1305 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the structure shown in fig. 13 is not limiting of terminal 1300 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Optionally, the electronic device is provided as a server. Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1400 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 1401 and one or more memories 1402, where at least one computer program is stored in the memory 1402 and is loaded and executed by the processor 1401 to implement the voice recognition method provided in the above-mentioned method embodiments. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein again.
The present application also provides a computer readable storage medium having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to implement the operations performed in the speech recognition method of the above embodiments.
Embodiments of the present application also provide a computer program product or a computer program, which includes a computer program stored in a computer-readable storage medium. A processor of the electronic device reads the computer program from the computer-readable storage medium and executes it, causing the electronic device to perform the operations performed in the speech recognition method in the various alternative implementations described above.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is merely illustrative of the present invention and is not intended to limit it; any modification made within the spirit and principles of the present invention shall be included in its protection scope.

Claims (25)

1. A method of speech recognition, the method comprising:
acquiring a first voice recognition model, wherein the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection modes among the input network, the first feature extraction unit and the output network are determined, and the first feature extraction unit comprises an attention network;
adding at least one feature extraction network into the first feature extraction unit at least twice, and connecting with the attention network to obtain an alternative voice recognition model;
in response to obtaining at least two alternative speech recognition models, selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to recognition performance of the at least two alternative speech recognition models.
2. The method of claim 1, wherein the obtaining a first speech recognition model comprises:
connecting a plurality of first feature extraction units in a double-chain connection manner, a chain connection manner, or a densely-connected connection manner to obtain a unit chain;
And respectively connecting the input network and the output network at two ends of the unit chain to obtain the first voice recognition model.
3. The method according to claim 1, wherein said adding at least one feature extraction network into said first feature extraction unit at least twice and connecting with said attention network results in an alternative speech recognition model, comprising:
and adding the at least one feature extraction network into the first feature extraction unit in different modes, and connecting with the attention network to obtain different alternative voice recognition models.
4. The method according to claim 1, wherein said adding at least one feature extraction network to said first feature extraction unit and connecting to said attention network results in an alternative speech recognition model, comprising:
and adding the at least one feature extraction network into the first feature extraction unit, and connecting it with the attention network in a double-chain connection manner, a chain connection manner, or a densely-connected connection manner, to obtain the alternative voice recognition model.
5. The method according to claim 1, wherein the first speech recognition model includes a plurality of the first feature extraction units, and a connection manner between the plurality of the first feature extraction units is determined; the connection mode of a plurality of networks in the first feature extraction unit is different from the connection mode of a plurality of first feature extraction units.
6. The method of claim 1, wherein the first speech recognition model includes N-1 first feature extraction units and N unit groups, each unit group including M second feature extraction units, N being an integer greater than 1, M being a positive integer, the second feature extraction units not including the attention network, and the networks in the first speech recognition model being connected in the following manner: the two ends of the first speech recognition model are the input network and the output network, one unit group is connected after the input network, one unit group is connected before the output network, and one first feature extraction unit is connected between every two unit groups.
7. The method according to any of claims 1-6, wherein in response to obtaining at least two alternative speech recognition models, after selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models, the method further comprises:
Creating a fourth feature extraction unit identical to the third feature extraction unit in response to a selection operation of the third feature extraction unit in the second speech recognition model;
and adding the fourth feature extraction unit into the second voice recognition model, and connecting the fourth feature extraction unit with the third feature extraction unit to obtain the updated second voice recognition model.
8. The method according to any one of claims 1-6, wherein during speech recognition based on the second speech recognition model, the shape of a speech feature input to the attention network is C×T×F, where C is the number of channel dimensions included in the speech feature, T is the number of time dimensions, F is the number of frequency dimensions, and all of C, T, and F are positive integers;
the process of speech recognition based on the attention network comprises:
transforming the shape of the speech feature to T×Z, so that the transformed speech feature no longer includes a channel dimension or a frequency dimension, and the feature size in each time dimension is Z, wherein Z is the product of C and F;
and determining attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature based on the attention weights, restoring the shape of the weighted speech feature to C×T×F, and outputting the speech feature with the restored shape.
9. The method according to any of the claims 1-6, wherein said at least twice adding at least one feature extraction network to said first feature extraction unit and connecting with said attention network, resulting in an alternative speech recognition model, comprises:
at least two times, selecting at least one feature extraction network from a first network set, adding the at least one feature extraction network into the first feature extraction unit, and connecting with the attention network to obtain an alternative voice recognition model;
wherein the first network set includes a plurality of alternative feature extraction networks.
10. The method of claim 9, wherein the first set of networks comprises a plurality of second, different sets of networks, the selecting at least one feature extraction network from the first set of networks comprising:
selecting any number from the first number range;
determining a plurality of second network sets corresponding to the number from the first network sets, wherein each second network set corresponding to the number comprises the number of feature extraction networks;
and selecting each feature extraction network in one second network set corresponding to the number.
11. The method according to any of claims 1-6, wherein the first speech recognition model further comprises a second feature extraction unit, the second feature extraction unit not comprising the attention network, the manner of connection between the input network, the first feature extraction unit, the second feature extraction unit and the output network being determined;
the method further comprises the steps of:
at least one feature extraction network is added to the second feature extraction unit at least once, resulting in an alternative speech recognition model.
12. The method according to any of claims 1-6, wherein before selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models, the method further comprises:
acquiring a test set, wherein the test set comprises first sample voice and first sample text corresponding to the first sample voice;
and respectively identifying the first sample voice based on each alternative voice identification model, and determining the identification performance of each alternative voice identification model according to the identified text and the first sample text.
13. The method according to any of claims 1-6, wherein before selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models, the method further comprises:
acquiring a first training set, wherein the first training set comprises a second sample voice and a second sample text corresponding to the second sample voice;
and respectively identifying the second sample voice based on each alternative voice identification model, and training each alternative voice identification model according to the error between the identified text and the second sample text.
14. A speech recognition apparatus, the apparatus comprising:
a model obtaining module, configured to obtain a first speech recognition model, wherein the first speech recognition model comprises an input network, a first feature extraction unit and an output network, connection modes among the input network, the first feature extraction unit and the output network are determined, and the first feature extraction unit comprises an attention network;
a network adding module, configured to add, at least twice, at least one feature extraction network into the first feature extraction unit and connect it with the attention network to obtain an alternative speech recognition model; and
a model selection module, configured to, in response to obtaining at least two alternative speech recognition models, select a second speech recognition model for performing speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models.
15. The apparatus of claim 14, wherein the model obtaining module is configured to connect a plurality of the first feature extraction units according to a double-chain connection, a chain connection, or a dense connection to obtain a unit chain; and to connect the input network and the output network at the two ends of the unit chain respectively, so as to obtain the first speech recognition model.
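The unit-chain construction of claim 15 contrasts chain-style and dense-style connections between feature extraction units. A minimal sketch, assuming each unit is a callable on its input and using summation as a stand-in for the concatenation-and-projection a real dense connection would apply (`chain_forward` and `dense_forward` are hypothetical names):

```python
def chain_forward(units, x):
    # Chain connection: each unit consumes only the previous unit's output
    for unit in units:
        x = unit(x)
    return x

def dense_forward(units, x):
    # Dense connection: each unit consumes a combination of the model input
    # and the outputs of all preceding units (summation stands in for the
    # concatenation a real dense block would use)
    outputs = [x]
    for unit in units:
        outputs.append(unit(sum(outputs)))
    return outputs[-1]
```

With two toy units `v + 1` and `v * 2` and input `1`, the chain gives `(1 + 1) * 2 = 4`, while the dense variant feeds the running sum into each unit and gives `6`, illustrating how the connection mode changes the computation even with identical units.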
16. The apparatus according to claim 14, wherein the network adding module is configured to add the at least one feature extraction network to the first feature extraction unit in different manners and connect it to the attention network, so as to obtain different alternative speech recognition models.
17. The apparatus according to claim 14, wherein the network adding module is configured to add the at least one feature extraction network to the first feature extraction unit and connect it with the attention network according to a double-chain connection or a dense connection manner, so as to obtain the alternative speech recognition model.
18. The apparatus according to any one of claims 14-17, wherein the apparatus further comprises:
a model updating module, configured to create a fourth feature extraction unit identical to a third feature extraction unit in response to a selection operation on the third feature extraction unit in the second speech recognition model; add the fourth feature extraction unit into the second speech recognition model; and connect the fourth feature extraction unit with the third feature extraction unit, so as to obtain an updated second speech recognition model.
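The model-updating step of claim 18 (clone a selected unit and splice the copy in right after it) reduces to a deep copy and an insertion. A minimal sketch with a hypothetical `duplicate_unit` helper, treating the model as an ordered list of units:

```python
import copy

def duplicate_unit(model_units, index):
    # Create a fourth unit identical to the selected third unit at `index`
    # and insert it immediately after, so the two are connected in sequence
    new_unit = copy.deepcopy(model_units[index])
    return model_units[:index + 1] + [new_unit] + model_units[index + 1:]
```

`copy.deepcopy` matters here: a shallow copy would share parameters with the original unit, whereas the claim calls for a separate, identical unit.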
19. The apparatus according to any of claims 14-17, wherein the network addition module comprises:
a network selection sub-module, configured to select at least one feature extraction network from the first network set at least twice;
a network adding sub-module, configured to add the at least one feature extraction network to the first feature extraction unit, and connect with the attention network to obtain an alternative speech recognition model;
wherein the first network set includes a plurality of alternative feature extraction networks.
20. The apparatus of claim 19, wherein the first network set comprises a plurality of different second network sets, and the network selection sub-module comprises:
a number selection unit, configured to select any number from a first number range;
a network selection unit, configured to determine, from the first network set, a plurality of second network sets corresponding to the number, wherein each second network set corresponding to the number includes that number of feature extraction networks; and to select each feature extraction network in one second network set corresponding to the number.
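The two-step selection of claim 20 (pick a number from the first number range, then pick a second network set containing that many feature extraction networks) can be sketched with standard random sampling. An illustrative sketch only: `select_feature_networks` and the example network names are hypothetical, and drawing a random sample stands in for however the second network sets are actually enumerated.

```python
import random

def select_feature_networks(first_network_set, number_range, seed=None):
    rng = random.Random(seed)
    # Step 1: select any number from the first number range (inclusive bounds)
    low, high = number_range
    number = rng.randint(low, high)
    # Step 2: select one second network set containing exactly `number`
    # distinct feature extraction networks from the first network set
    return rng.sample(sorted(first_network_set), number)
```

Repeating this at least twice yields the differently composed candidate models among which the second speech recognition model is later selected.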
21. The apparatus according to any of claims 14-17, wherein the first speech recognition model further comprises a second feature extraction unit, the second feature extraction unit not comprising the attention network, the manner of connection between the input network, the first feature extraction unit, the second feature extraction unit and the output network being determined;
the network adding module is further configured to add at least one feature extraction network to the second feature extraction unit at least once to obtain an alternative speech recognition model.
22. The apparatus according to any one of claims 14-17, wherein the apparatus further comprises:
a performance determining module, configured to acquire a test set, wherein the test set comprises a first sample speech and a first sample text corresponding to the first sample speech; and to recognize the first sample speech based on each alternative speech recognition model respectively, and determine the recognition performance of each alternative speech recognition model according to the recognized text and the first sample text.
23. The apparatus according to any one of claims 14-17, wherein the apparatus further comprises:
a first training module, configured to acquire a first training set, wherein the first training set comprises a second sample speech and a second sample text corresponding to the second sample speech; and to recognize the second sample speech based on each alternative speech recognition model respectively, and train each alternative speech recognition model according to the error between the recognized text and the second sample text.
24. An electronic device comprising a processor and a memory, wherein the memory stores at least one computer program that is loaded and executed by the processor to perform the operations performed by the speech recognition method of any one of claims 1 to 13.
25. A computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to perform the operations performed by the speech recognition method of any one of claims 1 to 13.
CN202110668257.4A 2021-06-16 2021-06-16 Speech recognition method, device, equipment and storage medium Active CN113838466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110668257.4A CN113838466B (en) 2021-06-16 2021-06-16 Speech recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113838466A CN113838466A (en) 2021-12-24
CN113838466B true CN113838466B (en) 2024-02-06

Family

ID=78962673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110668257.4A Active CN113838466B (en) 2021-06-16 2021-06-16 Speech recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113838466B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231066A (en) * 2016-12-13 2018-06-29 财团法人工业技术研究院 Speech recognition system and method thereof and vocabulary establishing method
CN110189748A (en) * 2019-05-31 2019-08-30 百度在线网络技术(北京)有限公司 Model building method and device
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN111862956A (en) * 2020-07-27 2020-10-30 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN111968635A (en) * 2020-08-07 2020-11-20 北京小米松果电子有限公司 Speech recognition method, device and storage medium
CN112562648A (en) * 2020-12-10 2021-03-26 平安科技(深圳)有限公司 Adaptive speech recognition method, apparatus, device and medium based on meta learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11538463B2 (en) * 2019-04-12 2022-12-27 Adobe Inc. Customizable speech recognition system

Similar Documents

Publication Publication Date Title
CN110415686B (en) Voice processing method, device, medium and electronic equipment
CN111309883B (en) Man-machine dialogue method based on artificial intelligence, model training method and device
CN105976812B (en) A kind of audio recognition method and its equipment
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
Guo et al. Attention Based CLDNNs for Short-Duration Acoustic Scene Classification.
CN110457457B (en) Training method of dialogue generation model, dialogue generation method and device
CN110838286A (en) Model training method, language identification method, device and equipment
US20200075024A1 (en) Response method and apparatus thereof
CN110853618A (en) Language identification method, model training method, device and equipment
CN107147618A (en) A kind of user registering method, device and electronic equipment
CN108920666A (en) Searching method, system, electronic equipment and storage medium based on semantic understanding
KR102276951B1 (en) Output method for artificial intelligence speakers based on emotional values calculated from voice and face
CN107767861A (en) voice awakening method, system and intelligent terminal
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
CN111444382B (en) Audio processing method and device, computer equipment and storage medium
CN111344717B (en) Interactive behavior prediction method, intelligent device and computer readable storage medium
CN113436609B (en) Voice conversion model, training method thereof, voice conversion method and system
CN115602165B (en) Digital employee intelligent system based on financial system
CN111357051A (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN113314119A (en) Voice recognition intelligent household control method and device
CN117059068A (en) Speech processing method, device, storage medium and computer equipment
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113838466B (en) Speech recognition method, device, equipment and storage medium
CN112580669A (en) Training method and device for voice information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant