CN113838466A - Speech recognition method, apparatus, device, and storage medium

Speech recognition method, apparatus, device, and storage medium

Info

Publication number: CN113838466A (application CN202110668257.4A)
Authority: CN (China)
Prior art keywords: feature extraction, network, speech recognition, recognition model, voice
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113838466B (en)
Inventors: 苏丹 (Su Dan), 贺利强 (He Liqiang)
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202110668257.4A; granted and published as CN113838466B.

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; Speech recognition; Speech or voice processing techniques; Speech or audio coding or decoding
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26: Speech to text systems

Abstract

The present application provides a speech recognition method, apparatus, device, and storage medium, belonging to the field of computer technology. The method comprises the following steps: acquiring a first speech recognition model, where the first speech recognition model comprises an input network, a first feature extraction unit, and an output network, and the first feature extraction unit comprises an attention network; adding at least one feature extraction network to the first feature extraction unit at least once, and connecting it with the attention network, to obtain an alternative speech recognition model; and selecting, according to the recognition performance of the at least two alternative speech recognition models obtained, a second speech recognition model for speech recognition from among them. The structure of the second speech recognition model obtained in this way is freed from the limitations of human experience, because the required model is selected according to recognition performance. Moreover, the second speech recognition model can use the attention mechanism to improve speech recognition performance.

Description

Speech recognition method, apparatus, device, and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech recognition.
Background
Speech recognition is a technology that converts speech into text through recognition and parsing. In related speech recognition techniques, recognition is usually performed based on a speech recognition model, which requires that the model be constructed first.
When constructing a speech recognition model, a technician usually determines the structure of the model manually and then trains the model according to the determined structure. The structure of the speech recognition model is therefore limited by human experience, which may result in poor recognition performance.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method, apparatus, device, and storage medium that can improve the speech recognition performance of a speech recognition model. The technical solution is as follows:
in one aspect, a speech recognition method is provided, and the method includes:
acquiring a first voice recognition model, wherein the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection modes among the input network, the first feature extraction unit and the output network are determined, and the first feature extraction unit comprises an attention network;
adding at least one feature extraction network to the first feature extraction unit at least once, and connecting it with the attention network, to obtain an alternative speech recognition model; and
in response to at least two alternative speech recognition models being obtained, selecting, according to the recognition performance of the at least two alternative speech recognition models, a second speech recognition model for performing speech recognition from among them.
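The three steps above amount to an architecture-search loop. The following is a minimal Python sketch of that loop; the with_added_networks method, the candidate-network list, and the evaluation function are illustrative assumptions, not part of the claimed method.

    import random

    def search_second_model(first_model, candidate_networks, evaluate, rounds=8):
        # Hypothetical sketch: each round adds at least one feature
        # extraction network to the first model's feature extraction unit
        # (connected with its attention network) to obtain one alternative
        # model; the best-performing alternative is then kept.
        alternatives = []
        for _ in range(rounds):
            k = random.randint(1, min(3, len(candidate_networks)))
            nets = random.sample(candidate_networks, k=k)
            alternatives.append(first_model.with_added_networks(nets))
        # Select the second speech recognition model by recognition performance.
        return max(alternatives, key=evaluate)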
In a possible implementation manner, the at least one time adding at least one feature extraction network to the first feature extraction unit and connecting with the attention network to obtain an alternative speech recognition model includes:
at least one feature extraction network is selected from a first network set at least once, the at least one feature extraction network is added into the first feature extraction unit and is connected with the attention network, and an alternative voice recognition model is obtained;
wherein the first network set comprises a plurality of candidate feature extraction networks.
In a possible implementation manner, the selecting at least one feature extraction network from the first network set includes:
selecting any quantity from a first quantity range; and
selecting that quantity of feature extraction networks from the first network set.
In a possible implementation manner, the selecting that quantity of feature extraction networks from the first network set includes:
determining, from the first network set, a plurality of second network sets corresponding to the quantity, where each second network set corresponding to the quantity includes that quantity of feature extraction networks; and
selecting each feature extraction network in one of the second network sets corresponding to the quantity.
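As a sketch, these two selection steps (pick a quantity from the first quantity range, then pick that many networks) map onto standard random sampling; the range bounds and the network labels below are illustrative assumptions:

    import random

    def sample_feature_networks(first_network_set, quantity_range=(1, 3)):
        # select any quantity from the first quantity range
        quantity = random.randint(*quantity_range)
        # each size-`quantity` subset of the first network set is one
        # "second network set"; sampling picks one such subset
        # (assumes the set holds at least `quantity` networks)
        return random.sample(first_network_set, k=quantity)

    chosen = sample_feature_networks(["conv1x1", "conv3x3", "maxpool", "avgpool"])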
In a possible implementation, the first speech recognition model further includes a second feature extraction unit, the second feature extraction unit does not include the attention network, and connection manners between the input network, the first feature extraction unit, the second feature extraction unit, and the output network are determined;
the method further comprises the following steps: and adding at least one feature extraction network to the second feature extraction unit at least once to obtain an alternative voice recognition model.
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
obtaining a test set, wherein the test set comprises a first sample voice and a first sample text corresponding to the first sample voice;
and respectively recognizing the first sample voice based on each alternative voice recognition model, and determining the recognition performance of each alternative voice recognition model according to the recognized text and the first sample text.
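A minimal sketch of this evaluation, assuming each model exposes a recognize method and the test set is a list of (first sample voice, first sample text) pairs (both assumptions):

    def recognition_accuracy(model, test_set):
        # Recognition performance here is sentence-level accuracy; the
        # embodiments also allow other measures, such as efficiency.
        correct = sum(1 for voice, text in test_set
                      if model.recognize(voice) == text)
        return correct / len(test_set) if test_set else 0.0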
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
acquiring a first training set, wherein the first training set comprises second sample voice and a second sample text corresponding to the second sample voice;
and respectively recognizing the second sample voice based on each alternative voice recognition model, and training each alternative voice recognition model according to the error between the recognized text and the second sample text.
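A sketch of this training step in PyTorch; the loss function stands in for "the error between the recognized text and the second sample text", which the embodiments do not pin down (cross-entropy or CTC would be typical choices):

    import torch

    def train_alternative(model, first_training_set, loss_fn, lr=1e-3, epochs=1):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for second_sample_voice, second_sample_text in first_training_set:
                optimizer.zero_grad()
                recognized = model(second_sample_voice)         # recognize the sample voice
                loss = loss_fn(recognized, second_sample_text)  # error vs. the sample text
                loss.backward()
                optimizer.step()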
In a possible implementation manner, after the selecting the second speech recognition model from the at least two candidate speech recognition models, the method further includes:
and performing voice recognition based on the second voice recognition model.
In a possible implementation manner, before performing the speech recognition based on the second speech recognition model, the method further includes:
acquiring a second training set, wherein the second training set comprises third sample voice and a third sample text corresponding to the third sample voice;
and respectively recognizing the third sample voice based on each alternative voice recognition model, and training the second voice recognition model according to the error between the recognized text and the third sample text.
In one aspect, a speech recognition method is provided, and the method includes:
acquiring a first voice recognition model, wherein the first voice recognition model comprises a plurality of networks, the connection modes among the plurality of networks are not determined, and the plurality of networks comprise an input network, an attention network and an output network;
at least one feature extraction unit is connected with a plurality of networks in the first voice recognition model according to at least two connection modes at least once to obtain at least two alternative voice recognition models;
and selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models.
In a possible implementation manner, the at least one feature extraction unit is connected to the plurality of networks in the first speech recognition model according to at least two connection manners at least once to obtain at least two candidate speech recognition models, including:
and selecting at least one feature extraction unit from the plurality of feature extraction units at least once, and connecting the plurality of networks in the first speech recognition model with the selected at least one feature extraction unit in at least two connection manners to obtain the at least two alternative speech recognition models.
In one possible implementation manner, the selecting at least one feature extraction unit from the plurality of feature extraction units includes:
selecting any number from the second number range;
selecting the number of feature extraction units from the plurality of feature extraction units.
In a possible implementation manner, the selecting the number of feature extraction units from the plurality of feature extraction units includes:
determining a plurality of unit sets corresponding to the number, wherein each unit set comprises the number of feature extraction units;
and selecting each feature extraction unit in any unit set.
In one possible implementation manner, the obtaining at least one feature extraction unit based on a plurality of feature extraction networks includes:
selecting one feature extraction network from a first network set, and determining that feature extraction network as the feature extraction unit; or,
selecting at least two feature extraction networks from the first network set, and connecting the at least two feature extraction networks to obtain the feature extraction unit;
wherein the first network set comprises a plurality of candidate feature extraction networks.
In a possible implementation manner, the selecting at least two feature extraction networks from the first network set includes:
selecting any quantity from a first quantity range, wherein the quantity in the first quantity range is not less than 2;
selecting the number of feature extraction networks from the first network set.
In a possible implementation manner, the selecting the number of feature extraction networks from the first network set includes:
determining a plurality of second network sets corresponding to the number from the first network set, wherein each second network set corresponding to the number comprises the number of feature extraction networks;
and selecting each feature extraction network in one second network set corresponding to the number.
In a possible implementation manner, the connecting the at least two feature extraction networks to obtain the feature extraction unit includes:
and connecting the at least two feature extraction networks in at least two connection modes to obtain at least two feature extraction units.
In a possible implementation manner, the connection manner between the at least two feature extraction networks includes a double-chain (bi-chain-styled) manner, a chain (chain-styled) manner, or a dense (densely-connected) manner.
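The exact wiring of these three connection manners is not spelled out; a plausible reading is sketched below, with chain-styled as sequential composition and densely-connected as each network seeing all earlier outputs (summation is used here for shape simplicity; DenseNet-style concatenation is the other common reading). Bi-chain-styled would run two such chains in parallel and merge their outputs.

    def connect_chain(features, networks):
        # chain-styled: each feature extraction network feeds the next
        out = features
        for net in networks:
            out = net(out)
        return out

    def connect_dense(features, networks):
        # densely-connected: each network receives all earlier outputs
        outputs = [features]
        for net in networks:
            outputs.append(net(sum(outputs)))
        return outputs[-1]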
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
obtaining a test set, wherein the test set comprises a first sample voice and a first sample text corresponding to the first sample voice;
and respectively recognizing the first sample voice based on each alternative voice recognition model, and determining the recognition performance of each alternative voice recognition model according to the recognized text and the first sample text.
In a possible implementation manner, before the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
acquiring a first training set, wherein the first training set comprises second sample voice and a second sample text corresponding to the second sample voice;
and respectively recognizing the second sample voice based on each alternative voice recognition model, and training each alternative voice recognition model according to the error between the recognized text and the second sample text.
In a possible implementation manner, after the selecting the second speech recognition model from the at least two candidate speech recognition models, the method further includes:
and performing voice recognition based on the second voice recognition model.
In a possible implementation manner, before performing the speech recognition based on the second speech recognition model, the method further includes:
acquiring a second training set, wherein the second training set comprises third sample voice and a third sample text corresponding to the third sample voice;
and respectively recognizing the third sample voice based on each alternative voice recognition model, and training the second voice recognition model according to the error between the recognized text and the third sample text.
In a possible implementation manner, after the selecting, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models, the method further includes:
creating a second feature extraction unit identical to the first feature extraction unit in response to a selection operation of the first feature extraction unit in the second speech recognition model;
and adding the second feature extraction unit into the second voice recognition model, and connecting the second feature extraction unit with the first feature extraction unit to obtain the updated second voice recognition model.
In a possible implementation manner, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C × T × F, indicating that the speech feature has C channels, T time steps, and F frequency bins, where C, T, and F are positive integers;
the process of performing speech recognition based on the attention network includes:
transforming the shape of the speech feature to T × Z, so that the transformed speech feature no longer contains the channel and frequency dimensions and the feature size at each time step is Z, where Z is the product of C and F; and
determining attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature with the attention weights, restoring the shape of the weighted speech feature to C × T × F, and outputting the shape-restored speech feature.
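This reshape-attend-restore procedure maps directly onto tensor operations. Below is a minimal PyTorch sketch; the linear scoring layer is an assumption, since the embodiments do not specify how the attention weights are computed.

    import torch
    import torch.nn as nn

    class TimeAttention(nn.Module):
        """Sketch: flatten C x T x F to T x Z with Z = C * F, weight each
        time step by an attention weight, then restore C x T x F."""
        def __init__(self, c, f):
            super().__init__()
            self.score = nn.Linear(c * f, 1)   # assumed weight computation

        def forward(self, x):                  # x has shape (C, T, F)
            c, t, f = x.shape
            z = x.permute(1, 0, 2).reshape(t, c * f)      # shape (T, Z)
            weights = torch.softmax(self.score(z), dim=0)  # one weight per time step
            z = z * weights                                # weighting
            return z.reshape(t, c, f).permute(1, 0, 2)     # back to (C, T, F)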
In another aspect, a speech recognition apparatus is provided, the apparatus comprising:
the model acquisition module is used for acquiring a first voice recognition model, the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection mode among the input network, the first feature extraction unit and the output network is determined, and the first feature extraction unit comprises an attention network;
the network adding module is used for adding at least one feature extraction network to the first feature extraction unit at least once and connecting with the attention network to obtain an alternative voice recognition model;
and the model selection module is used for responding to the obtained at least two alternative voice recognition models and selecting a second voice recognition model for voice recognition from the at least two alternative voice recognition models according to the recognition performance of the at least two alternative voice recognition models.
In a possible implementation manner, the model obtaining module is configured to connect the plurality of first feature extraction units in a double-chain (bi-chain-styled), chain (chain-styled), or dense (densely-connected) manner to obtain a unit chain, and to connect the input network and the output network at the two ends of the unit chain, respectively, to obtain the first speech recognition model.
In a possible implementation manner, the network adding module is configured to add the at least one feature extraction network to the first feature extraction unit in a different manner, and connect with the attention network to obtain different candidate speech recognition models.
In a possible implementation manner, the network adding module is configured to add the at least one feature extraction network to the first feature extraction unit, and to connect the at least one feature extraction network and the attention network in a double-chain (bi-chain-styled), chain (chain-styled), or dense (densely-connected) manner, so as to obtain the candidate speech recognition model.
In a possible implementation manner, the first speech recognition model comprises a plurality of first feature extraction units, and the connection manner between the first feature extraction units is determined; the connection mode of a plurality of networks in the first feature extraction unit is different from the connection mode of a plurality of first feature extraction units.
In one possible implementation manner, the first speech recognition model includes N-1 first feature extraction units and N unit groups, each unit group includes M second feature extraction units, where N is an integer greater than 1, M is a positive integer, the second feature extraction unit does not include the attention network, and a connection manner of the network in the first speech recognition model is: the two ends of the first speech recognition model are the input network and the output network, one unit group is connected behind the input network, one unit group is connected in front of the output network, and one first feature extraction unit is connected between every two unit groups.
In one possible implementation, the apparatus further includes:
a model updating module for creating a fourth feature extraction unit identical to the third feature extraction unit in response to a selection operation of the third feature extraction unit in the second speech recognition model; and adding the fourth feature extraction unit into the second voice recognition model, and connecting the fourth feature extraction unit with the third feature extraction unit to obtain the updated second voice recognition model.
In a possible implementation manner, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C × T × F, indicating that the speech feature has C channels, T time steps, and F frequency bins, where C, T, and F are positive integers;
the process of performing speech recognition based on the attention network includes:
transforming the shape of the speech feature to T × Z, so that the transformed speech feature no longer contains the channel and frequency dimensions and the feature size at each time step is Z, where Z is the product of C and F; and
determining attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature with the attention weights, restoring the shape of the weighted speech feature to C × T × F, and outputting the shape-restored speech feature.
In one possible implementation manner, the network adding module includes:
the network selection submodule is used for selecting at least one characteristic extraction network from the first network set at least once;
the network adding submodule is used for adding the at least one feature extraction network into the first feature extraction unit and connecting the at least one feature extraction network with the attention network to obtain an alternative voice recognition model;
wherein the first network set comprises a plurality of candidate feature extraction networks.
In a possible implementation manner, the network selection sub-module includes:
a quantity selecting unit for selecting any quantity from the first quantity range;
and the network selecting unit is used for selecting the number of the feature extraction networks from the first network set.
In a possible implementation manner, the first network set includes a plurality of different second network sets, and the network selecting unit is configured to determine, from the first network set, a plurality of second network sets corresponding to the number, where each second network set corresponding to the number includes the number of feature extraction networks; and selecting each feature extraction network in one second network set corresponding to the number.
In a possible implementation, the first speech recognition model further includes a second feature extraction unit, the second feature extraction unit does not include the attention network, and connection manners between the input network, the first feature extraction unit, the second feature extraction unit, and the output network are determined; the network adding module is further configured to add at least one feature extraction network to the second feature extraction unit at least once to obtain an alternative speech recognition model.
In one possible implementation, the apparatus further includes:
the performance determination module is used for acquiring a test set, wherein the test set comprises a first sample voice and a first sample text corresponding to the first sample voice; and respectively recognizing the first sample voice based on each alternative voice recognition model, and determining the recognition performance of each alternative voice recognition model according to the recognized text and the first sample text.
In one possible implementation, the apparatus further includes:
the first training module is used for acquiring a first training set, wherein the first training set comprises second sample voices and second sample texts corresponding to the second sample voices; and respectively recognizing the second sample voice based on each alternative voice recognition model, and training each alternative voice recognition model according to the error between the recognized text and the second sample text.
In one possible implementation, the apparatus further includes:
and the voice recognition module is used for carrying out voice recognition based on the second voice recognition model.
In one possible implementation, the apparatus further includes:
the second training module is used for acquiring a second training set, and the second training set comprises third sample voice and a third sample text corresponding to the third sample voice; and respectively recognizing the third sample voice based on each alternative voice recognition model, and training the second voice recognition model according to the error between the recognized text and the third sample text.
In another aspect, a speech recognition apparatus is provided, the apparatus comprising:
the model acquisition module is used for acquiring a first voice recognition model, wherein the first voice recognition model comprises a plurality of networks, the connection modes among the plurality of networks are not determined, and the plurality of networks comprise an input network, an attention network and an output network;
the network connection module is used for connecting at least one feature extraction unit with a plurality of networks in the first voice recognition model according to at least two connection modes at least once to obtain at least two alternative voice recognition models;
and the model selection module is used for selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models.
In one possible implementation, the connection manner includes a double-chain (bi-chain-styled) manner, a chain (chain-styled) manner, or a dense (densely-connected) manner.
In one possible implementation, the apparatus includes:
the unit acquiring module is used for acquiring at least one feature extracting unit based on a plurality of feature extracting networks, and each acquired feature extracting unit comprises at least one feature extracting network.
In one possible implementation, the network connection module includes:
a unit selection submodule for selecting at least one feature extraction unit from the plurality of feature extraction units at least once;
and the unit connection submodule is used for connecting the plurality of networks in the first speech recognition model with the selected at least one feature extraction unit in at least two connection manners to obtain the at least two alternative speech recognition models.
In a possible implementation manner, the unit selecting sub-module includes:
a first quantity selecting unit for selecting any one quantity from a second quantity range;
and the unit selecting unit is used for selecting the number of the feature extracting units from the plurality of feature extracting units.
In a possible implementation manner, the unit selecting unit is configured to determine a plurality of unit sets corresponding to the number, where each unit set includes the number of feature extracting units; and selecting each feature extraction unit in any unit set.
In one possible implementation manner, the unit obtaining module includes:
the first unit acquisition submodule is used for selecting one feature extraction network from a first network set and determining that feature extraction network as the feature extraction unit; or,
the second unit acquisition submodule is used for selecting at least two feature extraction networks from the first network set and connecting the at least two feature extraction networks to obtain the feature extraction unit;
wherein the first network set comprises a plurality of candidate feature extraction networks.
In one possible implementation manner, the second unit obtaining sub-module includes:
a second quantity selecting unit, configured to select any one quantity from a first quantity range, where the quantity in the first quantity range is not less than 2;
and the network selecting unit is used for selecting the number of the feature extraction networks from the first network set.
In a possible implementation manner, the first network set includes a plurality of different second network sets, and the network selecting unit is configured to determine, from the first network set, a plurality of second network sets corresponding to the number, where each second network set corresponding to the number includes the number of feature extraction networks; and selecting each feature extraction network in one second network set corresponding to the number.
In a possible implementation manner, the second unit obtaining sub-module is configured to connect the at least two feature extraction networks in at least two connection manners to obtain at least two feature extraction units.
In a possible implementation manner, the connection manner between the at least two feature extraction networks includes a double-chain (bi-chain-styled) manner, a chain (chain-styled) manner, or a dense (densely-connected) manner.
In one possible implementation, the apparatus further includes:
the performance determination module is used for acquiring a test set, wherein the test set comprises a first sample voice and a first sample text corresponding to the first sample voice; and respectively recognizing the first sample voice based on each alternative voice recognition model, and determining the recognition performance of each alternative voice recognition model according to the recognized text and the first sample text.
In one possible implementation, the apparatus further includes:
the first training module is used for acquiring a first training set, wherein the first training set comprises second sample voices and second sample texts corresponding to the second sample voices; and respectively recognizing the second sample voice based on each alternative voice recognition model, and training each alternative voice recognition model according to the error between the recognized text and the second sample text.
In one possible implementation, the apparatus further includes:
and the voice recognition module is used for carrying out voice recognition based on the second voice recognition model.
In one possible implementation, the apparatus further includes:
the second training module is used for acquiring a second training set, and the second training set comprises third sample voice and a third sample text corresponding to the third sample voice; and respectively recognizing the third sample voice based on each alternative voice recognition model, and training the second voice recognition model according to the error between the recognized text and the third sample text.
In one possible implementation, the apparatus further includes:
a model updating module for creating a second feature extraction unit identical to the first feature extraction unit in response to a selection operation of the first feature extraction unit in the second speech recognition model; and adding the second feature extraction unit into the second voice recognition model, and connecting the second feature extraction unit with the first feature extraction unit to obtain the updated second voice recognition model.
In a possible implementation manner, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C × T × F, indicating that the speech feature has C channels, T time steps, and F frequency bins, where C, T, and F are positive integers;
the process of performing speech recognition based on the attention network includes:
transforming the shape of the speech feature to T × Z, so that the transformed speech feature no longer contains the channel and frequency dimensions and the feature size at each time step is Z, where Z is the product of C and F; and
determining attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature with the attention weights, restoring the shape of the weighted speech feature to C × T × F, and outputting the shape-restored speech feature.
In another aspect, an electronic device is provided, which includes a processor and a memory, where at least one computer program is stored in the memory, and the computer program is loaded by the processor and executed to implement the operations performed in the speech recognition method in any one of the above possible implementation manners.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, and the computer program is loaded and executed by a processor to implement the operations performed in the speech recognition method in any one of the above possible implementation manners.
In yet another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising a computer program stored in a computer-readable storage medium. The processor of the electronic device reads the computer program from the computer-readable storage medium and executes it, so that the electronic device performs the operations performed in the speech recognition method in the above-described various optional implementations.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
in the embodiment of the application, the structure of the speech recognition model is not designed artificially by a user, but a plurality of candidate speech recognition models are automatically created by adding a feature extraction network in a first speech recognition model, and then a required second speech recognition model is selected from the candidate speech recognition models according to the recognition performance, so that the obtained structure of the second speech recognition model can get rid of the limitation of artificial experience. In addition, the attention network is included in the second voice recognition model, so that the voice recognition performance of the voice recognition model can be improved by utilizing the attention mechanism when the second voice recognition model carries out voice recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 3 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a first speech recognition model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an alternative speech recognition model provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of an attention network provided in an embodiment of the present application;
FIG. 7 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 8 is a flow chart of a speech recognition method provided by an embodiment of the present application;
fig. 9 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 11 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 12 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," "fourth," and the like as used herein may be used herein to describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first training sample may be referred to as a training sample, and similarly, a second training sample may be referred to as a first training sample, without departing from the scope of the present application.
As used herein, the terms "at least one," "a plurality," "each," and "any," at least one of which includes one, two, or more than two, and a plurality of which includes two or more than two, each of which refers to each of the corresponding plurality, and any of which refers to any of the plurality. For example, the plurality of feature extraction networks includes 3 feature extraction networks, each of which refers to each of the 3 feature extraction networks, and any one of the 3 feature extraction networks may be the first one, the second one, or the third one.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, in which speech is expected to become one of the most promising interaction modes.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing techniques include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
According to the solution provided by the embodiments of the present application, a speech recognition model can be obtained based on artificial intelligence technologies such as speech technology, natural language processing, and machine learning, and speech recognition can be performed through the speech recognition model.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 are connected via a wireless or wired network. Optionally, the terminal 101 is a smartphone, tablet, laptop, desktop computer, smart speaker, smart watch, in-vehicle terminal, video camera, smart hardware/home, medical device, or other terminal. Optionally, the server 102 is an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, cloud database, cloud computing, cloud function, cloud storage, web service, cloud communication, middleware service, domain name service, security service, CDN (Content Delivery Network), big data and artificial intelligence platform.
Optionally, the terminal 101 has installed thereon a target application served by the server 102, and the terminal 101 can implement functions such as data transmission, message interaction, and the like through the target application. Optionally, the target application is a target application in an operating system of the terminal 101, or a target application provided by a third party. The target application has a voice recognition function, and certainly, the target application can also have other functions, which is not limited in this embodiment of the present application. Optionally, the target application is a short video application, a music application, a gaming application, a shopping application, a chat application, or other applications, which the present disclosure does not limit.
In this embodiment, the terminal 101 or the server 102 is configured to obtain a first speech recognition model, adjust a structure of the first speech recognition model to obtain a second speech recognition model, and perform speech recognition based on the second speech recognition model. Or, the server 102 is configured to adjust based on the first speech recognition model to obtain a second speech recognition model, send the second speech recognition model to the terminal 101, and then the terminal 101 receives the second speech recognition model and performs speech recognition based on the second speech recognition model.
The speech recognition method can be applied to various speech recognition scenarios. For example, after the server obtains the second speech recognition model through the speech recognition method provided by the present application, the server provides the terminal with a calling interface for the second speech recognition model; after receiving speech input by the user, the terminal calls the second speech recognition model through this interface to recognize the speech and outputs the corresponding text. Alternatively, after the server obtains the second speech recognition model, the terminal acquires the model from the server and stores it; subsequently, upon receiving speech input by the user, the terminal calls the stored second speech recognition model to recognize the speech and outputs the corresponding text.
The speech recognition method provided by the embodiments of the present application can also be applied to intelligent question-answering scenarios. For example, after the terminal obtains the second speech recognition model by the method provided by the present application, it performs speech recognition on input speech to obtain the corresponding text, then obtains a reply text corresponding to that text and outputs it, or converts the reply text into speech and outputs the converted speech. For example, the user inputs the speech "How is the weather today?"; the terminal recognizes the speech to obtain the corresponding text "How is the weather today?" and then searches for a reply text corresponding to it. If the found reply text is "Sunny", the terminal outputs that text, or converts it into the speech "Sunny" and outputs the speech.
In fact, the speech recognition method provided by the present application can also be applied to other speech recognition scenarios, and the embodiment of the present application does not limit this.
Fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present application. Referring to fig. 2, the embodiment includes:
201. the electronic equipment acquires a first voice recognition model, wherein the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection mode among the input network, the first feature extraction unit and the output network is determined, and the first feature extraction unit comprises an attention network.
The function of the first speech recognition model is to perform speech recognition, i.e., to convert speech input to the model into the corresponding text. The first speech recognition model includes an input network, a first feature extraction unit, and an output network. The input network extracts features from the input speech and outputs speech features. Optionally, the speech features include MFCCs (Mel-Frequency Cepstral Coefficients), Fbank (filter-bank) features, and the like, represented in the form of a speech spectrogram. The first feature extraction unit further extracts features from the input speech features and outputs speech features. The output network converts the input speech features into the corresponding text and outputs the text. The first feature extraction unit includes an attention network, which further extracts features from the input speech features and, when doing so, uses the attention mechanism to ensure the accuracy of the extracted features. Optionally, in addition to the attention network, the first feature extraction unit may further include other networks, such as a convolutional network and a pooling network.
Optionally, there are a plurality of first feature extraction units. Optionally, the connection manner among the input network, the first feature extraction unit, and the output network is an arbitrary connection manner, which is not limited in the embodiments of the present application.
It should be noted that the speech features in the embodiments of the present application can also be referred to as feature maps (Feature Mapping) between the hidden layers of the neural network.
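A skeleton of the first speech recognition model described in step 201, in PyTorch; every concrete module here is a placeholder, since the embodiments leave the networks' internals open.

    import torch.nn as nn

    class FirstSpeechRecognitionModel(nn.Module):
        def __init__(self, input_net, attention_net, output_net):
            super().__init__()
            self.input_net = input_net     # extracts speech features (e.g. Fbank/MFCC)
            # First feature extraction unit: starts with only the attention
            # network; the search later adds feature extraction networks to it.
            self.unit = nn.ModuleList([attention_net])
            self.output_net = output_net   # converts speech features into text

        def forward(self, speech):
            feats = self.input_net(speech)
            for net in self.unit:
                feats = net(feats)
            return self.output_net(feats)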
202. The electronic equipment adds at least one feature extraction network to the first feature extraction unit at least once and is connected with the attention network to obtain an alternative voice recognition model.
The feature extraction network further extracts features from the input speech features and outputs speech features. Feature extraction networks include convolutional networks, pooling networks, and the like. Moreover, a given type of feature extraction network can have various structures; for example, a convolutional network may be a 1 × 1 convolutional network, a 3 × 3 convolutional network, and so on.
Optionally, the connection mode of the at least one feature extraction network and the attention network is any connection mode, which is not limited in this embodiment of the present application.
In the case where at least one feature extraction network is added to the first feature extraction unit multiple times, the number of feature extraction networks the electronic device adds each time may be the same or different. For example, 1 feature extraction network is added the first time, and either 1 or 2 feature extraction networks are added the second time. When the numbers added are the same across additions, the networks added each time may themselves be the same or different. For example, a 1 × 1 convolutional network is added the first time, and either a 3 × 3 convolutional network or another 1 × 1 convolutional network is added the second time. When the same network is added multiple times, the manner in which it is added to the first feature extraction unit differs. For example, a 1 × 1 convolutional network is added both times, but in different ways: the first time, the convolutional network is added as the upper network preceding the attention network in the first feature extraction unit; the second time, it is added as the lower network following the attention network.
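Continuing the skeleton sketched above, adding the same network "in a different manner" can be expressed as inserting it before (upper network) or after (lower network) the attention network within the unit; the helper names are illustrative:

    def add_as_upper_network(model, net):
        # the added network precedes the attention network in the unit
        model.unit.insert(0, net)

    def add_as_lower_network(model, net):
        # the added network follows the attention network in the unit
        model.unit.append(net)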
203. And the electronic equipment responds to the obtained at least two alternative voice recognition models, and selects a second voice recognition model for voice recognition from the at least two alternative voice recognition models according to the recognition performance of the at least two alternative voice recognition models.
The recognition performance represents the voice recognition effect of the voice recognition model, and the better the recognition performance is, the better the voice recognition effect is. Alternatively, the recognition performance of the speech recognition model is represented by a recognition accuracy. Of course, the recognition performance can also be expressed by other parameters, for example, the recognition efficiency, which is not limited by the embodiment of the present application.
Optionally, the second speech recognition model is the candidate with the highest recognition accuracy among the at least two candidate speech recognition models; or the candidate with the simplest structure whose recognition accuracy reaches an accuracy threshold; or any candidate whose recognition accuracy reaches the accuracy threshold; or the candidate with the highest recognition efficiency; or any candidate whose recognition efficiency reaches an efficiency threshold. These ways of selecting the second speech recognition model from the candidate speech recognition models are merely exemplary, and the embodiments of the present application do not limit them.
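Two of the selection rules above, written out as a sketch (the accuracy and complexity measures are assumed to have been computed beforehand, e.g. via the test set described later; both mappings are illustrative):

    def select_second_model(candidates, accuracy, complexity, threshold=0.9):
        # accuracy, complexity: mappings from candidate model to measured value
        # Rule: highest recognition accuracy.
        best = max(candidates, key=lambda m: accuracy[m])
        # Rule: simplest structure whose accuracy reaches the threshold.
        eligible = [m for m in candidates if accuracy[m] >= threshold]
        simplest = min(eligible, key=lambda m: complexity[m]) if eligible else best
        return best, simplest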
It should be noted that steps 202-203 are actually a process of searching for the structure of the second speech recognition model: on the basis of the first speech recognition model, a plurality of candidate speech recognition models with different structures are obtained by searching, and the second speech recognition model is then selected from these candidates.
After the second speech recognition model is selected, speech recognition can be performed based on it. For example, a target speech is input into the second speech recognition model, which recognizes the target speech and outputs the text corresponding to it.
It should be noted that the speech recognition model in this application may be any neural network, for example a CNN (Convolutional Neural Network), which is not limited here.
In the embodiments of this application, the structure of the speech recognition model is not hand-designed by a user. Instead, a plurality of candidate speech recognition models are created automatically by adding feature extraction networks to a first speech recognition model, and the required second speech recognition model is then selected from the candidates according to recognition performance, so that the structure of the second speech recognition model is freed from the limitations of human experience. In addition, the second speech recognition model includes an attention network, so the attention mechanism can be used to improve its recognition performance when it performs speech recognition.
Fig. 3 is a flowchart of a speech recognition method according to an embodiment of the present application. Referring to fig. 3, the embodiment includes:
301. The electronic device acquires a first speech recognition model, where the first speech recognition model includes an input network, a first feature extraction unit, and an output network, the connections among the input network, the first feature extraction unit, and the output network are determined, and the first feature extraction unit includes an attention network.
In one possible implementation, the electronic device obtains the first speech recognition model as follows: the electronic device connects a plurality of first feature extraction units in a bi-chain-typed, chain-typed, or densely-connected manner to obtain a unit chain, and then connects an input network and an output network to the two ends of the unit chain to obtain the first speech recognition model. Of course, the plurality of first feature extraction units can also be connected in other manners, which is not limited in the embodiments of this application.
In the embodiment of the application, the plurality of feature extraction units are connected according to the determined connection mode, so that in the process of performing structure search based on the first speech recognition model to obtain the candidate speech recognition model, the connection mode among the plurality of feature extraction units does not need to participate in the search, and the efficiency of performing model structure search can be improved.
In a possible implementation manner, the first speech recognition model further includes a second feature extraction unit, the second feature extraction unit does not include an attention network, and a connection manner between the input network, the first feature extraction unit, the second feature extraction unit, and the output network is determined.
Optionally, the connection modes between the input network, the first feature extraction unit, the second feature extraction unit and the output network are arbitrary connection modes. Alternatively, the number of the second feature extraction units in the first speech recognition model is any number.
Optionally, in a case that the first speech recognition model includes the second feature extraction unit, the electronic device acquires the first speech recognition model as follows: the electronic device connects the plurality of first feature extraction units and the plurality of second feature extraction units in a target connection manner to obtain a unit chain, and then connects an input network and an output network to the two ends of the unit chain to obtain the first speech recognition model. Optionally, the target connection manner includes bi-chain-typed or densely-connected, which is not limited in the embodiments of this application.
In one possible implementation, the first speech recognition model includes N-1 first feature extraction units and N unit groups, where each unit group includes M second feature extraction units and the second feature extraction units do not include an attention network. The networks in the first speech recognition model are connected as follows: the two ends of the first speech recognition model are the input network and the output network, a unit group follows the input network, a unit group precedes the output network, and a first feature extraction unit is connected between every two unit groups, where N is an integer greater than 1 and M is a positive integer. For example, N is 3 and M is 5, which is not limited in the embodiments of this application. Taking N = 3 as an example, the networks in the first speech recognition model are connected in the order: input network, unit group, first feature extraction unit, unit group, first feature extraction unit, unit group, output network.
It should be noted that experiments show that a first speech recognition model connected in the above manner has better speech recognition performance than one connected in other manners.
FIG. 4 is a schematic diagram of a first speech recognition model. Referring to fig. 4, the number of first feature extraction units in the first speech recognition model is 2 and the number of unit groups is 3, that is, the number of second feature extraction units is 3 × M. The structure of the first speech recognition model is: the input network is followed by M second feature extraction units, which are followed by a first feature extraction unit, which is followed by M second feature extraction units, which are followed by another first feature extraction unit, which is followed by M second feature extraction units, which are followed by the output network, where M is any positive integer. Optionally, the input network includes two convolutional layers. Optionally, the output network includes a fully connected layer and a normalization layer; of course, the input network and the output network can also include other layers, which is not limited in the embodiments of this application.
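Purely for illustration, the following is a minimal PyTorch-style sketch of this skeleton with N = 3; the layer choices inside the placeholder units, the channel width, and the output head are all assumptions of the sketch, not the structure fixed by this application:

    import torch.nn as nn

    class FirstUnit(nn.Module):
        # Placeholder for a first feature extraction unit (contains the attention
        # network in the real model); the stride-2 convolution stands in for its
        # resolution-halving behavior, an assumption described further below.
        def __init__(self, c):
            super().__init__()
            self.body = nn.Conv2d(c, c, 3, stride=2, padding=1)
        def forward(self, x):
            return self.body(x)

    class SecondUnit(nn.Module):
        # Placeholder for a second feature extraction unit (no attention network,
        # resolution preserved).
        def __init__(self, c):
            super().__init__()
            self.body = nn.Conv2d(c, c, 3, stride=1, padding=1)
        def forward(self, x):
            return self.body(x)

    def build_first_model(c=32, n=3, m=5, vocab=100):
        layers = [nn.Conv2d(1, c, 3, padding=1),
                  nn.Conv2d(c, c, 3, padding=1)]          # input network: two conv layers
        for i in range(n):
            layers += [SecondUnit(c) for _ in range(m)]   # one unit group of M second units
            if i < n - 1:
                layers.append(FirstUnit(c))               # a first unit between unit groups
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                   nn.Linear(c, vocab), nn.LogSoftmax(dim=1)]  # output network: FC + normalization
        return nn.Sequential(*layers)

    model = build_first_model()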
302. The electronic device selects at least one feature extraction network from a first network set at least once, adds the at least one feature extraction network to the first feature extraction unit, and connects it with the attention network to obtain a candidate speech recognition model.
The first network set includes a plurality of candidate feature extraction networks. Optionally, the candidate feature extraction networks are of various types, such as convolutional networks and pooling networks. Optionally, a given type of feature extraction network has a plurality of structures; for example, convolutional networks include a 1 × 1 convolutional network, a 3 × 3 convolutional network, and so on.
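As a concrete but purely illustrative rendering of such a first network set, the candidate networks might be kept as named constructors; the specific operations listed here are assumptions chosen for the example, not the set used by this application:

    import torch.nn as nn

    # Hypothetical first network set: each entry maps a name to a constructor
    # for a candidate feature extraction network over c channels.
    FIRST_NETWORK_SET = {
        "conv1x1": lambda c: nn.Conv2d(c, c, kernel_size=1),
        "conv3x3": lambda c: nn.Conv2d(c, c, kernel_size=3, padding=1),
        "conv5x5": lambda c: nn.Conv2d(c, c, kernel_size=5, padding=2),
        "maxpool3x3": lambda c: nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
        "avgpool3x3": lambda c: nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
    }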
In one possible implementation, the electronic device selects at least one feature extraction network from the first network set as follows: the electronic device selects an arbitrary number from a first number range, and then selects that number of feature extraction networks from the first network set. Optionally, the first number range is any number range, for example 1 to 10, which is not limited in the embodiments of this application.
The number of feature extraction networks selected by the electronic device determines the number of network layers in the first feature extraction unit. For example, if the first feature extraction unit originally contains one attention network layer and the electronic device selects 3 feature extraction networks and adds them to it, the number of network layers in the first feature extraction unit becomes 4.
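A minimal sketch of this sampling step, under the assumptions that the first number range is the 1-to-10 example above and that the same candidate may be picked more than once (both illustrative):

    import random

    def sample_feature_networks(pool, low=1, high=10):
        count = random.randint(low, high)            # step 1: pick a count in the range
        return random.choices(list(pool), k=count)   # step 2: pick that many (repeats allowed)

    pool = ["conv1x1", "conv3x3", "maxpool3x3"]      # illustrative candidate names
    print(sample_feature_networks(pool))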
FIG. 5 is a schematic diagram of a candidate speech recognition model. Referring to fig. 5, the feature extraction units in the candidate speech recognition model are connected in a bi-chain-typed manner: except for the units at the head of the chain, the input features of each remaining feature extraction unit are the output features of the two feature extraction units preceding it. Each feature extraction unit internally contains 4 feature extraction networks connected in a densely-connected manner, that is, every two of them are connected to each other, and the input features of each feature extraction network in a unit are the output features of all the feature extraction networks before it in that unit. Optionally, the 4 feature extraction networks in a feature extraction unit are arbitrary networks.
In one possible implementation, the first network set includes a plurality of different second network sets. Correspondingly, the electronic device selects the chosen number of feature extraction networks from the first network set as follows: the electronic device determines, from the first network set, the second network sets corresponding to that number, and selects every feature extraction network in one of those second network sets, where each second network set corresponding to the number contains exactly that number of feature extraction networks.
Because the candidate network set contains a plurality of different feature extraction networks, a given number of selected feature extraction networks can be combined in a plurality of ways. For example, if 2 feature extraction networks are to be selected, the two may be: a convolutional network and a pooling network, two convolutional networks with different structures, two convolutional networks with the same structure, and so on. The first network set therefore includes a plurality of second network sets, each corresponding to one combination of feature extraction networks.
In the embodiments of this application, since the first network set includes a plurality of second network sets and each second network set corresponds to one combination of feature extraction networks, selecting feature extraction networks by second network set improves the efficiency of selecting any number of feature extraction networks from the first network set and ensures that a different combination is selected each time, thereby ensuring that the candidate speech recognition models built from the selected feature extraction networks differ in structure.
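Under the same illustrative assumptions, the second network sets for a given count can be enumerated as combinations over the candidate pool; enumeration yields each combination once, matching the guarantee that a combination is never selected twice, and repeats within a combination are allowed since the same structure may appear twice in a unit:

    from itertools import combinations_with_replacement

    def second_network_sets(pool, count):
        # Each tuple is one "second network set": a distinct combination
        # of `count` candidate networks (repeats within a tuple allowed).
        return list(combinations_with_replacement(sorted(pool), count))

    print(second_network_sets(["conv1x1", "conv3x3", "pool3x3"], 2))
    # [('conv1x1', 'conv1x1'), ('conv1x1', 'conv3x3'), ('conv1x1', 'pool3x3'), ...]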
In one possible implementation, the first speech recognition model includes a plurality of first feature extraction units whose interconnections are determined, and the connection mode of the networks within a first feature extraction unit differs from the connection mode among the first feature extraction units. For example, the first feature extraction units are connected to one another in a bi-chain-typed manner while the networks within each first feature extraction unit are densely-connected. This enriches the connection modes among the structural modules of the candidate speech recognition models and, with them, the variety of candidate structures.
In one possible implementation, the electronic device adds the at least one feature extraction network to the first feature extraction unit and connects it with the attention network in a bi-chain-typed, chain-typed, or densely-connected manner to obtain the candidate speech recognition model. Of course, the at least one feature extraction network and the attention network can also be connected in other manners, which is not limited in the embodiments of this application.
In the embodiments of this application, multiple connection modes between the feature extraction networks and the attention network are provided, so after the at least one feature extraction network is obtained, it can be connected with the attention network in the first feature extraction unit in multiple manners. This yields multiple candidate speech recognition models with different structures, increases the number of candidates, and makes it easier to select from them a second speech recognition model with higher recognition performance. Moreover, since every candidate speech recognition model contains the attention network, the attention mechanism can be used to improve recognition performance when the second speech recognition model performs speech recognition.
In a possible implementation manner, in a case that the first speech recognition model further includes the second feature extraction unit, the method further includes: and the electronic equipment adds at least one feature extraction network to the second feature extraction unit at least once to obtain the alternative voice recognition model.
Optionally, in a case that the second feature extraction unit does not include a feature extraction network, the electronic device adds at least one feature extraction network to the second feature extraction unit in an implementation manner that: the electronic equipment connects the plurality of feature extraction networks to obtain a network chain, the input end of the network chain is determined as the input end of the second feature extraction unit, and the output end of the network chain is determined as the output end of the second feature extraction unit. In the case that the second feature extraction unit includes the feature extraction network, the electronic device adds at least one feature extraction network to the second feature extraction unit in an implementation manner that: the electronic device connects at least one feature extraction network with an original feature extraction network in a second feature extraction unit to obtain a network chain, determines an input end of the network chain as an input end of the second feature extraction unit, and determines an output end of the network chain as an output end of the second feature extraction unit.
Optionally, the electronic device adds at least one feature extraction network to the second feature extraction unit to obtain an alternative speech recognition model, including: and the electronic equipment adds at least one feature extraction network into the second feature extraction unit in different adding modes to obtain different alternative speech recognition models. Optionally, in a case that the second feature extraction unit does not include a feature extraction network, the electronic device connects the plurality of feature extraction networks in different connection manners to obtain a network chain, determines an input end of the network chain as an input end of the second feature extraction unit, and determines an output end of the network chain as an output end of the second feature extraction unit. Optionally, in a case that the second feature extraction unit includes a feature extraction network, the electronic device connects at least one feature extraction network with an original feature extraction network in the second feature extraction unit in a different manner to obtain a network chain, determines an input end of the network chain as an input end of the second feature extraction unit, and determines an output end of the network chain as an output end of the second feature extraction unit.
The manner of acquiring the at least one feature extraction network added to the second feature extraction unit is the same as the manner of acquiring the at least one feature extraction network added to the first feature extraction unit, and is not described herein again.
It should be noted that step 302 is in effect a search over the structure of the first feature extraction unit; since the first speech recognition model includes the first feature extraction unit, searching the structure of the first feature extraction unit is searching the structure of the candidate speech recognition model. Similarly, when the first speech recognition model includes the second feature extraction unit, the process in which the electronic device adds at least one feature extraction network to the second feature extraction unit at least once to obtain candidate speech recognition models is a search over the structure of the second feature extraction unit. Optionally, the two searches are performed separately. For example, the electronic device first searches the structure of the first feature extraction unit and then that of the second: after adding at least one feature extraction network to the first feature extraction unit multiple times to obtain multiple candidate speech recognition models, for each candidate the electronic device adds at least one feature extraction network to the second feature extraction unit in that candidate at least once to obtain a new candidate. Or the electronic device first searches the structure of the second feature extraction unit and then that of the first, in the symmetric manner. Or the structures of the first and second feature extraction units are searched simultaneously; for example, each time at least one feature extraction network is added to the first feature extraction unit to obtain a candidate speech recognition model, at least one feature extraction network is also added to the second feature extraction unit in that candidate to obtain a new candidate.
Optionally, apart from the attention network, the first feature extraction unit and the second feature extraction unit have different network structures. Compared with the speech features input to the first feature extraction unit, the speech features output from it have the length and width of their feature size each reduced by half, that is, the time-frequency resolution is halved. The feature size of the speech features output from the second feature extraction unit is unchanged relative to its input, that is, the frequency-domain resolution is preserved.
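A one-look illustration of these two behaviors, under the assumption that the halving is realized with a stride-2 convolution (the actual mechanism is not fixed by this application):

    import torch
    import torch.nn as nn

    halving = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)     # first-unit-like
    preserving = nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1)  # second-unit-like

    x = torch.randn(1, 32, 100, 80)   # (batch, channels, T, F)
    print(halving(x).shape)           # torch.Size([1, 32, 50, 40]): T and F halved
    print(preserving(x).shape)        # torch.Size([1, 32, 100, 80]): resolution unchanged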
It should be noted that step 302 describes one implementation of adding at least one feature extraction network to the first feature extraction unit at least once and connecting it with the attention network to obtain candidate speech recognition models. Of course, the candidate speech recognition models can also be obtained in other ways, for example by adding the at least one feature extraction network to the first feature extraction unit in different manners and connecting it with the attention network to obtain different candidate speech recognition models.
In the embodiment of the application, after the at least one feature extraction network is obtained, the at least one feature extraction network is added to the first feature extraction unit in different adding manners, so that a plurality of alternative speech recognition models with different structures can be obtained, the number of the alternative speech recognition models is increased, and a second speech recognition model with higher recognition performance can be conveniently selected from the alternative speech recognition models.
Optionally, the electronic device adds the at least one feature extraction network to the first feature extraction unit as follows: the electronic device connects the at least one feature extraction network with the attention network in the first feature extraction unit to obtain a network chain, determines the input end of the network chain as the input end of the first feature extraction unit, and determines the output end of the network chain as the output end of the first feature extraction unit. Correspondingly, adding the at least one feature extraction network to the first feature extraction unit in different manners to obtain different candidate speech recognition models means connecting the at least one feature extraction network and the attention network in different connection modes to obtain different network chains, each of whose input and output ends become those of the first feature extraction unit. Optionally, one such connection mode is: the electronic device connects the plurality of feature extraction networks in different connection modes to obtain a network chain, and then connects the attention network to the output end of that chain to obtain the final network chain.
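A sketch of this enumeration, assuming chain-typed orderings of the selected networks with a stand-in attention module attached at the chain's output end (both assumptions made for illustration):

    import torch.nn as nn
    from itertools import permutations

    def chains_with_attention(ops, attention):
        # Each ordering of the selected feature extraction networks gives one
        # network chain; the attention network is connected at its output end.
        for order in permutations(range(len(ops))):
            yield nn.Sequential(*[ops[i] for i in order], attention)

    ops = [nn.Conv2d(32, 32, 1), nn.Conv2d(32, 32, 3, padding=1)]
    units = list(chains_with_attention(ops, nn.Identity()))  # Identity stands in for attention
    print(len(units))  # 2 differently wired first feature extraction units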
303. The electronic device determines a recognition performance of each alternative speech recognition model in response to obtaining at least two alternative speech recognition models.
In one possible implementation, the electronic device determines the recognition performance of each candidate speech recognition model as follows: the electronic device obtains a test set that includes first sample speech and the first sample text corresponding to it; the electronic device then recognizes the first sample speech with each candidate speech recognition model and determines the recognition performance of each candidate from the recognized text and the first sample text.
In the embodiment of the application, the recognition performance of each candidate speech recognition model is determined by using the test set, so that a second speech recognition model with good speech recognition performance can be conveniently selected from the candidate speech recognition models, and the recognition performance of speech recognition based on the second speech recognition model is ensured.
Optionally, the electronic device determines the recognition performance of each candidate speech recognition model from the recognized text and the first sample text as follows: the electronic device computes a loss value for each candidate speech recognition model from the recognized text and the first sample text, where the loss value represents the recognition performance of the candidate and is negatively correlated with it, that is, the smaller the loss value, the better the recognition performance. Optionally, the electronic device computes the loss value of each candidate speech recognition model with an arbitrary loss function, which is not limited in the embodiments of this application.
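A minimal sketch of such an evaluation, assuming the candidate model is a callable that returns the recognized text and using exact-match accuracy purely for illustration (a real system would more likely use character or word error rate):

    def evaluate(model, test_set):
        # test_set: list of (first_sample_speech, first_sample_text) pairs.
        correct = sum(model(speech) == text for speech, text in test_set)
        return correct / len(test_set)   # recognition accuracy in [0, 1]

    # Rank the candidates by measured performance:
    # performances = {name: evaluate(m, test_set) for name, m in candidates.items()}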
In one possible implementation, before determining the recognition performance of each candidate speech recognition model, the electronic device trains each candidate as follows: the electronic device obtains a first training set that includes second sample speech and the second sample text corresponding to it; the electronic device then recognizes the second sample speech with each candidate speech recognition model and trains each candidate according to the error between the recognized text and the second sample text.
Optionally, the electronic device trains each candidate speech recognition model according to the error between the recognized text and the second sample text by adjusting the model parameters of each candidate so as to reduce the error between the text recognized by the adjusted candidate and the second sample text. Optionally, the number of second sample speeches in the first training set used to train the candidate speech recognition models is arbitrary, which is not limited in the embodiments of this application.
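A generic sketch of this parameter adjustment, assuming a PyTorch model and leaving the loss function open, since any loss function may be used here (the optimizer choice is also an assumption):

    import torch

    def train_candidate(model, train_set, loss_fn, epochs=1, lr=1e-3):
        # Adjust model parameters so that the error between the recognized
        # output and the second sample text decreases.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for speech, target in train_set:
                optimizer.zero_grad()
                loss = loss_fn(model(speech), target)
                loss.backward()
                optimizer.step()
        return model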
In the embodiment of the application, before the recognition performance of each candidate speech recognition model is determined, each candidate speech recognition model is trained by using the first training set, and then when a second speech recognition model is selected from the candidate speech recognition models based on the recognition performance, the second speech recognition model with strong learning capability and generalization capability can be selected.
304. The electronic device selects, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for speech recognition from among them.
Optionally, the electronic device selects the second speech recognition model from the at least two candidates according to their recognition performance by choosing the candidate with the best recognition performance. For example, the electronic device chooses the candidate with the highest recognition accuracy, which ensures that subsequent speech recognition based on the second speech recognition model is the most accurate.
Optionally, the electronic device selects the second speech recognition model according to the recognition accuracy of the at least two candidates by choosing, from among the candidates whose recognition accuracy is greater than an accuracy threshold, the one with the simplest model structure. This improves the efficiency of speech recognition with the second speech recognition model while still guaranteeing its accuracy. Of course, the second speech recognition model can also be selected in other manners, for example by taking any candidate whose recognition accuracy is greater than the accuracy threshold, which is not limited in the embodiments of this application.
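The two selection rules just described can be sketched as follows; using parameter count as the measure of "simplest structure" is an assumption of the sketch:

    def select_second_model(accuracies, sizes, accuracy_threshold=None):
        # accuracies: {model name: recognition accuracy};
        # sizes: {model name: parameter count}, a proxy for structural simplicity.
        if accuracy_threshold is None:
            return max(accuracies, key=accuracies.get)        # highest accuracy
        qualified = [n for n, a in accuracies.items() if a > accuracy_threshold]
        return min(qualified, key=lambda n: sizes[n])         # simplest qualified model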
In one possible implementation, after the electronic device selects the second speech recognition model from the at least two candidates according to their recognition performance, the method further includes: in response to a selection operation on a third feature extraction unit in the second speech recognition model, the electronic device creates a fourth feature extraction unit identical to the third feature extraction unit, adds the fourth feature extraction unit to the second speech recognition model, and connects it with the third feature extraction unit to obtain an updated second speech recognition model. Optionally, the third feature extraction unit is any feature extraction unit in the second speech recognition model.
Optionally, the electronic device obtains the updated second speech recognition model by inserting the fourth feature extraction unit between the third feature extraction unit and another feature extraction unit, where the inserted fourth feature extraction unit is connected to the units above and below it in the same way the third feature extraction unit was connected to its neighbors before the insertion. Of course, the electronic device can also add the fourth feature extraction unit to the second speech recognition model in other ways, which is not limited in the embodiments of this application. Optionally, the number of fourth feature extraction units is arbitrary, and each fourth feature extraction unit is added to the second speech recognition model in the same way.
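A sketch of this insertion under the simplifying assumption that the second speech recognition model is a plain chain of units (the real connection pattern may be richer):

    import copy
    import torch.nn as nn

    def deepen(model, index):
        # Clone the selected ("third") unit and insert the copy directly after
        # it; in a plain chain the copy inherits the same neighboring
        # connections the original had.
        units = list(model)
        units.insert(index + 1, copy.deepcopy(units[index]))
        return nn.Sequential(*units)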
In the embodiment of the present application, by adding the feature extraction unit that is the same as the existing feature extraction unit to the obtained second speech recognition model, the depth of the second speech recognition model can be increased, and the recognition performance of the second speech recognition model can be further improved.
305. The electronic device performs speech recognition based on the second speech recognition model.
In one possible implementation, in the process of speech recognition based on the second speech recognition model, the speech features input to the attention network have shape C × T × F, meaning the channel dimension has size C, the time dimension has size T, and the frequency dimension has size F, where C, T, and F are positive integers. Speech recognition based on the attention network then proceeds as follows: the electronic device transforms the shape of the speech features to T × Z, so that the transformed features no longer contain separate channel and frequency dimensions and the feature size at each time step is Z, where Z is the product of C and F; the electronic device determines the attention weights corresponding to the speech features from the transformed features, weights the transformed features with those attention weights, restores the shape of the weighted speech features to C × T × F, and outputs the restored features.
In the embodiments of this application, the shape of the speech features is transformed based on the attention network so that the transformed features no longer contain separate channel and frequency dimensions. When the attention weights are then generated from the transformed features, they are no longer confined to the features within a single channel but can draw on the correlations between channels, so the generated attention weights are more accurate. This improves the accuracy of the speech features output by the attention network and, with it, the speech recognition performance.
Alternatively, the Attention network may be any type of Attention network, such as a Self-Attention (Self-Attention) network, a Multi-Head Attention (Multi-Head Attention) network, a Multi-Head Self-Attention (Multi-Head Self-Attention) network, and the like, which is not limited in this embodiment.
Fig. 6 is a schematic structural diagram of an attention network. Referring to fig. 6, the portion in the dashed box represents the attention network, whose upper and lower neighboring networks are each convolutional networks. The speech features input to the attention network have shape C × T × F; they are reshaped into speech features of shape T × Z, where Z is the product of C and F. The T × Z features are then mapped through three fully connected layers to obtain the Q (queries), K (keys), and V (values) of the attention mechanism, where Q, K, and V each represent a matrix formed from the input speech features at different time steps and each have shape T × Z. K is transposed to shape Z × T, and Q is multiplied by the transposed K to obtain a T × T matrix representing the attention weights between the output speech features at each time step and the input speech features at the different time steps. This matrix passes through a softmax (normalization) layer to obtain normalized attention weights, which are multiplied by V to obtain the output speech features. The shape of the output speech features is restored to C × T × F before they are input into the next convolutional network. Here, the input speech features are the features input into the attention network and the output speech features are the features output from the attention network.
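Purely as an illustration of this data flow, the following is a minimal PyTorch sketch of such a time-axis self-attention with the C × T × F reshaping. The layer sizes, batch handling, and single-head form are assumptions of the sketch; standard implementations also scale Q·Kᵀ by 1/√Z, which the figure description does not mention and which is therefore omitted here:

    import torch
    import torch.nn as nn

    class TimeAxisAttention(nn.Module):
        """Self-attention over the time axis of C x T x F speech features (sketch)."""
        def __init__(self, channels, freq_bins):
            super().__init__()
            z = channels * freq_bins          # Z = C * F
            self.q = nn.Linear(z, z)          # three fully connected layers
            self.k = nn.Linear(z, z)          # produce Q, K, and V
            self.v = nn.Linear(z, z)

        def forward(self, x):                 # x: (batch, C, T, F)
            b, c, t, f = x.shape
            h = x.permute(0, 2, 1, 3).reshape(b, t, c * f)           # C x T x F -> T x Z
            q, k, v = self.q(h), self.k(h), self.v(h)                # each (batch, T, Z)
            weights = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # (batch, T, T)
            out = weights @ v                                        # weighted sum over time
            return out.reshape(b, t, c, f).permute(0, 2, 1, 3)       # restore C x T x F

    x = torch.randn(2, 16, 100, 40)            # batch of C=16, T=100, F=40 features
    print(TimeAxisAttention(16, 40)(x).shape)  # torch.Size([2, 16, 100, 40])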
It should be noted that the embodiments of this application make full use of the attention mechanism's stronger ability to model long-range sequence correlations, and therefore search for a better-performing model structure over neural networks that use the attention mechanism.
In one possible implementation, before the electronic device performs speech recognition based on the second speech recognition model, the method further includes: the electronic device obtains a second training set that includes third sample speech and the third sample text corresponding to it; the electronic device then recognizes the third sample speech with the second speech recognition model and trains the second speech recognition model according to the error between the recognized text and the third sample text. The second training set is different from the first training set. Optionally, the number of third sample speeches in the second training set used to train the second speech recognition model is arbitrary, which is not limited in the embodiments of this application. The process of training the second speech recognition model with the second training set is the same as that of training the candidate speech recognition models with the first training set and is not repeated here.
In the embodiment of the application, after the second speech recognition model is obtained through searching, the second speech recognition model is trained through the second training set, so that the generalization capability of the second speech recognition model can be improved, and the recognition performance of the second speech recognition model is improved.
In the embodiments of this application, the structure of the speech recognition model is not hand-designed by a user. Instead, a plurality of candidate speech recognition models are created automatically by adding feature extraction networks to a first speech recognition model, and the required second speech recognition model is then selected from the candidates according to recognition performance, so that the structure of the second speech recognition model is freed from the limitations of human experience. In addition, the second speech recognition model includes an attention network, so the attention mechanism can be used to improve its recognition performance when it performs speech recognition.
In the embodiment of the application, the plurality of feature extraction units are connected according to the determined connection mode, so that in the process of performing structure search based on the first speech recognition model to obtain the candidate speech recognition model, the connection mode among the plurality of feature extraction units does not need to participate in the search, and the efficiency of performing model structure search can be improved.
In the embodiments of this application, since the first network set includes a plurality of second network sets and each second network set corresponds to one combination of feature extraction networks, selecting feature extraction networks by second network set improves the efficiency of selecting any number of feature extraction networks from the first network set and ensures that a different combination is selected each time, thereby ensuring that the candidate speech recognition models built from the selected feature extraction networks differ in structure.
In the embodiment of the application, after the at least one feature extraction network is obtained, the at least one feature extraction network is added to the first feature extraction unit in different adding manners, so that a plurality of alternative speech recognition models with different structures can be obtained, the number of the alternative speech recognition models is increased, and a second speech recognition model with higher recognition performance can be conveniently selected from the alternative speech recognition models.
In the embodiments of this application, multiple connection modes between the feature extraction networks and the attention network are provided, so after the at least one feature extraction network is obtained, it can be connected with the attention network in the first feature extraction unit in multiple manners. This yields multiple candidate speech recognition models with different structures, increases the number of candidates, and makes it easier to select from them a second speech recognition model with higher recognition performance. Moreover, since every candidate speech recognition model contains the attention network, the attention mechanism can be used to improve recognition performance when the second speech recognition model performs speech recognition.
In the embodiment of the application, the recognition performance of each candidate speech recognition model is determined by using the test set, so that a second speech recognition model with good speech recognition performance can be conveniently selected from the candidate speech recognition models, and the speech recognition effect of speech recognition based on the second speech recognition model is ensured.
In the embodiment of the application, before the recognition performance of each candidate speech recognition model is determined, each candidate speech recognition model is trained by using the first training set, and then when a second speech recognition model is selected from the candidate speech recognition models based on the recognition performance, the second speech recognition model with strong learning capability and generalization capability can be selected.
In the embodiment of the present application, by adding the feature extraction unit that is the same as the existing feature extraction unit to the obtained second speech recognition model, the depth of the second speech recognition model can be increased, and the recognition performance of the second speech recognition model can be further improved.
In the embodiments of this application, the shape of the speech features is transformed based on the attention network so that the transformed features no longer contain separate channel and frequency dimensions. When the attention weights are then generated from the transformed features, they are no longer confined to the features within a single channel but can draw on the correlations between channels, so the generated attention weights are more accurate. This improves the accuracy of the speech features output by the attention network and, with it, the speech recognition performance.
In the embodiment of the application, after the second speech recognition model is obtained through searching, the second speech recognition model is trained through the second training set, so that the generalization capability of the second speech recognition model can be improved, and the recognition performance of the second speech recognition model is improved.
Fig. 7 is a flowchart of a speech recognition method according to an embodiment of the present application. Referring to fig. 7, the embodiment includes:
701. The electronic device acquires a first speech recognition model, where the first speech recognition model includes a plurality of networks whose interconnections are not yet determined, and the plurality of networks include an input network, an attention network, and an output network.
Optionally, the first speech recognition model comprises a convolutional network, a pooling network, or other network in addition to the input network, the attention network, and the output network.
Optionally, in the first speech recognition model, the number of the various networks is any number except that the number of the input networks and the number of the output networks are 1, which is not limited in the embodiment of the present application.
702. The electronic equipment connects at least one feature extraction unit with a plurality of networks in the first speech recognition model according to at least two connection modes at least once to obtain at least two alternative speech recognition models.
When at least one feature extraction unit is connected with the plurality of networks in the first speech recognition model multiple times, the number of feature extraction units connected each time may be the same or different. For example, 1 feature extraction unit is connected the first time, and either 1 or 2 feature extraction units are connected the second time. When the numbers connected each time are the same, the feature extraction units connected may themselves be the same or different. For example, the feature extraction unit connected the first time is a 1 × 1 convolutional network, and the unit connected the second time is either a 3 × 3 convolutional network or another 1 × 1 convolutional network. When the units connected at different times are the same, they are connected with the plurality of networks in the first speech recognition model in different manners. For example, the unit connected the first time and the unit connected the second time are both a 1 × 1 convolutional network, but the two connections between the convolutional network and the plurality of networks in the first speech recognition model differ.
Here, connecting the at least one feature extraction unit with the plurality of networks in the first speech recognition model in at least two connection modes means: each of the at least one feature extraction unit and each network in the first speech recognition model are objects to be connected, and connecting these objects yields at least two candidate speech recognition models, each candidate corresponding to one connection mode.
703. And the electronic equipment selects a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performances of the at least two alternative speech recognition models.
In the embodiments of this application, the structure of the speech recognition model is not hand-designed by a user. Instead, a plurality of candidate speech recognition models are created automatically by connecting at least one feature extraction unit with the networks already present in the first speech recognition model, and the required second speech recognition model is then selected from the candidates according to recognition performance, so that the structure of the second speech recognition model is freed from the limitations of human experience. In addition, the second speech recognition model includes an attention network, so the attention mechanism can be used to improve its recognition performance when it performs speech recognition.
Fig. 8 is a flowchart of a speech recognition method according to an embodiment of the present application. Referring to fig. 8, the embodiment includes:
801. The electronic device acquires a first speech recognition model, where the first speech recognition model includes a plurality of networks whose interconnections are not yet determined, and the plurality of networks include an input network, an attention network, and an output network.
802. The electronic device selects at least one feature extraction unit from a plurality of feature extraction units at least once, and connects the at least one feature extraction unit with the plurality of networks in the first speech recognition model in at least two connection modes to obtain at least two candidate speech recognition models.
Optionally, at least one feature extraction unit selected each time by the electronic device is different, and therefore, by connecting the at least one feature extraction unit selected each time with the plurality of networks in the first speech recognition model according to at least two connection modes, the electronic device can obtain a plurality of different candidate speech recognition models.
In one possible implementation, the electronic device selects at least one feature extraction unit from the plurality of feature extraction units as follows: the electronic device selects an arbitrary number from a second number range, and then selects that number of feature extraction units from the plurality of feature extraction units. Optionally, the second number range is any number range, for example 1 to 5, which is not limited in the embodiments of this application.
In one possible implementation, the electronic device selects the chosen number of feature extraction units from the plurality of feature extraction units as follows: the electronic device determines a plurality of unit sets corresponding to that number, where each unit set includes that number of feature extraction units, and selects every feature extraction unit in any one of those unit sets. Each unit set corresponds to one combination of feature extraction units.
In the embodiment of the application, each unit set corresponds to one combination form of the feature extraction units, so that the feature extraction units are selected by using the unit sets, the combination forms of the feature extraction units selected each time can be different, and the alternative speech recognition models constructed based on the selected feature extraction units are different in structure.
The connection modes between the at least one feature extraction unit and the plurality of networks in the first speech recognition model include bi-chain-typed, chain-typed, and densely-connected, which is not limited in the embodiments of this application.
In the embodiment of the application, multiple connection modes between the feature extraction unit and the network in the first speech recognition model are provided, so that after the at least one feature extraction unit is obtained, the multiple networks in the first speech recognition model and the selected at least one feature extraction unit can be connected in multiple connection modes, and multiple alternative speech recognition models with different structures are obtained, the number of the alternative speech recognition models is increased, and a second speech recognition model with higher recognition performance can be conveniently selected from the alternative speech recognition models.
In one possible implementation, before connecting at least one feature extraction unit with the plurality of networks in the first speech recognition model at least once in at least two connection modes to obtain at least two candidate speech recognition models, the electronic device first acquires the feature extraction units: the electronic device obtains at least one feature extraction unit based on a plurality of feature extraction networks, where each acquired feature extraction unit includes at least one feature extraction network.
In one possible implementation manner, the electronic device obtains at least one feature extraction unit based on a plurality of feature extraction networks, and includes: the electronic equipment selects a feature extraction network from the first network set, and determines the feature extraction network as a feature extraction unit; or the electronic equipment selects at least two feature extraction networks from the first network set, and connects the at least two feature extraction networks to obtain a feature extraction unit; wherein the first network set comprises a plurality of candidate feature extraction networks.
In the embodiment of the present application, the feature extraction unit is obtained based on the feature extraction network, that is, the internal structure of the feature extraction unit is searched, that is, when the structure of the speech recognition model is searched, the embodiment of the present application searches not only the macrostructure of the speech recognition model but also the microstructure inside the feature extraction unit, so that the candidate speech recognition models with richer structure types can be obtained, and the second speech recognition model with high speech recognition performance can be conveniently selected.
In one possible implementation, the electronic device selects at least two feature extraction networks from the first network set as follows: the electronic device selects an arbitrary number, not less than 2, from the first number range, and then selects that number of feature extraction networks from the first network set.
In one possible implementation, the selecting the number of feature extraction networks from the first network set includes: the electronic equipment determines a plurality of second network sets corresponding to the number from the first network set, wherein each second network set corresponding to the number comprises the number of feature extraction networks; the electronic equipment selects each feature extraction network in a second network set corresponding to the quantity. It should be noted that the implementation of selecting at least one feature extraction network from the first set of networks is described in step 302 and will not be described in detail here.
In one possible implementation, after the electronic device selects at least two feature extraction networks from the first network set, it connects them to obtain feature extraction units as follows: the electronic device connects the at least two feature extraction networks in at least two connection modes to obtain at least two feature extraction units.
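As an illustration, the same two selected networks wired in a chain-typed and a densely-connected manner already yield two differently structured feature extraction units; summing the earlier outputs in the dense wiring is an assumption of the sketch (concatenation is another common choice):

    import torch
    import torch.nn as nn

    class DenselyConnectedUnit(nn.Module):
        # Densely-connected wiring: each network receives the (summed) outputs
        # of all networks before it, plus the unit input.
        def __init__(self, ops):
            super().__init__()
            self.ops = nn.ModuleList(ops)
        def forward(self, x):
            outputs = [x]
            for op in self.ops:
                outputs.append(op(sum(outputs)))
            return outputs[-1]

    conv1 = nn.Conv2d(8, 8, 1)
    conv3 = nn.Conv2d(8, 8, 3, padding=1)
    chain_unit = nn.Sequential(conv1, conv3)             # chain-typed wiring
    dense_unit = DenselyConnectedUnit([conv1, conv3])    # densely-connected wiring
    x = torch.randn(1, 8, 10, 10)
    print(chain_unit(x).shape, dense_unit(x).shape)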
In the embodiment of the application, after at least two feature extraction networks are selected from the first network set, the at least two feature extraction networks are connected in at least two connection modes, so that a plurality of feature extraction units with different structures can be obtained, the structure types of the feature extraction units are expanded, the structures of the voice recognition models are searched based on the plurality of feature extraction units, the number of the alternative voice recognition models is expanded, and the second voice recognition model with higher recognition performance can be conveniently selected from the alternative voice recognition models.
In one possible implementation, the connection modes between the at least two feature extraction networks include bi-chain-typed, chain-typed, and densely-connected.
In the embodiment of the application, connection modes among multiple feature extraction networks are provided, so that after at least two feature extraction networks are obtained, the at least two feature extraction networks can be connected in multiple connection modes to obtain multiple feature extraction units with different structures, the structure types of the feature extraction units are expanded, the structures of the voice recognition models are searched based on the multiple feature extraction units, the number of the alternative voice recognition models is expanded, and a second voice recognition model with higher recognition performance is convenient to select from the alternative voice recognition models.
803. The electronic device determines a recognition performance of each alternative speech recognition model.
In one possible implementation, the electronic device determines the recognition performance of each candidate speech recognition model as follows: the electronic device obtains a test set that includes first sample speech and the first sample text corresponding to it, recognizes the first sample speech with each candidate speech recognition model, and determines the recognition performance of each candidate from the recognized text and the first sample text.
In the embodiment of the present application, the recognition performance of each alternative speech recognition model is determined using the test set, so that a second speech recognition model with good speech recognition performance can be selected from the alternatives, which guarantees the recognition performance of speech recognition performed with the second speech recognition model.
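The patent does not name a specific metric, but word error rate (WER) is the usual choice for this kind of test-set evaluation. The sketch below computes it from the edit distance between the recognized text and the first sample text; `model.recognize` returning a text string is an assumed interface.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution
            prev = cur
    return dp[-1]

def recognition_performance(model, test_set):
    """Average word error rate over (first_sample_voice, first_sample_text)
    pairs; lower means better recognition performance."""
    errors = total = 0
    for voice, text in test_set:
        ref, hyp = text.split(), model.recognize(voice).split()
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return errors / max(total, 1)
```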
In one possible implementation, before the electronic device determines the recognition performance of each alternative speech recognition model, each alternative speech recognition model is trained as follows: the electronic device obtains a first training set, where the first training set includes second sample voices and the second sample texts corresponding to the second sample voices; the electronic device then recognizes the second sample voices based on each alternative speech recognition model, and trains each alternative speech recognition model according to the error between the recognized text and the second sample text.
In the embodiment of the present application, each alternative speech recognition model is trained with the first training set before its recognition performance is determined, so that when a second speech recognition model is then selected based on recognition performance, a model with strong learning and generalization capability can be chosen.
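A single training update of this kind might look as follows in PyTorch. CTC loss is only a stand-in for the unspecified "error between the recognized text and the second sample text", and the model and optimizer interfaces are assumptions of the sketch.

```python
import torch.nn.functional as F

def train_step(model, optimizer, sample_voice, target_ids,
               input_lengths, target_lengths):
    """One update on one alternative speech recognition model."""
    optimizer.zero_grad()
    # Assumed model output: (T, batch, vocab) scores over output tokens.
    log_probs = model(sample_voice).log_softmax(dim=-1)
    # CTC loss as an assumed realization of the text-vs-text error.
    loss = F.ctc_loss(log_probs, target_ids, input_lengths, target_lengths)
    loss.backward()
    optimizer.step()
    return loss.item()
```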
It should be noted that the implementation of determining the recognition performance of each alternative speech recognition model through the test set, and of training each alternative speech recognition model through the first training set, is described in step 303 and is not repeated here.
804. The electronic device selects, according to the recognition performance of the at least two alternative speech recognition models, a second speech recognition model for speech recognition from among them.
In one possible implementation, after the electronic device, in response to obtaining the at least two alternative speech recognition models, selects the second speech recognition model from among them according to their recognition performance, the method further includes: in response to a selection operation on a first feature extraction unit in the second speech recognition model, the electronic device creates a second feature extraction unit identical to the first feature extraction unit; the electronic device then adds the second feature extraction unit to the second speech recognition model and connects it with the first feature extraction unit to obtain an updated second speech recognition model. Optionally, the first feature extraction unit is any feature extraction unit in the second speech recognition model.
Optionally, the electronic device adds the second feature extraction unit to the second speech recognition model and connects it with the first feature extraction unit as follows: the electronic device inserts the second feature extraction unit between the first feature extraction unit and the other networks or units, so that the inserted second feature extraction unit is connected to the networks or units above and below it in the same way the first feature extraction unit was connected before the insertion. Of course, the electronic device can also add the second feature extraction unit to the second speech recognition model in other ways, which is not limited in this embodiment. Optionally, any number of second feature extraction units may be added, each in the same manner.
In the embodiment of the present application, adding a feature extraction unit identical to an existing one to the obtained second speech recognition model increases the depth of the second speech recognition model and thereby further improves its recognition performance.
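A minimal sketch of this depth-expansion step, assuming the second speech recognition model keeps its feature extraction units in an ordered list (`model.units` is a hypothetical attribute):

```python
import copy

def deepen_model(model, index):
    """Duplicate the user-selected feature extraction unit and insert the
    copy directly after it, so the copy inherits the same upstream and
    downstream connections as the original."""
    first_unit = model.units[index]
    second_unit = copy.deepcopy(first_unit)   # identical structure (and weights)
    model.units.insert(index + 1, second_unit)
    return model
```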
805. The electronic device performs speech recognition based on the second speech recognition model.
In one possible implementation, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, meaning that the channel dimension of the speech feature is C, the time dimension is T, and the frequency dimension is F, where C, T, and F are positive integers. The process of performing speech recognition based on the attention network then includes: the electronic device transforms the shape of the speech feature to T×Z, so that the transformed speech feature no longer contains separate channel and frequency dimensions and the feature size at each time step is Z, where Z is the product of C and F; the electronic device determines the attention weights corresponding to the transformed speech feature, weights the transformed speech feature with these attention weights, restores the shape of the weighted speech feature to C×T×F, and outputs the shape-restored speech feature.
In the embodiment of the present application, the shape of the speech feature is transformed by the attention network so that the transformed feature no longer separates the channel and frequency dimensions. The attention weights generated from the transformed feature are therefore not limited to a single channel but can exploit the correlation between channels, which makes the generated weights more accurate, improves the accuracy of the speech feature output by the attention network, and improves speech recognition performance.
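The reshape-attend-restore flow can be sketched as a PyTorch module. The sigmoid-gated linear layer that produces the attention weights is an assumption, since the embodiment only states that weights are determined from the transformed feature.

```python
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Flatten C×T×F features to T×Z (Z = C·F), attend across the merged
    channel-frequency axis, then restore the original shape."""

    def __init__(self, channels, freq_bins):
        super().__init__()
        z = channels * freq_bins
        self.score = nn.Linear(z, z)   # assumed attention-weight generator

    def forward(self, x):                                  # x: (B, C, T, F)
        b, c, t, f = x.shape
        z = x.permute(0, 2, 1, 3).reshape(b, t, c * f)     # -> (B, T, Z)
        weights = torch.sigmoid(self.score(z))             # attention weights
        z = z * weights                                    # weighting
        return z.reshape(b, t, c, f).permute(0, 2, 1, 3)   # restore (B, C, T, F)
```

A feature of shape (batch, 16, 100, 40), for example, passes through with its shape unchanged while every channel-frequency pair is reweighted using information from all channels.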
In one possible implementation, before the electronic device performs speech recognition based on the second speech recognition model, the method further includes: the electronic device obtains a second training set, where the second training set includes third sample voices and the third sample texts corresponding to the third sample voices; the electronic device then recognizes the third sample voices based on the second speech recognition model, and trains the second speech recognition model according to the error between the recognized text and the third sample text. The process of training the second speech recognition model with the second training set is the same as the process of training the alternative speech recognition models with the first training set and is not repeated here.
In the embodiment of the present application, after the second speech recognition model is obtained through searching, it is trained with the second training set, which improves its generalization capability and thus its recognition performance.
In the embodiment of the present application, the structure of the speech recognition model is not designed manually by a user. Instead, multiple alternative speech recognition models are created automatically by connecting at least one feature extraction unit with the multiple existing networks in the first speech recognition model according to multiple connection modes, and the required second speech recognition model is then selected from the alternatives according to recognition performance, so that the structure of the resulting second speech recognition model is free from the limitations of manual experience. In addition, since the second speech recognition model includes an attention network, the attention mechanism can be used to improve recognition performance when the second speech recognition model performs speech recognition.
In the embodiment of the present application, multiple connection modes between the feature extraction unit and the networks in the first speech recognition model are provided, so that after the at least one feature extraction unit is obtained, the multiple networks in the first speech recognition model and the selected feature extraction units can be connected in multiple modes, yielding multiple alternative speech recognition models with different structures. This increases the number of alternatives and makes it easier to select a second speech recognition model with higher recognition performance.
In the embodiment of the present application, each unit set corresponds to one combination of feature extraction units, so selecting feature extraction units through unit sets allows a different combination to be selected each time, and the alternative speech recognition models constructed from the selected units therefore differ in structure.
In the embodiment of the present application, the feature extraction unit is built from feature extraction networks, i.e., the internal structure of the feature extraction unit is itself searched. When searching the structure of the speech recognition model, the embodiment therefore searches not only the macro structure of the model but also the micro structure inside the feature extraction unit, so alternative speech recognition models with richer structure types can be obtained, making it easier to select a second speech recognition model with high recognition performance.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 9 is a block diagram of a speech recognition apparatus according to an embodiment of the present application. Referring to fig. 9, the embodiment includes:
a model obtaining module 91, configured to obtain a first speech recognition model, where the first speech recognition model includes an input network, a first feature extraction unit, and an output network, where connection modes between the input network, the first feature extraction unit, and the output network are determined, and the first feature extraction unit includes an attention network;
a network adding module 92, configured to add at least one feature extraction network to the first feature extraction unit at least once, and connect with the attention network to obtain an alternative speech recognition model;
and the model selecting module 93 is configured to, in response to obtaining the at least two candidate speech recognition models, select, according to the recognition performance of the at least two candidate speech recognition models, a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models.
In one possible implementation, the model obtaining module 91 is configured to connect the plurality of first feature extraction units in a bi-chain-typed, chain-typed, or densely-connected manner to obtain a unit chain, and to connect the input network and the output network at the two ends of the unit chain respectively to obtain the first speech recognition model.
In one possible implementation, the network adding module 92 is configured to add the at least one feature extraction network to the first feature extraction unit in different manners and connect it with the attention network to obtain different alternative speech recognition models.
In one possible implementation, the network adding module 92 is configured to add the at least one feature extraction network to the first feature extraction unit and connect the feature extraction network and the attention network in a bi-chain-typed, chain-typed, or densely-connected manner to obtain the alternative speech recognition model.
In one possible implementation manner, the first speech recognition model comprises a plurality of first feature extraction units, and the connection manner among the plurality of first feature extraction units is determined; the connection mode of the plurality of networks in the first feature extraction unit is different from the connection mode between the plurality of first feature extraction units.
In one possible implementation, the first speech recognition model includes N-1 first feature extraction units and N unit groups, where each unit group includes M second feature extraction units, N is an integer greater than 1, M is a positive integer, and the second feature extraction units do not include an attention network. The networks in the first speech recognition model are connected as follows: the two ends of the first speech recognition model are the input network and the output network, one unit group follows the input network, one unit group precedes the output network, and one first feature extraction unit is connected between every two unit groups.
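This macro layout can be expressed as a small builder; representing unit groups and units as lists of callables is an assumption of the sketch.

```python
def build_first_model(input_net, unit_groups, first_units, output_net):
    """Interleave N unit groups with N-1 first feature extraction units:
    input -> group_1 -> unit_1 -> group_2 -> ... -> unit_{N-1} -> group_N -> output."""
    assert len(unit_groups) == len(first_units) + 1   # N groups, N-1 units
    layers = [input_net]
    for group, unit in zip(unit_groups[:-1], first_units):
        layers.extend(group)   # each group holds M second feature extraction units
        layers.append(unit)    # attention-bearing first feature extraction unit
    layers.extend(unit_groups[-1])
    layers.append(output_net)
    return layers
```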
In one possible implementation, referring to fig. 10, the apparatus further includes:
a model updating module 94 for creating a fourth feature extraction unit identical to the third feature extraction unit in response to a selection operation of the third feature extraction unit in the second speech recognition model; and adding the fourth feature extraction unit into the second voice recognition model, and connecting the fourth feature extraction unit with the third feature extraction unit to obtain the updated second voice recognition model.
In one possible implementation, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, meaning that the channel dimension of the speech feature is C, the time dimension is T, and the frequency dimension is F, where C, T, and F are positive integers;
the process of performing speech recognition based on the attention network includes the following steps:
transforming the shape of the speech feature to T×Z, so that the transformed speech feature no longer contains separate channel and frequency dimensions, and the feature size at each time step is Z, where Z is the product of C and F;
and determining the attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature with the attention weights, restoring the shape of the weighted speech feature to C×T×F, and outputting the shape-restored speech feature.
In one possible implementation, referring to fig. 10, the network adding module 92 includes:
a network selection sub-module 921 for selecting at least one feature extraction network from the first network set at least once;
a network adding submodule 922, configured to add at least one feature extraction network to the first feature extraction unit, and connect with the attention network to obtain an alternative speech recognition model;
wherein the first network set comprises a plurality of candidate feature extraction networks.
In a possible implementation, the network selection sub-module 921, referring to fig. 10, includes:
a number selecting unit 9211 configured to select any one number from the first number range;
a network selecting unit 9212, configured to select that number of feature extraction networks from the first network set.
In one possible implementation, the first network set includes a plurality of different second network sets, and the network selecting unit 9212 is configured to determine, from the first network set, a plurality of second network sets corresponding to the number, where each such second network set includes that number of feature extraction networks; and to select each feature extraction network in one of the second network sets corresponding to the number.
In a possible implementation manner, the first speech recognition model further comprises a second feature extraction unit, the second feature extraction unit does not comprise an attention network, and the connection manner among the input network, the first feature extraction unit, the second feature extraction unit and the output network is determined; the network adding module 92 is further configured to add at least one feature extraction network to the second feature extraction unit at least once to obtain the candidate speech recognition model.
In one possible implementation, referring to fig. 10, the apparatus further includes:
a performance determining module 95, configured to obtain a test set, where the test set includes first sample voices and the first sample texts corresponding to the first sample voices; to recognize the first sample voices based on each alternative speech recognition model; and to determine the recognition performance of each alternative speech recognition model by comparing the recognized text with the first sample text.
In one possible implementation, referring to fig. 10, the apparatus further includes:
a first training module 96, configured to obtain a first training set, where the first training set includes a second sample voice and a second sample text corresponding to the second sample voice; and respectively recognizing the second sample voice based on each alternative voice recognition model, and training each alternative voice recognition model according to the error between the text obtained by recognition and the second sample text.
In one possible implementation, referring to fig. 10, the apparatus further includes:
and a speech recognition module 97, configured to perform speech recognition based on the second speech recognition model.
In one possible implementation, the apparatus further includes:
a second training module 98, configured to obtain a second training set, where the second training set includes third sample voices and the third sample texts corresponding to the third sample voices; to recognize the third sample voices based on the second speech recognition model; and to train the second speech recognition model according to the error between the recognized text and the third sample text.
In the embodiment of the present application, the structure of the speech recognition model is not designed manually by a user. Instead, multiple alternative speech recognition models are created automatically by adding feature extraction networks to the first speech recognition model, and the required second speech recognition model is then selected from the alternatives according to recognition performance, so that the structure of the resulting second speech recognition model is free from the limitations of manual experience. In addition, since the second speech recognition model includes an attention network, the attention mechanism can be used to improve recognition performance when the second speech recognition model performs speech recognition.
Fig. 11 is a block diagram of a speech recognition apparatus according to an embodiment of the present application. Referring to fig. 11, the embodiment includes:
the model obtaining module 111 is configured to obtain a first speech recognition model, where the first speech recognition model includes multiple networks, and a connection manner between the multiple networks is not determined, and the multiple networks include an input network, an attention network, and an output network;
a unit connection module 112, configured to connect at least one feature extraction unit with multiple networks in the first speech recognition model according to at least two connection manners at least once, so as to obtain at least two alternative speech recognition models;
and the model selecting module 113 is configured to select a second speech recognition model for performing speech recognition from the at least two candidate speech recognition models according to the recognition performance of the at least two candidate speech recognition models.
In one possible implementation, the connection modes include bi-chain-typed (double-chain), chain-typed, and densely-connected.
In one possible implementation, referring to fig. 12, the apparatus further includes:
a unit obtaining module 114, configured to obtain at least one feature extraction unit based on a plurality of feature extraction networks, where each obtained feature extraction unit includes at least one feature extraction network.
In one possible implementation, the unit connection module 112, referring to fig. 12, includes:
a unit selecting submodule 1121 configured to select at least one feature extraction unit from the plurality of feature extraction units at least once;
the unit connection sub-module 1122 is configured to connect the multiple networks in the first speech recognition model with the selected at least one feature extraction unit according to at least two connection manners, so as to obtain at least one candidate speech recognition model.
In one possible implementation, the unit selection sub-module 1121, referring to fig. 12, includes:
a first number selecting unit 11211 configured to select any one number from the second number range;
a unit selecting unit 11212, configured to select that number of feature extraction units from the plurality of feature extraction units.
In one possible implementation, the unit selecting unit 11212 is configured to determine a plurality of unit sets corresponding to the number, where each unit set includes that number of feature extraction units; and to select each feature extraction unit in any one unit set.
In one possible implementation, the unit obtaining module 114, referring to fig. 12, includes:
a first unit obtaining sub-module 1141, configured to select one feature extraction network from the first network set and determine the feature extraction network as a feature extraction unit; or,
a second unit obtaining sub-module 1142, configured to select at least two feature extraction networks from the first network set, and connect the at least two feature extraction networks to obtain a feature extraction unit;
wherein the first network set comprises a plurality of candidate feature extraction networks.
In one possible implementation, the second unit obtaining sub-module 1142 includes:
a second number selecting unit 11421, configured to select any number from a first number range, where every number in the first number range is not less than 2;
a network selecting unit 11422, configured to select that number of feature extraction networks from the first network set.
In one possible implementation, the first network set includes a plurality of different second network sets, and the network selecting unit 11422 is configured to determine, from the first network set, a plurality of second network sets corresponding to the number, where each such second network set includes that number of feature extraction networks; and to select each feature extraction network in one of the second network sets corresponding to the number.
In a possible implementation manner, the second unit obtaining sub-module 1142 is configured to connect at least two feature extraction networks in at least two connection manners, so as to obtain at least two feature extraction units.
In one possible implementation, the connection mode between the at least two feature extraction networks is bi-chain-typed, chain-typed, or densely-connected.
In one possible implementation, referring to fig. 12, the apparatus further includes:
a performance determining module 115, configured to obtain a test set, where the test set includes first sample voices and the first sample texts corresponding to the first sample voices; to recognize the first sample voices based on each alternative speech recognition model; and to determine the recognition performance of each alternative speech recognition model by comparing the recognized text with the first sample text.
In one possible implementation, referring to fig. 12, the apparatus further includes:
a first training module 116, configured to obtain a first training set, where the first training set includes a second sample voice and a second sample text corresponding to the second sample voice; and respectively recognizing the second sample voice based on each alternative voice recognition model, and training each alternative voice recognition model according to the error between the text obtained by recognition and the second sample text.
In one possible implementation, referring to fig. 12, the apparatus further includes:
and a speech recognition module 117, configured to perform speech recognition based on the second speech recognition model.
In one possible implementation, the apparatus further includes:
a second training module 118, configured to obtain a second training set, where the second training set includes third sample voices and the third sample texts corresponding to the third sample voices; to recognize the third sample voices based on the second speech recognition model; and to train the second speech recognition model according to the error between the recognized text and the third sample text.
In one possible implementation, the apparatus further includes:
a model update module 119 for creating a second feature extraction unit identical to the first feature extraction unit in response to a selection operation of the first feature extraction unit in the second speech recognition model; and adding the second feature extraction unit into the second speech recognition model, and connecting the second feature extraction unit with the first feature extraction unit to obtain an updated second speech recognition model.
In one possible implementation, in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, meaning that the channel dimension of the speech feature is C, the time dimension is T, and the frequency dimension is F, where C, T, and F are positive integers;
the process of performing speech recognition based on the attention network includes the following steps:
transforming the shape of the speech feature to T×Z, so that the transformed speech feature no longer contains separate channel and frequency dimensions, and the feature size at each time step is Z, where Z is the product of C and F;
and determining the attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature with the attention weights, restoring the shape of the weighted speech feature to C×T×F, and outputting the shape-restored speech feature.
In the embodiment of the present application, the structure of the speech recognition model is not designed manually by a user. Instead, multiple alternative speech recognition models are created automatically by connecting at least one feature extraction unit with the multiple existing networks in the first speech recognition model according to multiple connection modes, and the required second speech recognition model is then selected from the alternatives according to recognition performance, so that the structure of the resulting second speech recognition model is free from the limitations of manual experience. In addition, since the second speech recognition model includes an attention network, the attention mechanism can be used to improve recognition performance when the second speech recognition model performs speech recognition.
It should be noted that when the speech recognition apparatus provided in the above embodiments performs speech recognition, only the division into the above functional modules is illustrated; in practical applications, the functions may be distributed to different functional modules as needed, i.e., the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition apparatus and the speech recognition method provided by the above embodiments belong to the same concept; their specific implementation is detailed in the method embodiments and is not repeated here.
The embodiment of the present application further provides an electronic device, where the electronic device includes a processor and a memory, where the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor, so as to implement the operations performed in the voice recognition method of the foregoing embodiment.
Optionally, the electronic device is provided as a terminal. Fig. 13 shows a block diagram of a terminal 1300 according to an exemplary embodiment of the present application. The terminal 1300 may be a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 1300 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
Terminal 1300 includes: a processor 1301 and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core or an 8-core processor. The processor 1301 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). In some embodiments, processor 1301 may further include an AI (Artificial Intelligence) processor for handling computational operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1302 is used to store at least one computer program for execution by the processor 1301 to implement the speech recognition methods provided by the method embodiments herein.
In some embodiments, terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. Processor 1301, memory 1302, and peripheral interface 1303 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 1303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of an audio circuit 1304 and a power supply 1305.
Peripheral interface 1303 may be used to connect at least one peripheral associated with I/O (Input/Output) to processor 1301 and memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
Audio circuitry 1304 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment, converts them into electrical signals, and inputs them to the processor 1301 for processing, or to a radio frequency circuit for voice communication. For stereo capture or noise reduction, multiple microphones may be provided at different locations of terminal 1300; the microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker converts electrical signals from the processor 1301 or the radio frequency circuit into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, audio circuitry 1304 may also include a headphone jack.
Power supply 1305 is used to provide power to various components in terminal 1300. The power supply 1305 may be alternating current, direct current, disposable or rechargeable. When the power supply 1305 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in fig. 13 does not limit terminal 1300; the terminal may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
Optionally, the electronic device is provided as a server. Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1400 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 1401 and one or more memories 1402, where the memory 1402 stores at least one computer program that is loaded and executed by the processor 1401 to implement the speech recognition methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may include other components for implementing device functions, which are not described here.
The embodiment of the present application further provides a computer-readable storage medium, where at least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor to implement the operations performed in the speech recognition method of the foregoing embodiment.
Embodiments of the present application also provide a computer program product or a computer program, which includes a computer program, and the computer program is stored in a computer readable storage medium. The processor of the electronic device reads the computer program from the computer-readable storage medium, and executes the computer program, so that the electronic device performs the operations performed in the voice recognition method in the various alternative implementations described above.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The present application is intended to cover various modifications, alternatives, and equivalents, which may be included within the spirit and scope of the present application.

Claims (15)

1. A method of speech recognition, the method comprising:
acquiring a first voice recognition model, wherein the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection modes among the input network, the first feature extraction unit and the output network are determined, and the first feature extraction unit comprises an attention network;
adding at least one feature extraction network to the first feature extraction unit at least once, and connecting with the attention network to obtain an alternative voice recognition model;
and in response to obtaining at least two alternative speech recognition models, selecting, according to the recognition performances of the at least two alternative speech recognition models, a second speech recognition model for speech recognition from the at least two alternative speech recognition models.
2. The method of claim 1, wherein obtaining the first speech recognition model comprises:
connecting a plurality of first feature extraction units in a bi-chain-typed connection mode, a chain-typed connection mode, or a densely-connected connection mode to obtain a unit chain;
and respectively connecting the input network and the output network at two ends of the unit chain to obtain the first voice recognition model.
3. The method of claim 1, wherein said at least once adding at least one feature extraction network to said first feature extraction unit and connecting with said attention network to obtain an alternative speech recognition model comprises:
and adding the at least one feature extraction network to the first feature extraction unit in different modes, and connecting the at least one feature extraction network with the attention network to obtain different alternative speech recognition models.
4. The method of claim 1, wherein adding at least one feature extraction network to the first feature extraction unit and connecting with the attention network to obtain an alternative speech recognition model comprises:
and adding the at least one feature extraction network into the first feature extraction unit, and connecting the at least one feature extraction network with the attention network in a bi-chain-typed connection mode, a chain-typed connection mode, or a densely-connected connection mode to obtain the alternative speech recognition model.
5. The method according to claim 1, wherein the first speech recognition model comprises a plurality of the first feature extraction units, and a connection mode between the plurality of the first feature extraction units is determined; the connection mode of the multiple networks in a first feature extraction unit is different from the connection mode between the plurality of first feature extraction units.
6. The method according to claim 1, wherein the first speech recognition model comprises N-1 first feature extraction units and N unit groups, each unit group comprises M second feature extraction units, N is an integer greater than 1, M is a positive integer, the second feature extraction units do not include the attention network, and the network in the first speech recognition model is connected in a manner that: the two ends of the first speech recognition model are the input network and the output network, one unit group is connected behind the input network, one unit group is connected in front of the output network, and one first feature extraction unit is connected between every two unit groups.
7. The method according to any of claims 1-6, wherein after, in response to obtaining at least two candidate speech recognition models, selecting a second speech recognition model for speech recognition from the at least two candidate speech recognition models according to the recognition performance of the at least two candidate speech recognition models, the method further comprises:
creating a fourth feature extraction unit identical to the third feature extraction unit in response to a selection operation of the third feature extraction unit in the second speech recognition model;
and adding the fourth feature extraction unit into the second voice recognition model, and connecting the fourth feature extraction unit with the third feature extraction unit to obtain the updated second voice recognition model.
8. The method according to any one of claims 1 to 6, wherein in the process of performing speech recognition based on the second speech recognition model, the shape of the speech feature input to the attention network is C×T×F, which indicates that the channel dimension of the speech feature is C, the time dimension is T, and the frequency dimension is F, where C, T, and F are positive integers;
the process of performing voice recognition based on the attention network includes:
transforming the shape of the speech feature to T×Z, so that the transformed speech feature no longer contains separate channel and frequency dimensions, and the feature size at each time step is Z, where Z is the product of C and F;
determining the attention weights corresponding to the speech feature based on the transformed speech feature, weighting the transformed speech feature with the attention weights, restoring the shape of the weighted speech feature to C×T×F, and outputting the shape-restored speech feature.
9. A method of speech recognition, the method comprising:
acquiring a first voice recognition model, wherein the first voice recognition model comprises a plurality of networks, the connection modes among the plurality of networks are not determined, and the plurality of networks comprise an input network, an attention network and an output network;
at least one feature extraction unit is connected with a plurality of networks in the first voice recognition model according to at least two connection modes at least once to obtain at least two alternative voice recognition models;
and selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models.
10. The method of claim 9, wherein the connection modes comprise bi-chain-typed, chain-typed, and densely-connected.
11. The method according to claim 9, wherein before said at least once connecting at least one feature extraction unit to a plurality of networks in the first speech recognition model in at least two connection modes to obtain at least two candidate speech recognition models, the method comprises:
at least one feature extraction unit is obtained based on a plurality of feature extraction networks, each obtained feature extraction unit comprising at least one feature extraction network.
12. A speech recognition apparatus, characterized in that the apparatus comprises:
the model acquisition module is used for acquiring a first voice recognition model, the first voice recognition model comprises an input network, a first feature extraction unit and an output network, the connection mode among the input network, the first feature extraction unit and the output network is determined, and the first feature extraction unit comprises an attention network;
the network adding module is used for adding at least one feature extraction network to the first feature extraction unit at least once and connecting with the attention network to obtain an alternative voice recognition model;
and the model selection module is used for responding to the obtained at least two alternative voice recognition models and selecting a second voice recognition model for voice recognition from the at least two alternative voice recognition models according to the recognition performance of the at least two alternative voice recognition models.
13. A speech recognition apparatus, characterized in that the apparatus comprises:
the model acquisition module is used for acquiring a first voice recognition model, wherein the first voice recognition model comprises a plurality of networks, the connection modes among the plurality of networks are not determined, and the plurality of networks comprise an input network, an attention network and an output network;
the network connection module is used for connecting at least one feature extraction unit with a plurality of networks in the first voice recognition model according to at least two connection modes at least once to obtain at least two alternative voice recognition models;
and the model selection module is used for selecting a second speech recognition model for speech recognition from the at least two alternative speech recognition models according to the recognition performance of the at least two alternative speech recognition models.
14. An electronic device, comprising a processor and a memory, wherein at least one computer program is stored in the memory, and wherein the computer program is loaded and executed by the processor to perform the operations performed by the speech recognition method according to any of claims 1 to 11.
15. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor to perform the operations performed by the speech recognition method according to any one of claims 1 to 11.
CN202110668257.4A 2021-06-16 2021-06-16 Speech recognition method, device, equipment and storage medium Active CN113838466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110668257.4A CN113838466B (en) 2021-06-16 2021-06-16 Speech recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110668257.4A CN113838466B (en) 2021-06-16 2021-06-16 Speech recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113838466A true CN113838466A (en) 2021-12-24
CN113838466B CN113838466B (en) 2024-02-06

Family

ID=78962673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110668257.4A Active CN113838466B (en) 2021-06-16 2021-06-16 Speech recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113838466B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231066A (en) * 2016-12-13 2018-06-29 财团法人工业技术研究院 Speech recognition system and method thereof and vocabulary establishing method
CN110189748A (en) * 2019-05-31 2019-08-30 百度在线网络技术(北京)有限公司 Model building method and device
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
US20200327884A1 (en) * 2019-04-12 2020-10-15 Adobe Inc. Customizable speech recognition system
CN111862956A (en) * 2020-07-27 2020-10-30 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN111968635A (en) * 2020-08-07 2020-11-20 北京小米松果电子有限公司 Speech recognition method, device and storage medium
CN112562648A (en) * 2020-12-10 2021-03-26 平安科技(深圳)有限公司 Adaptive speech recognition method, apparatus, device and medium based on meta learning

Also Published As

Publication number Publication date
CN113838466B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN105976812B (en) A kind of audio recognition method and its equipment
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
JP2022529641A (en) Speech processing methods, devices, electronic devices and computer programs
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
KR102276951B1 (en) Output method for artificial intelligence speakers based on emotional values calculated from voice and face
CN107316635B (en) Voice recognition method and device, storage medium and electronic equipment
CN111344717B (en) Interactive behavior prediction method, intelligent device and computer readable storage medium
CN115602165B (en) Digital employee intelligent system based on financial system
CN111357051A (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN113314119A (en) Voice recognition intelligent household control method and device
CN112183107A (en) Audio processing method and device
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
CN112289338A (en) Signal processing method and device, computer device and readable storage medium
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN117059068A (en) Speech processing method, device, storage medium and computer equipment
CN115116458B (en) Voice data conversion method, device, computer equipment and storage medium
CN113838466B (en) Speech recognition method, device, equipment and storage medium
CN113643706B (en) Speech recognition method, device, electronic equipment and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113724690A (en) PPG feature output method, target audio output method and device
CN113744759A (en) Tone template customizing method and device, equipment, medium and product thereof
CN113241092A (en) Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
CN111210812A (en) Artificial intelligence pronunciation transit system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant