CN113823262A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN113823262A
Authority
CN
China
Prior art keywords
dialect
network
recognized
voice
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111352684.8A
Other languages
Chinese (zh)
Other versions
CN113823262B (en)
Inventor
颜京豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111352684.8A priority Critical patent/CN113823262B/en
Publication of CN113823262A publication Critical patent/CN113823262A/en
Application granted granted Critical
Publication of CN113823262B publication Critical patent/CN113823262B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of voice recognition, and in particular to a voice recognition method and apparatus, an electronic device and a storage medium. It can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent traffic and assisted driving, and is used to efficiently and accurately realize voice recognition of a multi-dialect target language. The method comprises the following steps: acquiring voice data to be recognized of a target language; extracting the voice acoustic feature corresponding to each frame of voice data in the voice data to be recognized; performing depth feature extraction on the voice acoustic features to obtain the corresponding dialect embedding feature; coding the voice acoustic features to obtain the corresponding acoustic coding feature; and performing dialect voice recognition on the voice data to be recognized based on the dialect embedding feature and the acoustic coding feature to obtain the target text information and the target dialect category corresponding to the voice data to be recognized. By jointly learning from the dialect embedding feature and the acoustic coding feature, the method and the device can efficiently and accurately recognize speech across multiple dialects.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of scientific technology, related services based on voice recognition technology have been widely applied to people's daily life and work, such as smart speakers, vehicle-mounted systems, etc.
In the related art, speech recognition research has mainly focused on general languages with abundant data resources, whose available speech data has gradually grown past ten thousand or even one hundred thousand hours. However, speech recognition for a target language with scarce data resources (such as the Tibetan language) is limited by data resources and the language's propagation range, and has received relatively little research.
In traditional modeling methods, modules such as an acoustic model, a pronunciation dictionary and a language model need to be constructed separately. Because corpus resources of the target language are scarce, it is difficult to record a large amount of target-language speech data; the corpus scale is small, the pronunciation dictionary is difficult to construct, and related research is concentrated on a single dialect. In addition, the coverage and balance of pronunciation phenomena are low, so the recognition rate of an acoustic model trained on such a corpus is also low.
Therefore, how to efficiently and accurately realize speech recognition for a multi-dialect target language is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, which are used for efficiently and accurately realizing voice recognition of a multi-dialect target language.
The voice recognition method provided by the embodiment of the application comprises the following steps:
acquiring voice data to be recognized of a target language, wherein the target language comprises a plurality of dialects;
after the voice data to be recognized is subjected to framing processing, extracting voice acoustic characteristics corresponding to each frame of voice data in the voice data to be recognized;
performing deep feature extraction on the voice acoustic features to obtain dialect embedding features corresponding to the voice data to be recognized; the voice acoustic features are coded to obtain acoustic coding features corresponding to the voice data to be recognized, and the dialect embedding features are used for representing dialect information of a dialect to which the voice data to be recognized belongs;
and carrying out dialect voice recognition on the voice data to be recognized based on the dialect embedding characteristics and the acoustic coding characteristics to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
An embodiment of the present application provides a speech recognition apparatus, including:
the voice acquisition unit is used for acquiring voice data to be recognized of a target language, and the target language comprises a plurality of dialects;
the first extraction unit is used for extracting the voice acoustic characteristics corresponding to each frame of voice data in the voice data to be recognized after the voice data to be recognized is subjected to framing processing;
the second extraction unit is used for extracting the depth features of the voice acoustic features to obtain dialect embedded features corresponding to the voice data to be recognized; the voice acoustic features are coded to obtain acoustic coding features corresponding to the voice data to be recognized, and the dialect embedding features are used for representing dialect information of a dialect to which the voice data to be recognized belongs;
and the voice recognition unit is used for carrying out dialect voice recognition on the voice data to be recognized based on the dialect embedding characteristics and the acoustic coding characteristics to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
Optionally, the second extracting unit is specifically configured to:
based on a dialect recognition network in a trained multi-dialect speech recognition model, performing deep feature extraction on the voice acoustic features to obtain dialect embedding features corresponding to the voice data to be recognized;
the obtaining of the acoustic coding feature corresponding to the voice data to be recognized by coding the voice acoustic feature includes:
and performing dimension-increasing coding on the voice acoustic features based on a voice recognition network in the multi-dialect speech recognition model to obtain the acoustic coding features corresponding to the voice data to be recognized.
Optionally, the speech recognition unit is specifically configured to:
performing high-dimensional feature extraction on the dialect embedding features through a feedforward neural network in the multi-dialect speech recognition model to obtain dialect depth features corresponding to the voice data to be recognized;
inputting the dialect depth feature into the voice recognition network, and combining and splicing the dialect depth feature and the acoustic coding feature based on the voice recognition network to obtain a splicing feature corresponding to the voice data to be recognized;
and predicting based on the splicing characteristics to obtain target text information corresponding to the voice data to be recognized and the target dialect category.
Optionally, the voice recognition network includes a backbone classification sub-network and an auxiliary classification sub-network; the speech recognition unit is specifically configured to:
inputting the splicing features into the trunk classification sub-network and the auxiliary classification sub-network respectively for decoding to obtain candidate results output by the trunk classification sub-network and the auxiliary classification sub-network respectively, wherein each candidate result comprises candidate text information and candidate dialect categories aiming at the voice data to be recognized;
and selecting one from the candidate results, and taking the candidate text information and the candidate dialect category contained in the selected candidate result as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
Optionally, the voice recognition network includes a backbone classification sub-network and an auxiliary classification sub-network; the speech recognition unit is specifically configured to:
inputting the splicing features into the auxiliary classification sub-network for decoding, and obtaining a plurality of candidate results output by the auxiliary classification sub-network, wherein each candidate result comprises candidate text information and a candidate dialect category aiming at the voice data to be recognized;
and respectively inputting each candidate result into the trunk classification sub-network for decoding to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
Optionally, the speech recognition unit is specifically configured to:
inputting each candidate result into the trunk classification sub-network for decoding to obtain an evaluation value of each candidate result;
and selecting one of the candidate results based on the evaluation values, and taking the candidate text information and the candidate dialect category contained in the selected candidate result as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
Optionally, the dialect identifying network includes a depth feature extraction sub-network, a time-sequence pooling sub-network and a dialect classification sub-network; the second extraction unit is specifically configured to:
respectively inputting the voice acoustic features of each frame of voice data into the depth feature extraction sub-network, and acquiring the frame-level depth features corresponding to each frame of voice data;
integrating each frame level depth feature related to time sequence through a time sequence pooling sub-network to obtain a sentence level feature vector with fixed dimensionality corresponding to the voice data to be recognized;
and performing dimensionality reduction processing on the sentence-level feature vector to obtain the dialect embedding feature.
Optionally, the apparatus further comprises:
the model training unit is used for obtaining the multi-dialect speech recognition model through the following training process:
according to training samples in a training sample data set, performing loop iterative training on the multi-dialect speech recognition model, and outputting the multi-dialect speech recognition model after training is completed, wherein the training sample data set comprises training samples corresponding to the various dialects contained in the target language; and wherein the following operations are executed in each loop iteration of training:
selecting training samples from the training sample data set, inputting the selected training samples into the multi-dialect speech recognition model, and acquiring the predicted dialect categories and predicted text information output by the multi-dialect speech recognition model;
and adjusting parameters of the multi-party speech sound recognition model based on the predicted dialect category, the predicted text information and the real text information.
Optionally, the voice recognition network includes a backbone classification sub-network and an auxiliary classification sub-network; the model training unit is specifically configured to:
obtaining a first prediction dialect category output by the dialect identification network; and
and acquiring a second prediction dialect category and prediction text information output by the main classification sub-network and the auxiliary classification sub-network respectively.
Optionally, each training sample includes sample voice data, and a real dialect category and real text information corresponding to the sample voice data;
the model training unit is specifically configured to:
performing parameter adjustment on the dialect recognition network based on a difference between the real dialect category of the sample speech data in the training sample and the first predicted dialect category; and
and adjusting parameters of the speech recognition network based on the difference between the real dialect category and each second prediction dialect category and the difference between the real text information and each prediction text information of the sample speech data in the training sample.
Optionally, the model training unit is specifically configured to:
determining a first loss function corresponding to the backbone classification subnetwork based on a difference between a second predicted dialect class and the real dialect class output by the backbone classification subnetwork and a difference between the predicted text information and the real text information; and
determining a second loss function corresponding to the auxiliary classification subnetwork based on a difference between the second predicted dialect category output by the auxiliary classification subnetwork and the real dialect category and a difference between the predicted text information output by the auxiliary classification subnetwork and the real text information;
and adjusting parameters of the voice recognition network based on the first loss function and the second loss function.
Optionally, the initial network parameters of the speech recognition network are obtained by performing parameter migration based on a standard speech recognition model, and the standard speech recognition model is obtained by performing training based on sample standard speech data.
An electronic device provided by an embodiment of the present application includes a processor and a memory, where the memory stores program codes, and when the program codes are executed by the processor, the processor is caused to execute any one of the steps of the voice recognition method.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of any of the speech recognition methods described above.
An embodiment of the present application provides a computer-readable storage medium, which includes program code for causing an electronic device to perform any one of the steps of the voice recognition method described above when the program product runs on the electronic device.
The beneficial effect of this application is as follows:
the embodiment of the application provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium. According to the speech recognition method in the embodiment of the application, when dialect speech recognition is performed on speech data to be recognized, firstly, speech acoustic features of each frame of speech data in the speech data to be recognized are extracted, then, deep learning and coding are further performed on the basis of the speech acoustic features corresponding to each frame, dialect embedding features used for representing the dialect information of the speech data to be recognized and acoustic coding features used for representing the acoustic features of the speech data to be recognized are respectively obtained, and then, dialect speech recognition is performed on the speech data to be recognized on the basis of the features, so that target text information and target dialect types corresponding to the speech data to be recognized are obtained. The method is based on the characteristics of the target language and the dialect, learning of the dialect embedding characteristics is added, the dialect embedding characteristics and the acoustic coding characteristics are combined, more dialect information is utilized for dialect voice recognition, the dialect type of the voice data to be recognized of the target language is further output while the voice data to be recognized is converted into text information, and the dialect type is not recognized directly based on the voice acoustic characteristics, so that the voice recognition of multiple dialects can be realized efficiently and accurately.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is an alternative schematic diagram of a speech recognition system of the related art;
fig. 2 is an alternative schematic diagram of an application scenario in an embodiment of the present application;
FIG. 3 is a flowchart illustrating an implementation of a speech recognition method according to an embodiment of the present application;
fig. 4 is a flowchart of an implementation of a depth feature extraction method in an embodiment of the present application;
FIG. 5 is a schematic diagram of a sequence-to-sequence end-to-end network structure in an embodiment of the present application;
FIG. 6 is a diagram illustrating a structure of a strong-coupling-based multi-task learning end-to-end model in an embodiment of the present application;
fig. 7 is a schematic diagram of a model structure after dialect embedding is added in the embodiment of the present application;
FIG. 8 is a flow chart of an implementation of a model training method in an embodiment of the present application;
FIG. 9 is a flowchart illustrating an implementation of a model parameter updating method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a Tibetan multi-dialect speech recognition model based on transfer learning in an embodiment of the present application;
fig. 11 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device in an embodiment of the present application;
fig. 13 is a schematic structural diagram of another electronic device in the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
Some concepts related to the embodiments of the present application are described below.
Transfer learning: refers to transferring the parameters of a trained model (pre-trained model) to a new model to help train the new model. Considering that most data or tasks are related, the learned model parameters (which can also be understood as knowledge learned by the model) can be shared with the new model in some way through transfer learning, so as to accelerate and optimize the learning efficiency of the new model rather than learning from scratch as most networks do.
Feedforward Neural Network (FNN): a feedforward network is one kind of artificial neural network. In such a network, each neuron receives the output of the previous layer, starting from the input layer, and passes its output to the next layer until the output layer; there is no feedback in the whole network, so it can be represented by a directed acyclic graph. According to the number of layers, feedforward neural networks can be divided into single-layer and multi-layer feedforward neural networks. Common feedforward neural networks include the perceptron, the BP (Back Propagation) network and the RBF (Radial Basis Function) network.
Multi-dialect speech recognition model: a model newly proposed by the present application for recognizing speech data of a target language that contains multiple dialects. Based on this model, dialect speech recognition can be performed on the speech data to be recognized of the target language to determine the text information corresponding to the speech data to be recognized, i.e., the target text information herein, and the dialect to which the speech belongs, i.e., the target dialect category herein.
Linear Discriminant Analysis (LDA): also known as Fisher Linear Discriminant (FLD), is a classical algorithm for pattern recognition. The basic idea is to project high-dimensional pattern samples to an optimal identification vector space to achieve the effects of extracting classification information and compressing the dimension of a feature space, and after projection, the pattern samples are ensured to have the maximum inter-class distance and the minimum intra-class distance in a new subspace, namely, the pattern has the optimal separability in the space.
Connectionist Temporal Classification (CTC): a tool for sequence modeling whose core is the definition of a special objective function/optimization criterion. It is used to solve classification problems on time-series data, such as speech recognition and Optical Character Recognition (OCR).
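As an illustration of how a CTC criterion can be applied to per-frame network outputs, the following is a minimal sketch using PyTorch's CTC loss; the vocabulary size, tensor shapes and blank index are assumptions for illustration and are not taken from the application.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the patent's code): applying a CTC criterion to per-frame
# log-probabilities, as a sequence objective for speech recognition.
vocab_size = 100                     # assumed output-unit count; index 0 is the blank symbol
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, U = 200, 4, 30                 # frames, batch size, max target length (assumed)
log_probs = torch.randn(T, N, vocab_size, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, vocab_size, (N, U))                   # label indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, U, (N,), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                      # gradients flow back to the network outputs
```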
Embodiments of the present application relate to Artificial Intelligence (AI) and Machine Learning (ML) technologies, which are designed based on computer vision technology and Machine Learning in Artificial Intelligence.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.
Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions. With the research and progress of artificial intelligence technology, artificial intelligence is researched and applied in a plurality of fields, such as common smart homes, smart customer service, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, robots, smart medical treatment and the like.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer can simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Compared with data mining, which looks for shared characteristics in big data, machine learning focuses on algorithm design, enabling a computer to automatically learn rules from data and use those rules to predict unknown data.
Machine learning is the core of artificial intelligence, is the fundamental way to endow computers with intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning. The multi-dialect speech recognition model in the embodiments of the application is obtained by training with machine learning or deep learning techniques. The speech recognition method based on the multi-dialect speech recognition model in the embodiments of the application can be used to update the parameters of models for text processing, speech processing, semantic understanding, machine translation, question-answering robots, knowledge graphs and the like.
The speech recognition method provided in the embodiments of the application mainly comprises two parts: model training and model application. The model training part trains the multi-dialect speech recognition model through machine learning. After model training is completed, the trained multi-dialect speech recognition model can be used to perform speech recognition of a target language comprising multiple dialects, obtaining the target text information and target dialect category corresponding to the speech data to be recognized; subsequent services, such as speech-to-text services and speech synthesis services, can then be performed based on the obtained recognition results.
The following briefly introduces the design concept of the embodiments of the present application:
automatic Speech Recognition (ASR) is an active research topic in the field of artificial intelligence. The speech recognition purpose is to convert speech signals into corresponding text representations, the basic framework of which is shown in fig. 1. The voice signal firstly needs to be subjected to acoustic feature extraction, information is greatly compressed and converted into a form which can be better divided by a machine, and then the features are sent to a decoder to decode a recognition result. The decoder needs the combined action of the acoustic model, the language model and the pronunciation dictionary to score the features to obtain the final decoding sequence.
Related services based on speech recognition technology have been widely applied in people's daily life and work (such as smart speakers and vehicle-mounted systems), but due to the limitations of data resources and language propagation range, speech recognition technology for some target languages has developed relatively slowly. Because such a target language is used within a smaller range and lacks the data support required for research, there is comparatively little research on speech recognition for it.
In traditional modeling methods, modules such as an acoustic model, a pronunciation dictionary and a language model need to be constructed separately in order to realize speech recognition. Because corpus resources of the target language are scarce, it is difficult to record a large amount of target-language speech data; the corpus scale is small, the pronunciation dictionary is difficult to construct, and related research is concentrated on a single dialect. In addition, the coverage and balance of pronunciation phenomena are low, so the recognition rate of an acoustic model trained on such a corpus is also low.
In view of this, embodiments of the present application provide a speech recognition method and apparatus, an electronic device, and a storage medium. According to the speech recognition method in the embodiments of the application, when dialect speech recognition is performed on speech data to be recognized, the speech acoustic features of each frame of speech data are first extracted; deep feature learning and coding are then performed on the per-frame speech acoustic features to obtain, respectively, a dialect embedding feature characterizing the dialect information of the speech data to be recognized and an acoustic coding feature characterizing its acoustic information; dialect speech recognition is then performed on the speech data to be recognized based on these features, yielding the target text information and target dialect category corresponding to the speech data to be recognized. Based on the multi-dialect characteristics of the target language, learning of the dialect embedding feature is added and combined with the acoustic coding feature, so that more dialect information is used for dialect speech recognition; while converting the speech data to be recognized of the target language into text information, the dialect category to which it belongs is also output, and speech recognition across multiple dialects is realized efficiently and accurately.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application, and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 2 is a schematic view of an application scenario according to an embodiment of the present application. The application scenario diagram includes two terminal devices 210 and a server 220.
The terminal device 210 in the embodiment of the present application may have a client installed thereon. The client may be software (e.g., a browser or an instant messaging application), or a web page, an applet, etc.; the server 220 is the background server corresponding to the software, web page or applet, which is not limited in this application.
For example, a user a may log in an instant messaging application through the terminal device 210, a user B may also log in the instant messaging application through another terminal device 210, and the user a and the user B may send a voice message based on the instant messaging application. For example, when a voice message in a target language is sent between the user a and the user B, the dialect category corresponding to the voice message sent by the user may be identified based on the multi-dialect voice recognition method in the embodiment of the present application, and the voice message may be converted into a text message.
It should be noted that the above is only a simple example, and the embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, smart traffic, assisted driving, and the like.
In an alternative embodiment, the terminal device 210 and the server 220 may communicate with each other via a communication network.
In an alternative embodiment, the communication network is a wired network or a wireless network.
In this embodiment, the terminal device 210 is a computer device used by a user, and the computer device may be a computer device having a certain computing capability and running instant messaging software and a website or social contact software and a website, such as a personal computer, a mobile phone, a tablet computer, a notebook, an electronic book reader, an intelligent voice interaction device, an intelligent appliance, and a vehicle-mounted terminal. Each terminal device 210 is connected to a server 220 through a wireless network, and the server 220 is a server or a server cluster or a cloud computing center formed by a plurality of servers, or is a virtualization platform.
Optionally, in this embodiment of the application, the multi-dialect speech recognition model may be deployed on the terminal device 210 or on the server 220 for training. The server 220 may store a large number of training samples containing sample voice data of the various dialects in the target language, which are used for training the multi-dialect speech recognition model. Optionally, after the multi-dialect speech recognition model is obtained with the training method in the embodiment of the present application, the trained model may be deployed directly on the server 220 or the terminal device 210. Generally, the multi-dialect speech recognition model is deployed directly on the server 220; in the embodiment of the application, the model is typically used for dialect speech recognition to obtain the recognized text information and dialect category, on which subsequent services, such as speech-to-text and speech synthesis services, can be further performed.
It should be noted that the speech recognition method in the embodiment of the present application may be executed by the server or the terminal device alone, or jointly by the server and the terminal device. The description below mainly takes execution by the server alone as an example, and the model training process is generally performed by the server 220 alone.
It should be noted that fig. 2 is only an example, and the number of the terminal devices and the servers is not limited in practice, and is not specifically limited in the embodiment of the present application.
The speech recognition method provided by the exemplary embodiment of the present application is described below with reference to the accompanying drawings in conjunction with the application scenarios described above, it should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
Referring to fig. 3, which is an implementation flow chart of the speech recognition method provided in the embodiment of the present application, the following description mainly takes the server as the execution subject. The implementation flow of the method is as follows:
s31: the method comprises the steps that a server obtains voice data to be recognized of a target language, wherein the target language comprises a plurality of dialects;
the target language is a language containing multiple dialects, the target language can be a language with less data resources, such as some languages, the data acquisition difficulty of the languages in the languages, and the resources cannot be compared with the resources in the languages to be recognized, so that the text information and the dialects corresponding to the voices are obtained.
The following description mainly takes the target language as the Tibetan language as an example, and the Tibetan language can be divided into Tibetan language, Anduo dialect and Kanba dialect according to the main dialect area. When the speech data to be recognized acquired in step S31 is Tibetan, the speech data to be recognized may include at least one Tibetan dialect.
S32: after framing processing is carried out on the voice data to be recognized by the server, voice acoustic characteristics corresponding to each frame of voice data in the voice data to be recognized are extracted;
in this embodiment of the present application, frame division processing needs to be performed on voice data to be recognized to obtain each frame of voice data, and then acoustic features of each frame of voice data are extracted to obtain respective corresponding voice acoustic features of each frame of voice data, and further feature processing is performed based on the features extracted in this step, which is specifically referred to in step S33.
The process of extracting the acoustic features of each frame of voice data can be the same as in the related art; this step mainly compresses the voice information greatly and converts it into a form that a machine can discriminate more easily.
For example, the acoustic features can be generated directly from speech through operations such as spectral framing, time-frequency conversion and filtering. Commonly used features include Mel-Frequency Cepstral Coefficients (MFCC) and Filter Bank (FBank) features.
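As a hedged illustration of this front end, the sketch below extracts 40-dimensional MFCC features frame by frame with librosa; the sampling rate, window and hop sizes are assumptions for illustration, not values specified by the application.

```python
import librosa
import numpy as np

# Minimal sketch (assumed parameters): extract 40-dimensional MFCC features per frame,
# roughly the front end described above (framing, time-frequency transform, filtering).
def extract_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)            # waveform, resampled to 16 kHz (assumed)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=400, hop_length=160,                  # 25 ms window, 10 ms shift at 16 kHz (assumed)
    )
    return mfcc.T                                   # shape: (num_frames, 40)

# features = extract_mfcc("utterance.wav")          # one 40-dim vector per frame
```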
In the related art, after the speech acoustic features corresponding to each frame of speech data are extracted, the features are directly sent to a decoder to decode the recognition result, as shown in fig. 1. In the embodiment of the present application, the speech acoustic features of each frame of speech data are not directly sent to the decoder to decode the recognition result, but the dialect information and the acoustic information are further learned for the speech acoustic features, which is specifically referred to as step S33.
S33: the server obtains dialect embedding characteristics corresponding to the voice data to be recognized by performing deep characteristic extraction on the voice acoustic characteristics; the voice acoustic features are coded to obtain acoustic coding features corresponding to the voice data to be recognized, and the dialect embedding features are used for representing dialect information of a dialect to which the voice data to be recognized belongs;
the step is mainly divided into two sub-steps, one of which is as follows: carrying out depth feature extraction on the voice acoustic features; the second step is as follows: the speech acoustic features are encoded. These two substeps may be implemented based on a machine learning approach. The embodiment of the application provides a multi-party speech sound recognition model obtained based on machine learning mode training, and the model comprises a dialect recognition network and a voice recognition network.
An alternative embodiment is that, the step S33 may be executed based on the multi-party speech recognition model in the embodiment of the present application, specifically:
based on a dialect recognition network in a trained multi-party speech recognition model, carrying out deep feature extraction on speech acoustic features to obtain dialect embedding features corresponding to speech data to be recognized, wherein the part of network is mainly used for learning dialect information of the speech data to be recognized; and performing dimension-increasing coding on the voice acoustic characteristics based on a voice recognition network in the multi-party speech recognition model to obtain acoustic coding characteristics corresponding to the voice data to be recognized, wherein the partial network is mainly used for learning the acoustic information of the voice data to be recognized, and on the basis, the dialect information can be further learned, and the multi-party speech recognition model is described in detail with reference to the attached drawings.
S34: and the server performs dialect voice recognition on the voice data to be recognized based on the dialect embedding characteristics and the acoustic coding characteristics to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
In the foregoing embodiment, based on the speech recognition method in the embodiment of the present application, when dialect speech recognition is performed on speech data to be recognized, the speech acoustic features of each frame of speech data are first extracted; deep learning and coding are then performed on the per-frame speech acoustic features to obtain a dialect embedding feature characterizing the dialect information of the speech data to be recognized and an acoustic coding feature characterizing its acoustic information; dialect speech recognition is then performed on the speech data to be recognized based on these features, yielding the target text information and target dialect category corresponding to the speech data to be recognized. Based on the multi-dialect characteristics of the target language, learning of the dialect embedding feature is added and combined with the acoustic coding feature, so that more dialect information is used for dialect speech recognition; while converting the speech data to be recognized of the target language into text information, the dialect category to which it belongs is also output, and speech recognition across multiple dialects is realized efficiently and accurately.
The following describes in detail a multi-party speech recognition model in the embodiment of the present application, taking a target language as a Tibetan language as an example:
in the Tibetan language speech recognition method in the related art, modeling is usually performed only for a single dialect, such as a Tibetan language, and is limited by the construction of a pronunciation dictionary, the recognition performance of the mixed modeling of multiple Tibetan languages has a large loss compared with the performance of the modeling of the single dialect, the Tibetan language is a language of multiple dialects, and a model of the Tibetan language is difficult to apply in regions where the multiple dialects and the Kangba dialects are installed, so that three sets of Tibetan language recognition models for different dialects need to be built. On the other hand, the three dialects of the Tibetan language are greatly different, and the recognition performance of different dialects is obviously reduced by directly performing mixed modeling on the three dialects.
The multi-party speech recognition model provided in the embodiment of the application is a Tibetan language multi-party language end-to-end model based on the Tibetan language multi-party language characteristics, and can be used for recognizing Tibetan language voices of different dialects.
First, a dialect recognition network of a multi-party speech recognition model in the embodiment of the present application is described in detail:
when the target language is Tibetan, the speech data to be recognized is Tibetan speech data. In the embodiment of the application, the dialect recognition network is adopted to extract embedding (namely dialect embedding characteristics) of the Tibetan dialect.
Optionally, the dialect recognition network in the embodiment of the present application may be a time-delay neural network. Compared with a traditional statistics-based method such as the Gaussian Mixture Model (GMM), the time-delay neural network can learn the information in the data more effectively when extracting the dialect embedding.
In an optional implementation manner, the Tibetan dialect embedding extraction structure of the present application is mainly composed of three modules, that is, the dialect recognition network in the embodiment of the present application comprises three sub-networks: a depth feature extraction sub-network, a time-sequence pooling sub-network and a dialect classification sub-network.
Referring to fig. 4, when performing deep feature extraction on the speech acoustic features based on the dialect recognition network described above, the method specifically includes the following steps:
s41: the server inputs the voice acoustic features of each frame of voice data into a depth feature extraction sub-network respectively to obtain the frame level depth features corresponding to each frame of voice data;
in the embodiment of the present application, the sub-network for depth feature extraction is used to extract information at a deeper level from the frame-level speech acoustic features, so as to obtain the frame-level depth features.
In the embodiment of the application, 40-dimensional MFCC features are used to train the dialect recognition network, and the dialect recognition network shares the same features with the speech recognition network, thereby saving resources.
For example, suppose a certain piece of speech data to be recognized contains 1000 audio frames, that is, 1000 frames of speech data. First, the speech acoustic features a corresponding to the 1000 frames of speech data are generated directly through operations such as spectral framing, time-frequency conversion and filtering; these features are 40-dimensional MFCC features. Then, the depth feature extraction sub-network of the dialect recognition network performs deep learning on the input MFCC features to obtain the frame-level depth feature b corresponding to each frame of speech data; for example, if these features are 512-dimensional, this step obtains 1000 frame-level depth features b of 512 dimensions in total.
S42: the server integrates all frame-level depth features related to time sequence through a time sequence pooling sub-network to obtain a sentence-level feature vector with fixed dimensionality corresponding to the voice data to be recognized;
in the embodiment of the present application, since the speech signal is usually of indefinite length, it is necessary to further integrate and count the frame-level depth features related to the time sequence through the time sequence pooling sub-network, and finally output a sentence-level feature vector of a fixed dimension.
The selection of the pooling layers in the time-series pooling sub-network mainly includes global average pooling, maximum pooling, minimum pooling, and the like, and is not specifically limited herein.
Still taking the above 1000 frames of voice data as an example, the process integrates 1000 512-dimensional frame-level depth features b into 1 512-dimensional sentence-level feature vector c. In specific integration, feature values of the same dimension in the 1000 512-dimensional frame-level depth features b may be averaged to obtain a vector value of the dimension in the sentence-level feature vector c.
For example, the 1000 512-dimensional frame-level depth features b can be written as b1 to b1000. For the sentence-level feature vector c, the value of its first dimension is the mean of the first-dimension values of b1 to b1000, the value of its second dimension is the mean of the second-dimension values of b1 to b1000, and so on, thereby obtaining the 512-dimensional sentence-level feature vector c integrated from b1 to b1000.
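The averaging in this example corresponds to temporal average pooling; a minimal sketch follows (the tensors are random placeholders), together with the maximum-pooling option mentioned above.

```python
import torch

# Minimal sketch of the worked example above: average-pool 1000 frame-level depth
# features b (512-dim each) into one fixed-dimension sentence-level feature vector c.
b = torch.randn(1000, 512)          # frame-level depth features b1..b1000 (placeholder values)
c = b.mean(dim=0)                   # 512-dim sentence-level feature vector c (average pooling)

c_max = b.max(dim=0).values         # maximum pooling, another option listed above
```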
S43: and the server performs dimensionality reduction on the sentence-level feature vector to obtain dialect embedding features.
The dialect classification sub-network used in the embodiment of the application is a fully-connected classifier: the input fixed-dimension sentence-level feature vector is mapped onto the three Tibetan dialect targets through a two-layer fully-connected network, and after the final activation function the posterior probability of each dialect is output, so that the dialect corresponding to the speech data can be determined from the magnitudes of the probability values.
In the embodiment of the application, after the Tibetan dialect recognition network has been trained with Tibetan speech data, a fixed-dimension dialect embedding, i.e., a sentence-level feature vector, can be extracted for all Tibetan training data from the penultimate layer of the dialect recognition network. Generally, the embedding dimension is rather high, and methods such as LDA (Linear Discriminant Analysis) are needed to reduce the dimension and enhance the discriminability of the embedding. The embedding contains dialect-related information and can help the speech recognition model better model dialect-accented speech.
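The following sketch illustrates a dialect classification sub-network of this shape, with the penultimate layer's output taken as the dialect embedding and LDA used afterwards for dimension reduction; all layer sizes and the ReLU activation are assumptions for illustration, not values given by the application.

```python
import torch
import torch.nn as nn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Sketch only (layer sizes are assumptions): a two-layer fully-connected dialect
# classifier over the sentence-level vector; the penultimate layer's output is taken
# as the dialect embedding, and LDA reduces its dimension afterwards.
class DialectClassifier(nn.Module):
    def __init__(self, in_dim=512, hidden_dim=256, num_dialects=3):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)    # penultimate layer -> dialect embedding
        self.fc2 = nn.Linear(hidden_dim, num_dialects)

    def forward(self, sentence_vec):
        emb = torch.relu(self.fc1(sentence_vec))    # dialect embedding
        logits = self.fc2(emb)
        return logits.softmax(dim=-1), emb          # posterior over 3 dialects, embedding

# After training, collect embeddings of all training utterances and fit LDA:
# lda = LinearDiscriminantAnalysis(n_components=2)          # <= num_classes - 1
# low_dim_emb = lda.fit_transform(all_embeddings, all_dialect_labels)
```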
Next, a speech recognition network of the multi-dialect speech recognition model in the embodiment of the present application will be described in detail:
The multi-dialect speech recognition model in the embodiment of the application is a Tibetan multi-dialect speech recognition system based on strongly-coupled dialect recognition. Under the traditional speech recognition framework, a corresponding pronunciation dictionary needs to be constructed for each language. The pronunciation dictionary contains the mapping from words to pronunciation sequences and is the bridge between the acoustic model and the language model; it usually requires a large amount of professional manual labeling and consumes considerable resources, which makes it hard to build for low-resource minority languages such as Tibetan, and Tibetan has three main dialects, making it even harder to obtain. An end-to-end speech recognition system usually models speech to characters jointly, training in one network what would be the acoustic model, language model and pronunciation dictionary under the traditional framework, thereby dispensing with the complex pronunciation-dictionary construction work.
Referring to fig. 5, a sequence-to-sequence end-to-end network structure is proposed in the embodiment of the present application. The network comprises two main modules: 1) an encoding (Encoder) module, which is responsible for encoding the sequence formed by the speech acoustic features of each frame of speech data into higher-dimensional features (namely, the acoustic coding features obtained by the dimension-increasing coding described above), so as to model the acoustic information; 2) a decoding (Decoder) module, which is responsible for decoding the output of the Encoder into the corresponding text, modeling the language information. Meanwhile, in order to improve the speech recognition performance of the model, a CTC speech recognition branch is further added using a multi-task learning method to assist model training.
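As a hedged skeleton of this structure (an Encoder performing the dimension-increasing coding, an attention Decoder producing text, and an auxiliary CTC branch for multi-task learning), the sketch below uses Transformer layers; the layer types, sizes and loss weighting are assumptions for illustration, since the application does not fix them here.

```python
import torch
import torch.nn as nn

# Skeleton only (layer types/sizes and the 0.3 CTC weight are assumptions): an Encoder
# raising frame features to a higher-dimensional representation, an attention Decoder
# predicting syllable tokens, and an auxiliary CTC branch on the Encoder output.
class Seq2SeqASR(nn.Module):
    def __init__(self, feat_dim=40, d_model=256, vocab_size=1000):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)              # dimension-increasing coding
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)                   # Decoder output layer
        self.ctc_out = nn.Linear(d_model, vocab_size)               # auxiliary CTC branch

    def forward(self, feats, prev_tokens):
        # feats: (batch, T, feat_dim); prev_tokens: (batch, U); attention masks omitted
        enc = self.encoder(self.input_proj(feats))                  # acoustic coding features
        dec = self.decoder(self.embed(prev_tokens), enc)
        # attention logits and CTC log-probs (transpose the latter to (T, N, V) for nn.CTCLoss)
        return self.out(dec), self.ctc_out(enc).log_softmax(dim=-1)

# Multi-task loss (the weighting is an assumed hyper-parameter):
#   loss = 0.7 * cross_entropy(att_logits, targets) + 0.3 * ctc_loss(ctc_logp, ...)
```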
In the embodiment of the application, the end-to-end model can model characters directly, and only a dictionary of the modeling units needs to be constructed. Tibetan is a phonetic script, but words are not separated by spaces; text is instead segmented into syllable characters. Therefore, Tibetan syllable characters can be taken as the modeling units in the embodiment of the application, and the corresponding dictionary can be obtained by counting the syllable characters in the training-set text after syllable segmentation.
In addition, considering that the pronunciation differences among the three Tibetan dialects are large, a model trained by mixing the data of the three dialects often does not achieve the best performance on a single dialect, and the mixed-dialect model then needs to be fine-tuned with single-dialect data; however, this generally harms the recognition performance on the other two dialects.
In the structure shown in fig. 6, compared with fig. 5, in addition to the text information obtained by dialect speech recognition of the Tibetan speech data, the dialect label corresponding to the Tibetan speech data, i.e., the dialect category, is further output.
In the embodiment of the present application, modeling is performed with the strongly coupled dialect recognition and speech recognition method, and the training set text needs to be modified first, that is, a corresponding dialect label is appended to the end of each training set text, for example a Tibetan sentence followed by the tag [bou] ([bou] representing the Ü-Tsang dialect). Meanwhile, because the model judges the dialect of a sentence while performing dialect speech recognition, the dialect recognition result can also be provided to subsequent services, for example calling the speech synthesis system of a different dialect according to the recognized dialect.
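A minimal sketch of this transcript labeling; the tag strings other than [bou] are hypothetical placeholders for the other two dialects:

DIALECT_TAGS = {"u-tsang": "[bou]", "kham": "[kham]", "amdo": "[amdo]"}

def tag_transcript(text, dialect):
    # Append the dialect label to the end of a training transcript so the
    # model is forced to predict the dialect together with the text.
    return f"{text} {DIALECT_TAGS[dialect]}"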
In addition, in order to exploit more dialect information during training, the dialect embedding extracted in the above process is also fed into the network for training, which further improves the speech recognition performance of the model on the three dialects. Specifically, the embedding of each dialect is passed through a feedforward neural network and then concatenated with the output of the Encoder module; a specific example is shown in fig. 7, a schematic diagram of the model structure after the dialect embedding is added in the embodiment of the application.
That is, when step S34 is executed, an alternative embodiment is:
firstly, high-dimensional feature extraction is performed on the dialect embedding feature through a feedforward neural network in the multi-dialect speech recognition model to obtain the dialect depth feature corresponding to the voice data to be recognized; then, the dialect depth feature is input into the voice recognition network and concatenated with the acoustic coding features based on the voice recognition network to obtain the splicing feature corresponding to the voice data to be recognized; prediction is then performed based on the splicing feature to obtain the target text information and the target dialect category corresponding to the voice data to be recognized. The feedforward neural network is mainly used to abstract the embedding into a feature of higher dimension.
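The following sketch illustrates this fusion step with assumed tensor shapes (the embedding size, projection size and encoder width are not taken from the patent): the utterance-level dialect embedding is lifted by a feed-forward network and the resulting dialect depth feature is concatenated with every frame of the Encoder output:

import torch
import torch.nn as nn

class DialectFusion(nn.Module):
    def __init__(self, emb_dim=2, proj_dim=64, enc_dim=256):
        # emb_dim: dimension of the (e.g. LDA-reduced) dialect embedding.
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(emb_dim, proj_dim), nn.ReLU(),
                                 nn.Linear(proj_dim, proj_dim), nn.ReLU())

    def forward(self, enc_out, dialect_emb):
        # enc_out: (B, T, enc_dim); dialect_emb: (B, emb_dim)
        d = self.ffn(dialect_emb)                        # (B, proj_dim) dialect depth feature
        d = d.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        return torch.cat([enc_out, d], dim=-1)           # (B, T, enc_dim + proj_dim)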
In the embodiment of the application, a modeling unit for the dialects is added to the output layer of the sequence-to-sequence end-to-end model, the extracted dialect embedding is introduced and, after passing through a feedforward neural network, concatenated with the output of the Encoder, so that more dialect information can be utilized and dialect identification is supported.
Optionally, the speech recognition network includes two branches, namely a backbone classification sub-network and an auxiliary classification sub-network; as shown in fig. 7, the Decoder part serves as the backbone classification sub-network and the CTC part serves as the auxiliary classification sub-network.
When recognition is performed based on these two branches, the following two modes can be distinguished:
the first mode: the splicing features are input into the backbone classification sub-network and the auxiliary classification sub-network respectively for decoding to obtain candidate results output by each of the two sub-networks, where each candidate result contains candidate text information and a candidate dialect category for the voice data to be recognized; one of the candidate results is then selected, and the candidate text information and the candidate dialect category it contains are taken as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
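A minimal sketch of the first mode; the selection rule (keeping the hypothesis with the higher model score) and the branch interfaces are assumptions, since the selection criterion is not fixed above:

def decode_with_both_branches(fused_features, decoder_branch, ctc_branch):
    # Each branch is a callable returning (text, dialect_category, score).
    cand_dec = decoder_branch(fused_features)
    cand_ctc = ctc_branch(fused_features)
    # Keep the candidate result with the higher score (assumed criterion).
    return max([cand_dec, cand_ctc], key=lambda cand: cand[2])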
The second mode: the splicing features are input into the auxiliary classification sub-network for decoding to obtain a plurality of candidate results output by the auxiliary classification sub-network; each candidate result is then input into the backbone classification sub-network for decoding, and the target text information and the target dialect category corresponding to the voice data to be recognized are obtained.
Here, decoding based on the backbone classification sub-network to obtain the final result actually means: each candidate result is input into the backbone classification sub-network for decoding to obtain an evaluation value for that candidate result; one of the candidate results is then selected based on the evaluation values, and the candidate text information and the candidate dialect category it contains are taken as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
That is, in this mode, a plurality of candidate results are decoded by the CTC branch and then sent to the Decoder for re-scoring, and the optimal sequence is output as the final result by combining the CTC scores of the candidate results with the Decoder scores.
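A sketch of the second mode's rescoring step under stated assumptions: each CTC hypothesis carries a CTC log-probability, the Decoder assigns it a second log-probability, and the two are interpolated with a weight; the weight value and the function names are illustrative:

def rescore(ctc_nbest, decoder_score_fn, weight=0.3):
    """ctc_nbest: list of (token_sequence, ctc_log_prob) hypotheses;
    decoder_score_fn: returns the Decoder log-probability of a sequence."""
    best_seq, best_score = None, float("-inf")
    for seq, ctc_score in ctc_nbest:
        score = weight * ctc_score + (1.0 - weight) * decoder_score_fn(seq)
        if score > best_score:
            best_seq, best_score = seq, score
    # The best sequence contains the text plus the trailing dialect tag.
    return best_seq, best_score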
In addition to the two modes listed above, a single branch may also be used; for example, only the Decoder branch is used to obtain a candidate result, and the candidate text information and the candidate dialect category contained in the candidate result output by that branch are taken as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
It should be noted that the combined or single use of the branches listed in this application is only an example and can be configured flexibly in practice. The CTC branch is mainly intended to help the model train better, so as to obtain more accurate recognition results.
The following describes in detail the training process of the multi-dialect speech recognition model in the embodiment of the present application:
Optionally, the multi-dialect speech recognition model is obtained by training in the following way:
firstly, a training sample data set consisting of training samples corresponding to the plurality of dialects contained in the target language is obtained; then the multi-dialect speech recognition model is subjected to cyclic iterative training according to the training samples in the training sample data set, and when training is finished the trained multi-dialect speech recognition model is output. The following operations are performed in one loop iteration of training, as shown in fig. 8:
S81: the server selects training samples from the training sample data set, inputs the selected training samples into the multi-dialect speech recognition model, and obtains the predicted dialect categories and the predicted text information output by the multi-dialect speech recognition model;
Optionally, the speech recognition network in the embodiment of the present application may include a backbone classification sub-network and an auxiliary classification sub-network; in that case, step S81 specifically includes the following two substeps:
S811: the server acquires the first predicted dialect category output by the dialect recognition network;
S812: the server acquires the second predicted dialect categories and the predicted text information output by the backbone classification sub-network and the auxiliary classification sub-network respectively.
The steps S811 and S812 may be executed synchronously or asynchronously without any time limitation.
After the predicted dialect categories and the predicted text information are obtained, step S82 may be executed:
S82: the server adjusts the parameters of the multi-dialect speech recognition model based on the predicted dialect categories, the predicted text information and the real text information.
Optionally, each training sample includes sample voice data, and a real dialect category and real text information corresponding to the sample voice data; on the basis of the above, step S82 specifically includes the following two substeps:
S821: the server adjusts the parameters of the dialect recognition network based on the difference between the real dialect category of the sample voice data in the training sample and the first predicted dialect category;
In this step, the loss function constructed from the difference between the real dialect category and the first predicted dialect category may be any function suitable for multi-class classification, such as cross entropy, and is not specifically limited herein.
S822: the server adjusts parameters of the speech recognition network based on differences between the real dialect categories and the second predicted dialect categories and differences between real text information and predicted text information of the sample speech data in the training samples.
In this step, considering that two branches are actually included, it is necessary to combine the results output from each of the two branches to construct a loss function.
An alternative embodiment is that S822 can be implemented according to the flowchart shown in fig. 9, including the following steps:
S91: the server determines the first loss function corresponding to the backbone classification sub-network based on the difference between the second predicted dialect category output by the backbone classification sub-network and the real dialect category, and the difference between the predicted text information and the real text information;
S92: the server determines the second loss function corresponding to the auxiliary classification sub-network based on the difference between the second predicted dialect category output by the auxiliary classification sub-network and the real dialect category, and the difference between the second predicted text information and the real text information;
S93: the server adjusts the parameters of the voice recognition network based on the first loss function and the second loss function.
Specifically, a target loss function may be constructed from the first loss function and the second loss function, as follows:

$\mathcal{L}(x, y) = (1-\lambda)\,\mathcal{L}_{att}(x, y) + \lambda\,\mathcal{L}_{ctc}(x, y)$

where $x$ is the input acoustic sequence, $y$ is the output character sequence, $\mathcal{L}_{att}$ is the loss associated with the Decoder main branch, namely the first loss function, $\mathcal{L}_{ctc}$ is the CTC loss associated with the CTC auxiliary branch, namely the second loss function, and $\lambda$, which controls the contribution of the two, can be set empirically and is not specifically limited herein.
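A minimal PyTorch sketch of computing this joint objective (tensor shapes and the value of the interpolation weight are assumptions):

import torch
import torch.nn.functional as F

def joint_loss(dec_logits, targets, ctc_logits, ctc_targets,
               input_lengths, target_lengths, lam=0.3):
    # dec_logits: (B, U, V); targets: (B, U) token ids for the Decoder branch.
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    # ctc_logits: (B, T, V) -> CTC expects (T, B, V) log-probabilities.
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)
    l_ctc = F.ctc_loss(log_probs, ctc_targets, input_lengths, target_lengths)
    # Interpolate the Decoder (first) loss and the CTC (second) loss.
    return (1.0 - lam) * l_att + lam * l_ctc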
In addition, it should be noted that the multi-dialect speech recognition model in the embodiment of the present application may also be a Tibetan multi-dialect speech recognition system based on transfer learning. Specifically, the initial network parameters of the speech recognition network are obtained by parameter migration based on a standard speech recognition model, and the standard speech recognition model is obtained by training on sample standard speech data; fig. 10 is a schematic structural diagram of the Tibetan dialect speech recognition model based on transfer learning in the embodiment of the present application.
The standard speech here can be Chinese, in which case the standard speech recognition model is a Chinese recognition model. Transfer learning is applied to the Tibetan model using a Chinese model trained on a large amount of data, which can effectively improve the speech recognition performance of the model on the multiple dialects.
Specifically, because Tibetan belongs to the same Sino-Tibetan language family as Chinese but its data is difficult to acquire and its resources cannot compare with a resource-rich language such as Chinese, on the basis of the Tibetan multi-dialect speech recognition model with the added dialect recognition network and dialect modeling unit, the method transfers the weights of a Chinese model trained on a large amount of data to the Tibetan model through weight transfer in transfer learning, replacing the part of the speech recognition network shown in the dashed box in fig. 10; the output layer outside the dashed box and the feedforward neural network parameters related to the embedding are randomly initialized, the training learning rate is reduced, and training then continues with the Tibetan data, thereby making use of the knowledge of a resource-rich language such as Chinese.
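A sketch of the weight-transfer step under stated assumptions (checkpoint format, parameter-name prefixes and the learning-rate value are illustrative): matching weights are copied from the Chinese model, the output layer and the embedding-related feed-forward layers keep their random initialization, and fine-tuning proceeds with a reduced learning rate:

import torch

def transfer_weights(tibetan_model, chinese_ckpt_path, skip_prefixes=("out.", "ffn.")):
    # Assume the checkpoint stores a plain state_dict of the Chinese model.
    src_state = torch.load(chinese_ckpt_path, map_location="cpu")
    dst_state = tibetan_model.state_dict()
    for name, tensor in src_state.items():
        if name.startswith(skip_prefixes):
            continue                                  # keep random initialization here
        if name in dst_state and dst_state[name].shape == tensor.shape:
            dst_state[name] = tensor                  # migrate matching weights
    tibetan_model.load_state_dict(dst_state)
    # Continue training on the Tibetan data with a reduced learning rate.
    return torch.optim.Adam(tibetan_model.parameters(), lr=1e-4)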
Taking user A, user B and user C performing a voice chat as an example: when user A and user B use Tibetan of different dialects and each sends a voice message to user C, the voice messages can be recognized with the methods listed above, converted into text information, labeled with the dialect category to which they belong, further translated, and so on.
The specific process is as follows: firstly, framing is performed on each of the two voice messages to obtain multiple frames of voice data; then, through operations such as spectral analysis, time-frequency conversion and filtering, the speech acoustic features corresponding to each frame of voice data in a voice message are generated; further, based on the model shown in fig. 7, deep feature extraction and dimension-increasing coding are applied to the speech acoustic features to obtain the dialect embedding features and acoustic coding features corresponding to the voice message.
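A minimal front-end sketch using torchaudio, with assumed parameter values (25 ms windows and 10 ms shift at 16 kHz, 80 mel filters); the patent does not prescribe these settings:

import torch
import torchaudio

def extract_features(wav_path):
    waveform, sample_rate = torchaudio.load(wav_path)          # (1, num_samples)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80)
    feats = mel(waveform)                                       # (1, 80, T)
    feats = torch.log(feats + 1e-6).squeeze(0).transpose(0, 1)  # log compression
    return feats                                                # (T, 80) per-frame acoustic features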
Further, high-dimensional feature extraction is performed on the obtained dialect embedding features through the feedforward neural network FNN to obtain the corresponding dialect depth features; the dialect depth features and the acoustic coding features are then concatenated and recognized through the backbone classification sub-network and the auxiliary classification sub-network respectively to obtain the final recognition result, namely the target text information and the target dialect category.
For example, the Chinese text corresponding to the voice message actually sent by user A to user C is: "Hello Zhang San, my name is Li Si; I would like to ask whether you have taken part in a certain activity recently." The recognized target text information is the corresponding Tibetan sentence (shown as an image in the original publication), and the target dialect category is Kham. The Chinese text corresponding to the voice message actually sent by user B to user C is: "The weather is clear today; if you have nothing to do, shall we go on an outing together?" The recognized target text information is the corresponding Tibetan sentence, and the target dialect category is Ü-Tsang.
When user C views the two voice messages, the corresponding target text information and target dialect categories can be obtained with the method described above; on this basis, the text can further be translated into Chinese or another language, and replies can be sent to user A and user B respectively, realizing barrier-free communication.
It should be noted that the above listed application scenarios are only examples, and in fact, any application scenario related to speech recognition of a target language can be executed by using the method in the embodiment of the present application, and is not limited specifically herein.
In addition, the method provided by the application can be used to train a sequence-to-sequence end-to-end speech recognition model for any set of dialects without manually annotating a Tibetan pronunciation dictionary; extracting the dialect embedding provides extra dialect information for the speech recognition model and improves its performance on multiple dialects. Through the strongly coupled dialect recognition method, a dialect modeling unit is added to the output layer and the model is forced to output a dialect label, giving the speech recognition model a certain ability to distinguish dialects. Finally, transfer learning from the Chinese model exploits the knowledge of a resource-rich language and further improves the speech recognition performance on the dialects; meanwhile, the model outputs the dialect label during dialect speech recognition, so the dialect result can be provided to subsequent services such as speech synthesis.
Based on the same inventive concept, the embodiment of the application also provides a voice recognition device. As shown in fig. 11, which is a schematic structural diagram of a speech recognition apparatus 1100 in an embodiment of the present application, the speech recognition apparatus may include:
a voice acquiring unit 1101 configured to acquire voice data to be recognized of a target language, where the target language includes multiple dialects;
the first extraction unit 1102 is configured to extract voice acoustic features corresponding to each frame of voice data in the voice data to be recognized after the voice data to be recognized is subjected to framing processing;
a second extraction unit 1103, configured to perform deep feature extraction on the speech acoustic features to obtain dialect embedding features corresponding to the speech data to be recognized; the voice acoustic features are coded to obtain acoustic coding features corresponding to the voice data to be recognized, and the dialect embedding features are used for representing dialect information of a dialect to which the voice data to be recognized belongs;
and the voice recognition unit 1104 is configured to perform dialect voice recognition on the voice data to be recognized based on the dialect embedding feature and the acoustic coding feature, and obtain target text information and a target dialect category corresponding to the voice data to be recognized.
Optionally, the second extracting unit 1103 is specifically configured to:
based on a dialect recognition network in a trained multi-dialect speech recognition model, performing deep feature extraction on the speech acoustic features to obtain the dialect embedding features corresponding to the speech data to be recognized;
the method for obtaining the acoustic coding features corresponding to the voice data to be recognized by performing dimension-increasing coding on the voice acoustic features comprises the following steps:
performing dimension-increasing coding on the voice acoustic features based on a voice recognition network in the multi-dialect speech recognition model to obtain the acoustic coding features corresponding to the voice data to be recognized.
Optionally, the speech recognition unit 1104 is specifically configured to:
carrying out high-dimensional feature extraction on the dialect embedding features through a feedforward neural network in the multi-dialect speech recognition model to obtain dialect depth features corresponding to the voice data to be recognized;
inputting the dialect depth features into a voice recognition network, and combining and splicing the dialect depth features and the acoustic coding features based on the voice recognition network to obtain splicing features corresponding to voice data to be recognized;
and predicting based on the splicing characteristics to obtain target text information corresponding to the voice data to be recognized and a target dialect category.
Optionally, the voice recognition network includes a backbone classification sub-network and an auxiliary classification sub-network; the speech recognition unit 1104 is specifically configured to:
respectively inputting the splicing features into the backbone classification sub-network and the auxiliary classification sub-network for decoding to obtain candidate results respectively output by the backbone classification sub-network and the auxiliary classification sub-network, wherein each candidate result comprises candidate text information and a candidate dialect category for the voice data to be recognized;
and selecting one from the candidate results, and taking the candidate text information and the candidate dialect category contained in the selected candidate result as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
Optionally, the voice recognition network includes a backbone classification sub-network and an auxiliary classification sub-network; the speech recognition unit 1104 is specifically configured to:
inputting the splicing characteristics into an auxiliary classification sub-network for decoding to obtain a plurality of candidate results output by the auxiliary classification sub-network, wherein each candidate result comprises candidate text information and candidate dialect categories aiming at the voice data to be recognized;
and respectively inputting each candidate result into the backbone classification sub-network for decoding to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
Optionally, the speech recognition unit 1104 is specifically configured to:
inputting each candidate result into the backbone classification sub-network for decoding to obtain an evaluation value of each candidate result;
one of the candidate results is selected based on the respective evaluation values, and the candidate text information and the candidate dialect category included in the selected candidate result are respectively taken as the target text information and the target dialect category corresponding to the speech data to be recognized.
Optionally, the dialect identification network includes a deep feature extraction sub-network, a time-sequence pooling sub-network and a dialect classification sub-network; the second extraction unit 1103 is specifically configured to:
respectively inputting the voice acoustic features of each frame of voice data into a depth feature extraction sub-network, and acquiring the frame level depth features corresponding to each frame of voice data;
integrating each frame level depth feature related to the time sequence through a time sequence pooling sub-network to obtain a sentence level feature vector with fixed dimensionality corresponding to the voice data to be recognized;
and performing dimensionality reduction on the sentence-level feature vector to obtain dialect embedding features.
Optionally, the apparatus further comprises:
the model training unit 1105 is configured to obtain the multi-dialect speech recognition model through the following training:
according to training samples in a training sample data set, executing cyclic iterative training on the multi-dialect speech recognition model, and outputting the trained multi-dialect speech recognition model when the training is finished, wherein the training sample data set comprises training samples corresponding to the various dialects contained in the target language; wherein the following operations are executed in one loop iteration training process:
selecting training samples from the training sample data set, inputting the selected training samples into the multi-dialect speech recognition model, and acquiring predicted dialect categories and predicted text information output by the multi-dialect speech recognition model;
and adjusting parameters of the multi-dialect speech recognition model based on the predicted dialect categories, the predicted text information and the real text information.
Optionally, the voice recognition network includes a backbone classification sub-network and an auxiliary classification sub-network; the model training unit 1105 is specifically configured to:
acquiring a first prediction dialect category output by a dialect identification network; and
and acquiring a second prediction dialect category and prediction text information output by the backbone classification sub-network and the auxiliary classification sub-network respectively.
Optionally, each training sample includes sample voice data, and a real dialect category and real text information corresponding to the sample voice data;
the model training unit 1105 is specifically configured to:
adjusting parameters of the dialect recognition network based on the difference between the real dialect category of the sample voice data in the training sample and the first predicted dialect category; and
and adjusting parameters of the voice recognition network based on the difference between the real dialect class and each second prediction dialect class and the difference between the real text information and each prediction text information of the sample voice data in the training sample.
Optionally, the model training unit 1105 is specifically configured to:
determining a first loss function corresponding to the trunk classification sub-network based on the difference between the second prediction dialect class and the real dialect class output by the trunk classification sub-network and the difference between the prediction text information and the real text information; and
determining a second loss function corresponding to the auxiliary classification sub-network based on the difference between the second prediction dialect class and the real dialect class output by the auxiliary classification sub-network and the difference between the second prediction text information and the real text information;
and adjusting parameters of the voice recognition network based on the first loss function and the second loss function.
Optionally, the initial network parameters of the speech recognition network are obtained by performing parameter migration based on a standard speech recognition model, and the standard speech recognition model is obtained by performing training based on sample standard speech data.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.
Having described the speech recognition method and apparatus according to an exemplary embodiment of the present application, next, a speech recognition apparatus according to another exemplary embodiment of the present application will be described.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module" or "system."
In some possible embodiments, a speech recognition apparatus according to the present application may include at least a processor and a memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the speech recognition method according to various exemplary embodiments of the present application described in the specification. For example, the processor may perform the steps as shown in fig. 3.
Based on the same inventive concept as the method embodiments, the embodiment of the application also provides an electronic device. In one embodiment, the electronic device may be a server, such as the server 220 shown in fig. 2. In this embodiment, the electronic device may be configured as shown in fig. 12, and include a memory 1201, a communication module 1203, and one or more processors 1202.
A memory 1201 for storing computer programs executed by the processor 1202. The memory 1201 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
Memory 1201 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1201 may also be a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1201 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1201 may also be a combination of the above memories.
The processor 1202 may include one or more Central Processing Units (CPUs), a digital processing unit, and the like. A processor 1202 for implementing the above-described speech recognition method when calling a computer program stored in the memory 1201.
The communication module 1203 is used for communicating with the terminal device and other servers.
In the embodiment of the present application, the specific connection medium between the memory 1201, the communication module 1203 and the processor 1202 is not limited. In fig. 12, the memory 1201 and the processor 1202 are connected by a bus 1204, which is depicted by a thick line; the connection manner between the other components is merely illustrative and not limiting. The bus 1204 may be divided into an address bus, a data bus, a control bus, and so on. For ease of description, only one thick line is depicted in fig. 12, but this does not mean that there is only one bus or only one type of bus.
The memory 1201 stores a computer storage medium, and the computer storage medium stores computer-executable instructions for implementing the speech recognition method according to the embodiment of the present application. The processor 1202 is configured to perform the speech recognition method described above, as shown in FIG. 3.
In another embodiment, the electronic device may also be other electronic devices, such as the terminal device 210 shown in fig. 2. In this embodiment, the structure of the electronic device may be as shown in fig. 13, including: a communication assembly 1310, a memory 1320, a display unit 1330, a camera 1340, a sensor 1350, an audio circuit 1360, a bluetooth module 1370, a processor 1380, and the like.
The communication component 1310 is used for communicating with a server. In some embodiments, a Wireless Fidelity (WiFi) module may be included; WiFi is a short-range wireless transmission technology, through which the electronic device can help the user send and receive information.
Memory 1320 may be used to store software programs and data. The processor 1380 performs various functions of the terminal device 210 and data processing by executing software programs or data stored in the memory 1320. The memory 1320 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. The memory 1320 stores an operating system that enables the terminal device 210 to operate. The memory 1320 may store an operating system and various application programs, and may also store codes for performing the speech recognition method according to the embodiment of the present application.
The display unit 1330 may also be used to display information input by or provided to the user and a Graphical User Interface (GUI) of various menus of the terminal apparatus 210. Specifically, the display unit 1330 may include a display screen 1332 provided on the front surface of the terminal device 210. The display 1332 may be configured in the form of a liquid crystal display, a light emitting diode, or the like. The display unit 1330 may be configured to display a client operation interface in the embodiment of the present application.
The display unit 1330 may also be configured to receive input numeric or character information, generate signal inputs related to user settings and function control of the terminal device 210, and specifically, the display unit 1330 may include a touch screen 1331 disposed on the front surface of the terminal device 210 and configured to collect touch operations by a user thereon or nearby, such as clicking a button, dragging a scroll box, and the like.
The touch screen 1331 may cover the display screen 1332, or the touch screen 1331 and the display screen 1332 may be integrated to implement the input and output functions of the terminal device 210, and after integration, the touch screen may be referred to as a touch display screen for short. The display unit 1330 may display the application programs and the corresponding operation steps.
The camera 1340 may be used to capture still images, and the user may post comments on the images taken by the camera 1340 through the application. The number of the cameras 1340 may be one or more. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing elements convert the light signals into electrical signals, which are then passed to a processor 1380 for conversion into digital image signals.
The terminal device may further comprise at least one sensor 1350, such as an acceleration sensor 1351, a distance sensor 1352, a fingerprint sensor 1353, a temperature sensor 1354. The terminal device may also be configured with other sensors such as a gyroscope, barometer, hygrometer, thermometer, infrared sensor, light sensor, motion sensor, and the like.
The audio circuit 1360, speaker 1361, microphone 1362 may provide an audio interface between the user and the terminal device 210. The audio circuit 1360 may transmit the electrical signal converted from the received voice data to the speaker 1361, and the electrical signal is converted into a sound signal by the speaker 1361 and output. The terminal device 210 may also be provided with a volume button for adjusting the volume of the sound signal. On the other hand, the microphone 1362 converts the collected sound signal into an electric signal, converts the electric signal into voice data after being received by the audio circuit 1360, and then outputs the voice data to the communication module 1310 to be transmitted to, for example, another terminal device 210, or outputs the voice data to the memory 1320 for further processing.
The bluetooth module 1370 is used for information interaction with other bluetooth devices having a bluetooth module through a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that is also equipped with a bluetooth module through the bluetooth module 1370, so as to perform data interaction.
The processor 1380 is a control center of the terminal device, connects various parts of the entire terminal device using various interfaces and lines, and performs various functions of the terminal device and processes data by running or executing software programs stored in the memory 1320 and calling data stored in the memory 1320. In some embodiments, processor 1380 may include one or more processing units; the processor 1380 may also integrate an application processor, which primarily handles operating systems, user interfaces, application programs, and the like, and a baseband processor, which primarily handles wireless communications. It will be appreciated that the baseband processor may not be integrated into the processor 1380. The processor 1380 in the present application may run an operating system, an application, a user interface display, and a touch response, as well as the speech recognition methods of the embodiments of the present application. Additionally, a processor 1380 is coupled to the display unit 1330.
In some possible embodiments, the various aspects of the speech recognition method provided herein may also be implemented in the form of a program product comprising program code for causing a computer device to perform the steps of the speech recognition method according to various exemplary embodiments of the present application described above in this specification when the program product is run on a computer device, for example the computer device may perform the steps as shown in fig. 3.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (16)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized of a target language, wherein the target language comprises a plurality of dialects;
after the voice data to be recognized is subjected to framing processing, extracting voice acoustic characteristics corresponding to each frame of voice data in the voice data to be recognized;
performing deep feature extraction on the voice acoustic features to obtain dialect embedding features corresponding to the voice data to be recognized; the voice acoustic features are coded to obtain acoustic coding features corresponding to the voice data to be recognized, and the dialect embedding features are used for representing dialect information of a dialect to which the voice data to be recognized belongs;
and carrying out dialect voice recognition on the voice data to be recognized based on the dialect embedding characteristics and the acoustic coding characteristics to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
2. The method of claim 1, wherein the obtaining of dialect embedding features corresponding to the voice data to be recognized by performing deep feature extraction on the voice acoustic features comprises:
based on a dialect recognition network in a trained multi-dialect speech recognition model, performing deep feature extraction on the voice acoustic features to obtain dialect embedding features corresponding to the voice data to be recognized;
the obtaining of the acoustic coding feature corresponding to the voice data to be recognized by coding the voice acoustic feature includes:
and performing dimension-increasing coding on the voice acoustic characteristics based on a voice recognition network in the multi-dialect speech recognition model to obtain the acoustic coding characteristics corresponding to the voice data to be recognized.
3. The method of claim 2, wherein the dialect speech recognition is performed on the speech data to be recognized based on the dialect embedding feature and the acoustic coding feature, and obtaining target text information and a target dialect category corresponding to the speech data to be recognized comprises:
performing high-dimensional feature extraction on the dialect embedding features through a feedforward neural network in the multi-dialect speech recognition model to obtain dialect depth features corresponding to the voice data to be recognized;
inputting the dialect depth feature into the voice recognition network, and combining and splicing the dialect depth feature and the acoustic coding feature based on the voice recognition network to obtain a splicing feature corresponding to the voice data to be recognized;
and predicting based on the splicing characteristics to obtain target text information corresponding to the voice data to be recognized and the target dialect category.
4. The method of claim 3, wherein the speech recognition network comprises a backbone classification sub-network and an auxiliary classification sub-network; the predicting based on the splicing characteristics to obtain the target text information corresponding to the voice data to be recognized and the target dialect category comprises:
inputting the splicing features into the backbone classification sub-network and the auxiliary classification sub-network respectively for decoding to obtain candidate results output by the backbone classification sub-network and the auxiliary classification sub-network respectively, wherein each candidate result comprises candidate text information and a candidate dialect category for the voice data to be recognized;
and selecting one from the candidate results, and taking the candidate text information and the candidate dialect category contained in the selected candidate result as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
5. The method of claim 3, wherein the speech recognition network comprises a backbone classification sub-network and an auxiliary classification sub-network; the predicting based on the splicing characteristics to obtain the target text information corresponding to the voice data to be recognized and the target dialect category comprises:
inputting the splicing features into the auxiliary classification sub-network for decoding to obtain a plurality of candidate results output by the auxiliary classification sub-network, wherein each candidate result comprises candidate text information and a candidate dialect category for the voice data to be recognized;
and respectively inputting each candidate result into the backbone classification sub-network for decoding to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
6. The method of claim 5, wherein the respectively inputting each candidate result into the backbone classification sub-network for decoding to obtain the target text information and the target dialect category corresponding to the voice data to be recognized comprises:
inputting each candidate result into the backbone classification sub-network for decoding to obtain an evaluation value of each candidate result;
and selecting one of the candidate results based on the evaluation values, and taking the candidate text information and the candidate dialect category contained in the selected candidate result as the target text information and the target dialect category corresponding to the voice data to be recognized respectively.
7. The method of claim 2, wherein the dialect identification network includes a deep feature extraction sub-network, a temporal pooling sub-network, and a dialect classification sub-network;
the performing deep feature extraction on the voice acoustic features based on a dialect recognition network in a trained multi-dialect speech recognition model to obtain the dialect embedding features corresponding to the voice data to be recognized comprises:
respectively inputting the voice acoustic features of each frame of voice data into the depth feature extraction sub-network, and acquiring the frame-level depth features corresponding to each frame of voice data;
integrating each frame level depth feature related to time sequence through a time sequence pooling sub-network to obtain a sentence level feature vector with fixed dimensionality corresponding to the voice data to be recognized;
and performing dimensionality reduction processing on the sentence-level feature vector to obtain the dialect embedding feature.
8. The method of any of claims 2 to 7, wherein the multi-dialect speech recognition model is trained by:
according to training samples in a training sample data set, performing cyclic iterative training on the multi-dialect speech recognition model, and outputting the multi-dialect speech recognition model after training is completed, wherein the training sample data set comprises training samples corresponding to the various dialects contained in the target language; wherein the following operations are executed in one loop iteration training process:
selecting training samples from the training sample data set, inputting the selected training samples into the multi-dialect speech recognition model, and acquiring predicted dialect categories and predicted text information output by the multi-dialect speech recognition model;
and adjusting parameters of the multi-dialect speech recognition model based on the predicted dialect categories, the predicted text information and the real text information.
9. The method of claim 8, wherein the speech recognition network comprises a backbone classification sub-network and an auxiliary classification sub-network; the obtaining of the predicted dialect category and the predicted text information output by the multi-dialect speech recognition model comprises:
obtaining a first prediction dialect category output by the dialect identification network; and
and acquiring a second prediction dialect category and prediction text information output by the main classification sub-network and the auxiliary classification sub-network respectively.
10. The method of claim 9, wherein each training sample comprises sample speech data, and a real dialect category and real text information corresponding to the sample speech data;
the adjusting parameters of the multi-dialect speech recognition model based on the predicted dialect category, the predicted text information and the real text information comprises:
performing parameter adjustment on the dialect recognition network based on a difference between the real dialect category of the sample speech data in the training sample and the first predicted dialect category; and
and adjusting parameters of the speech recognition network based on the difference between the real dialect category and each second prediction dialect category and the difference between the real text information and each prediction text information of the sample speech data in the training sample.
11. The method of claim 9, wherein the performing parameter adjustments on the speech recognition network based on differences between the real dialect categories and respective second predicted dialect categories and differences between real text information and respective predicted text information of sample speech data in the training samples comprises:
determining a first loss function corresponding to the backbone classification subnetwork based on a difference between a second predicted dialect class and the real dialect class output by the backbone classification subnetwork and a difference between the predicted text information and the real text information; and
determining a second loss function corresponding to the auxiliary classification subnetwork based on a difference between a second predicted dialect class output by the auxiliary classification subnetwork and the real dialect class and a difference between the second predicted text information and the real text information;
and adjusting parameters of the voice recognition network based on the first loss function and the second loss function.
12. The method according to any one of claims 2 to 7 and 9 to 11, wherein the initial network parameters of the speech recognition network are obtained by performing parameter migration based on a standard speech recognition model, and the standard speech recognition model is obtained by performing training based on sample standard speech data.
13. A speech recognition apparatus, comprising:
the voice acquisition unit is used for acquiring voice data to be recognized of a target language, and the target language comprises a plurality of dialects;
the first extraction unit is used for extracting the voice acoustic characteristics corresponding to each frame of voice data in the voice data to be recognized after the voice data to be recognized is subjected to framing processing;
the second extraction unit is used for extracting the depth features of the voice acoustic features to obtain dialect embedded features corresponding to the voice data to be recognized; the voice acoustic features are coded to obtain acoustic coding features corresponding to the voice data to be recognized, and the dialect embedding features are used for representing dialect information of a dialect to which the voice data to be recognized belongs;
and the voice recognition unit is used for carrying out dialect voice recognition on the voice data to be recognized based on the dialect embedding characteristics and the acoustic coding characteristics to obtain target text information and a target dialect category corresponding to the voice data to be recognized.
14. An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 12.
15. A computer-readable storage medium, characterized in that it comprises program code for causing an electronic device to carry out the steps of the method of any one of claims 1 to 12, when said storage medium is run on said electronic device.
16. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 12.
CN202111352684.8A 2021-11-16 2021-11-16 Voice recognition method and device, electronic equipment and storage medium Active CN113823262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111352684.8A CN113823262B (en) 2021-11-16 2021-11-16 Voice recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113823262A true CN113823262A (en) 2021-12-21
CN113823262B CN113823262B (en) 2022-02-11

Family

ID=78919573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111352684.8A Active CN113823262B (en) 2021-11-16 2021-11-16 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113823262B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160099010A1 (en) * 2014-10-03 2016-04-07 Google Inc. Convolutional, long short-term memory, fully connected deep neural networks
US10755709B1 (en) * 2016-12-20 2020-08-25 Amazon Technologies, Inc. User recognition for speech processing systems
CN109979432A (en) * 2019-04-02 2019-07-05 科大讯飞股份有限公司 A kind of dialect translation method and device
CN110033760A (en) * 2019-04-15 2019-07-19 北京百度网讯科技有限公司 Modeling method, device and the equipment of speech recognition
CN112652300A (en) * 2020-12-24 2021-04-13 百果园技术(新加坡)有限公司 Multi-party speech sound identification method, device, equipment and storage medium
CN112951206A (en) * 2021-02-08 2021-06-11 天津大学 Tibetan Tibet dialect spoken language identification method based on deep time delay neural network
CN113469338A (en) * 2021-06-30 2021-10-01 平安科技(深圳)有限公司 Model training method, model training device, terminal device, and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495904A (en) * 2022-04-13 2022-05-13 阿里巴巴(中国)有限公司 Speech recognition method and device
CN114743545A (en) * 2022-06-14 2022-07-12 联通(广东)产业互联网有限公司 Dialect type prediction model training method and device and storage medium
CN114743545B (en) * 2022-06-14 2022-09-02 联通(广东)产业互联网有限公司 Dialect type prediction model training method and device and storage medium
CN115512692A (en) * 2022-11-04 2022-12-23 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
CN115512692B (en) * 2022-11-04 2023-02-28 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
WO2024093578A1 (en) * 2022-11-04 2024-05-10 腾讯科技(深圳)有限公司 Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN116578667A (en) * 2023-07-13 2023-08-11 湖南惠农科技有限公司 Agricultural information service terminal based on agricultural big data management
CN116629346A (en) * 2023-07-24 2023-08-22 成都云栈科技有限公司 Model training method and device for laboratory knowledge inheritance
CN116629346B (en) * 2023-07-24 2023-10-20 成都云栈科技有限公司 Language model training method and device
CN117292675A (en) * 2023-10-24 2023-12-26 哈尔滨理工大学 Language identification method based on deep time sequence feature representation

Also Published As

Publication number Publication date
CN113823262B (en) 2022-02-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant