CN117636849A - Speech recognition method and speech recognition model training method - Google Patents

Speech recognition method and speech recognition model training method

Info

Publication number
CN117636849A
Authority
CN
China
Prior art keywords
sample
voice
text
feature
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311460525.9A
Other languages
Chinese (zh)
Inventor
俞帆
王浩旭
张仕良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Alibaba Cloud Feitian Information Technology Co ltd filed Critical Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority to CN202311460525.9A priority Critical patent/CN117636849A/en
Publication of CN117636849A publication Critical patent/CN117636849A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the specification provides a voice recognition method and a voice recognition model training method, wherein the voice recognition method comprises the following steps: acquiring target voice data and associated text data corresponding to the target voice data; invoking a voice coding unit of a voice recognition model to code target voice data to obtain initial voice characteristics, invoking a text coding unit of the voice recognition model to code associated text data to obtain initial text characteristics, wherein the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training; invoking a first fusion unit of the voice recognition model, and fusing the initial text features to the initial voice features to obtain target voice features; and calling a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data. Because the voice recognition model has higher keyword recognition capability, and abundant contextual text information is integrated in the target voice characteristics, the voice recognition performance is improved.

Description

Speech recognition method and speech recognition model training method
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to a voice recognition method.
Background
With the development of computer technology, end-to-end speech recognition achieves good recognition results in many scenarios. However, in complex online conference or online course scenarios, recognition errors are likely to occur for keywords such as personal names, place names, technical terms and entity nouns in the speech, and these keywords are often very important to users.
At present, a large amount of text information is generally utilized to improve the recognition performance of a voice recognition model on keywords. However, merely increasing the amount of text information still cannot improve the recognition performance well, so the recognition performance of the voice recognition model remains poor, and a high-performance voice recognition scheme is therefore needed.
Disclosure of Invention
In view of this, the present embodiments provide a voice recognition method. One or more embodiments of the present disclosure relate to a method for training a speech recognition model, a speech recognition apparatus, a speech recognition model training apparatus, a computing device, a computer-readable storage medium, and a computer program for solving the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present specification, there is provided a speech recognition method, including:
Acquiring target voice data and associated text data corresponding to the target voice data;
invoking a voice coding unit of a voice recognition model to code target voice data to obtain initial voice characteristics, invoking a text coding unit of the voice recognition model to code associated text data to obtain initial text characteristics, wherein the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training;
invoking a first fusion unit of the voice recognition model, and fusing the initial text features to the initial voice features to obtain target voice features;
and calling a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data.
According to a second aspect of embodiments of the present disclosure, there is provided a method for training a speech recognition model, applied to cloud-side equipment, including:
acquiring a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label, and the sample text label comprises a sample identification label and a sample keyword label;
invoking a voice encoding unit of the voice recognition model, encoding sample voice data to obtain sample voice characteristics, invoking a text encoding unit of the voice recognition model, and encoding sample associated text data to obtain sample text characteristics;
Invoking a first fusion unit of a voice recognition model, fusing sample text features to sample voice features to obtain sample fusion voice features, and invoking a second fusion unit, fusing the sample voice features to the sample text features to obtain sample fusion text features;
according to the sample fusion voice characteristics and the sample fusion text characteristics, determining a prediction recognition result and a prediction keyword corresponding to the sample voice data;
and adjusting model parameters of the voice recognition model according to the sample recognition label, the sample keyword label, the prediction recognition result and the prediction keyword to obtain the trained voice recognition model.
According to a third aspect of embodiments of the present specification, there is provided a speech recognition apparatus comprising:
the first acquisition module is configured to acquire target voice data and associated text data corresponding to the target voice data;
the first coding module is configured to call a voice coding unit of a voice recognition model, code target voice data to obtain initial voice characteristics, call a text coding unit of the voice recognition model, and code associated text data to obtain initial text characteristics, wherein the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training;
The first fusion module is configured to call a first fusion unit of the voice recognition model, fuse the initial text characteristics to the initial voice characteristics and obtain target voice characteristics;
the decoding module is configured to call a decoding unit of the voice recognition model and decode the target voice characteristics to obtain a voice recognition result of the target voice data.
According to a fourth aspect of embodiments of the present specification, there is provided a speech recognition model training apparatus applied to cloud-side equipment, including:
the second acquisition module is configured to acquire a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label, and the sample text label comprises a sample identification label and a sample keyword label;
the second coding module is configured to call a voice coding unit of the voice recognition model, code the sample voice data to obtain sample voice characteristics, call a text coding unit of the voice recognition model, and code the sample associated text data to obtain sample text characteristics;
the second fusion module is configured to call a first fusion unit of the voice recognition model, fuse the sample text features to the sample voice features to obtain sample fusion voice features, and call the second fusion unit, fuse the sample voice features to the sample text features to obtain sample fusion text features;
The determining module is configured to determine a prediction recognition result and a prediction keyword corresponding to the sample voice data according to the sample fusion voice characteristics and the sample fusion text characteristics;
and the adjusting module is configured to adjust the model parameters of the voice recognition model according to the sample recognition label, the sample keyword label, the prediction recognition result and the prediction keyword to obtain a trained voice recognition model.
According to a fifth aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer executable instructions that, when executed by the processor, implement the steps of the methods provided in the first or second aspects above.
According to a sixth aspect of embodiments of the present specification, there is provided a computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the method provided in the first or second aspect above.
According to a seventh aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the method provided in the first or second aspect described above.
According to the voice recognition method provided by the embodiment of the specification, target voice data and associated text data corresponding to the target voice data are obtained; invoking a voice coding unit of a voice recognition model to code target voice data to obtain initial voice characteristics, invoking a text coding unit of the voice recognition model to code associated text data to obtain initial text characteristics, wherein the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training; invoking a first fusion unit of the voice recognition model, and fusing the initial text features to the initial voice features to obtain target voice features; and calling a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data. Because the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training, the voice recognition model has higher keyword recognition capability, and the target voice features are obtained by fusing initial text features to the initial voice features, so that abundant contextual text information is fused into the target voice features, and the voice recognition performance is improved.
Drawings
FIG. 1 is a block diagram of a speech recognition system according to one embodiment of the present disclosure;
FIG. 2 is a block diagram of another speech recognition system provided in one embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of speech recognition provided in one embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for training a speech recognition model according to one embodiment of the present disclosure;
FIG. 5 is a process flow diagram of a speech recognition method according to one embodiment of the present disclosure;
FIG. 6 is a process flow diagram of a method for training a speech recognition model according to one embodiment of the present disclosure;
FIG. 7 is an interface diagram of a speech recognition interface provided in one embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a speech recognition device according to one embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a speech recognition model training device according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many forms other than those described herein, and those skilled in the art may make similar generalizations without departing from the spirit of the disclosure; therefore, this disclosure is not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
Furthermore, it should be noted that, user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, presented data, etc.) according to one or more embodiments of the present disclosure are information and data authorized by a user or sufficiently authorized by each party, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation entries for the user to select authorization or denial.
First, terms related to one or more embodiments of the present specification will be explained.
Voice recognition: speech recognition (ASR, Automatic Speech Recognition) is one of the artificial intelligence techniques, whose goal is to convert speech content into corresponding text.
Audio-video voice recognition: audio-visual speech recognition (AVSR, Audio-Visual Speech Recognition) is a technique that combines audio and video signals for speech recognition.
An encoding unit: an encoding unit (Encoder), also known as an Encoder, is a model for converting input data into a set of hidden states, typically used for processing sequence data, such as text, audio, video, etc.
Decoding unit: a decoding unit (Decoder), also called Decoder, is a model for converting a set of hidden states into an output sequence, typically used for processing sequence data, such as text, audio, video, etc.
Time sequence classification: connectionist temporal classification (CTC, Connectionist Temporal Classification) is a method used for speech recognition and sequence labeling tasks. Its main function is to convert an input sequence into an output sequence.
Cross entropy loss: cross-entropy loss (CE) is a commonly used loss function, primarily used to measure the difference between the predicted and actual results of a model.
Optical character recognition: optical character recognition (OCR, Optical Character Recognition) is a technique for converting text in an image into computer-readable text, used to automatically recognize and extract text information in images such as scanned documents, books and newspapers.
Keyword extraction: keyword extraction is a text processing technology mainly used to automatically extract representative and important keywords from text. It is commonly used in tasks such as information retrieval, text summarization and machine translation.
Keyword list: keyword list, also known as bias list (bias list), is a technique for improving model predictions, mainly for handling imbalance problems in data sets. When the model is trained, the model can be trained by preferentially using the samples in the bias word list, so that the model pays more attention to the categories with smaller quantity, and the performance of the model is improved.
In an online conference or online course scenario, a speech recognition model is prone to recognition errors for personal names, place names, technical terms and other entity names, and these keywords are often important to users. At the same time, in such scenarios, text data (such as the slides in a video) can improve the effectiveness of the speech recognition model. Therefore, how to effectively combine text data and voice data to improve the effect of voice recognition is an important issue. In addition, jointly modeling speech and text makes the model complex, and how to efficiently obtain the required keywords from the text data is also a challenging problem.
Currently, the decoding process can be biased towards keyword phrases by integrating an independently trained language model through shallow fusion, in which the biasing is implemented by adjusting the posterior probability of the keyword phrases. However, shallow fusion requires retraining the language model for different keyword lists and determining the fusion weights in each case, which consumes a lot of time and computational resources, so the recognition performance of the speech recognition model is still poor.
In order to solve the above-mentioned problems, an embodiment of the present disclosure proposes a speech recognition method, which improves speech recognition performance by using text data, and specifically, obtains target speech data and associated text data corresponding to the target speech data; invoking a voice coding unit of a voice recognition model to code target voice data to obtain initial voice characteristics, invoking a text coding unit of the voice recognition model to code associated text data to obtain initial text characteristics, wherein the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training; invoking a first fusion unit of the voice recognition model, and fusing the initial text features to the initial voice features to obtain target voice features; and calling a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data. Because the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training, the voice recognition model has higher keyword recognition capability, and the target voice features are obtained by fusing initial text features to the initial voice features, so that abundant contextual text information is fused into the target voice features, and the voice recognition performance is improved.
In the present specification, a speech recognition method is provided, and the present specification relates to a speech recognition model training method, a speech recognition apparatus, a speech recognition model training apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to fig. 1, fig. 1 illustrates an architecture diagram of a speech recognition system according to one embodiment of the present disclosure, where the speech recognition system may include a client 100 and a server 200;
the client 100 is configured to send target voice data and associated text data corresponding to the target voice data to the server 200;
the server 200 is configured to invoke a speech encoding unit of a speech recognition model, encode target speech data to obtain initial speech features, invoke a text encoding unit of the speech recognition model, and encode associated text data to obtain initial text features, where the speech recognition model is obtained by training based on speech recognition loss and keyword prediction loss; invoking a first fusion unit of the voice recognition model, and fusing the initial text features to the initial voice features to obtain target voice features; invoking a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data; transmitting the voice recognition result to the client 100;
The client 100 is further configured to receive a voice recognition result sent by the server 200.
By applying the scheme of the embodiment of the specification, the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training, so that the voice recognition model has higher keyword recognition capability, and the target voice features are obtained by fusing initial text features to the initial voice features, so that abundant contextual text information is fused into the target voice features, and the voice recognition performance is improved.
Referring to fig. 2, fig. 2 illustrates an architecture diagram of another speech recognition system provided in one embodiment of the present disclosure, where the speech recognition system may include a plurality of clients 100 and a server 200, where the clients 100 may be end-side devices and the server 200 may be cloud-side devices. Communication connection can be established between the plurality of clients 100 through the server 200, and in a speech recognition scenario, the server 200 is used to provide speech recognition services between the plurality of clients 100, and the plurality of clients 100 can respectively serve as a transmitting end or a receiving end, so that communication is realized through the server 200.
The user may interact with the server 200 through the client 100 to receive data transmitted from other clients 100, or transmit data to other clients 100, etc. In the speech recognition scenario, it may be that the user issues a data stream to the server 200 through the client 100, and the server 200 generates a speech recognition result according to the data stream and pushes the speech recognition result to other clients that establish communication.
Wherein, the client 100 and the server 200 establish a connection through a network. The network provides a medium for a communication link between client 100 and server 200. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The data transmitted by the client 100 may need to be encoded, transcoded, compressed, etc. before being distributed to the server 200.
The client 100 may be a browser, an APP (Application), a web application such as an H5 (HTML5, the 5th edition of HyperText Markup Language) application, a light application (also called an applet, a lightweight application), or a cloud application, etc. The client 100 may be developed based on a software development kit (SDK, Software Development Kit) of a corresponding service provided by the server 200, such as an SDK based on real-time communication (RTC, Real Time Communication). The client 100 may be deployed in an electronic device, and may need to run depending on the device or on some APP in the device, etc. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, a tablet computer or a personal computer. Various other types of applications are also commonly deployed in electronic devices, such as human-machine conversation applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The server 200 may include a server that provides various services, such as a server that provides communication services for multiple clients, a server for background training that provides support for a model used on a client, a server that processes data sent by a client, and so on. It should be noted that, the server 200 may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. The server may also be a server of a distributed system or a server that incorporates a blockchain. The server may also be a cloud server for cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
It should be noted that, the voice recognition method provided in the embodiment of the present disclosure is generally executed by the server, but in other embodiments of the present disclosure, the client may also have a similar function to the server, so as to execute the voice recognition method provided in the embodiment of the present disclosure. In other embodiments, the voice recognition method provided in the embodiments of the present disclosure may be performed by the client and the server together.
Referring to fig. 3, fig. 3 shows a flowchart of a voice recognition method according to an embodiment of the present disclosure, which specifically includes the following steps:
step 302: and acquiring the target voice data and associated text data corresponding to the target voice data.
In one or more embodiments of the present disclosure, target voice data and associated text data corresponding to the target voice data may be obtained, so as to improve voice recognition performance based on the text data.
Specifically, the target voice data refers to the object of voice recognition, and may also be referred to as the voice data to be recognized. The target voice data may be voice data in various scenarios, such as voice data in a conference scenario or voice data in a teaching scenario. The associated text data refers to text data that has an association relationship with the target voice data. The associated text data may be text data synchronized with the target voice data in real time, such as the text of slides in online conference and course videos. The associated text data may also be identification information of the text data synchronized with the target voice data in real time, such as the page number of a file, based on which the synchronized text data can be further obtained. The associated text data may also be a preset separation identifier < blank >, indicating that the target voice data has no associated text data.
In practical applications, there are various ways of obtaining the target voice data and the associated text data corresponding to the target voice data, and the method is specifically selected according to the practical situation, which is not limited in any way in the embodiments of the present disclosure.
In one possible implementation manner of the present disclosure, target voice data sent by a user through a client may be received, and associated text data corresponding to the target voice data. In another possible implementation manner of the present disclosure, since the text in the slide in the online conference and the course video is synchronized with the voice in real time, which means that there is an inherent association and a strong context between the voice and the text, the target voice data can be parsed from the target video, and the slide of the target video is subjected to optical character recognition, so as to obtain the associated text data corresponding to the target voice data. In still another possible implementation manner of the present disclosure, the target voice data and associated text data corresponding to the target voice data may be read from another data acquisition device or database.
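For illustration only, the following sketch shows one way the second implementation manner could be realized: the speech track is parsed out of a target video with the ffmpeg command-line tool and a slide frame is passed through optical character recognition. The use of ffmpeg, Pillow and pytesseract, the file names and the sampling of a single frame are assumptions of this sketch, not part of the described method.

```python
# Minimal sketch: extract the speech track from a video and OCR a slide frame.
# Assumes ffmpeg is installed and the Pillow + pytesseract packages are available;
# file names and the choice of frame are illustrative, not part of the patent.
import subprocess
from PIL import Image
import pytesseract

def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    # Strip the video stream and resample the audio to 16 kHz mono for ASR.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )
    return audio_path

def ocr_slide(frame_path: str) -> str:
    # Run OCR on one exported slide frame to obtain associated text data.
    return pytesseract.image_to_string(Image.open(frame_path))

if __name__ == "__main__":
    target_voice_data = extract_audio("meeting.mp4")
    associated_text_data = ocr_slide("slide_0001.png")
    print(associated_text_data[:200])
```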
Step 304: invoking a voice coding unit of a voice recognition model to code target voice data to obtain initial voice characteristics, invoking a text coding unit of the voice recognition model to code associated text data to obtain initial text characteristics, wherein the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training.
In one or more embodiments of the present disclosure, after target voice data and associated text data corresponding to the target voice data are obtained, further, high-dimensional feature learning may be performed on the target voice data and the associated text data by using a voice recognition model of a dual-coding unit structure of voice and text, so as to obtain initial voice features and initial text features.
Specifically, the speech recognition model may be referred to as a keyword speech recognition model (LCB-net, Long-Context Biasing network). The speech recognition loss characterizes the speech recognition accuracy on the sample speech features after the sample text features are fused in, and the keyword prediction loss characterizes the keyword prediction accuracy on the sample text features after the sample speech features are fused in. The speech recognition model may also be trained based on the speech recognition loss (CE, cross entropy), the keyword prediction loss (BCE, Binary Cross Entropy) and a reference matrix prediction loss, where the reference matrix prediction loss characterizes the coding accuracy of the coding unit and assists the coding unit in better modeling the speech features.
It should be noted that, in order to improve the fusion effect of text and voice and make the voice recognition model more flexible, the embodiment of the specification proposes a scheme of joint optimization of the voice recognition model and the keyword prediction unit. Within the joint learning framework, the speech recognition model can perform keyword editing on the latent features of the speech (the high-dimensional features of the speech, or the hidden-layer representation of the speech in the middle layers of the network), so that different keyword lists can be quickly adapted. Keyword editing can be understood as integrating the information of the keyword list into the voice feature, so that the voice feature learns the corresponding keyword information and the recognition effect on keywords is improved. Therefore, in an alternative embodiment of the present disclosure, before the text encoding unit of the speech recognition model is invoked to encode the associated text data to obtain the initial text feature, keyword extraction may be further performed on the associated text data to obtain target keywords, and the text encoding unit of the speech recognition model is then invoked to encode the target keywords to obtain the initial text feature.
It should be noted that, since the target keywords are not a complete sentence but individual words, the target keywords may be connected with a preset separation identifier, and the connected target keywords may be used as the updated associated text data. Further, the target keywords can be shuffled to obtain out-of-order target keywords, and the out-of-order target keywords are then connected with the preset separation identifier.
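A minimal sketch of the keyword concatenation described above, assuming Python; the literal separator token, the shuffling step and the function name are illustrative choices rather than details fixed by the text:

```python
# Sketch of preparing the associated text data from extracted target keywords:
# shuffle the keywords and join them with the preset separation identifier.
# The "<blank>" token and the function name are illustrative assumptions.
import random

SEP = "<blank>"

def build_biasing_text(keywords: list[str], shuffle: bool = True) -> str:
    kws = keywords[:]            # copy so the caller's list is untouched
    if shuffle:
        random.shuffle(kws)      # out-of-order (shuffled) keywords
    # An empty keyword list degenerates to the separator alone,
    # indicating that no text data is associated with the speech.
    return SEP.join(kws) if kws else SEP

print(build_biasing_text(["Alibaba Cloud", "LCB-net", "cross attention"]))
```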
Step 306: and calling a first fusion unit of the voice recognition model to fuse the initial text characteristics to the initial voice characteristics so as to obtain target voice characteristics.
In one or more embodiments of the present disclosure, target voice data and associated text data corresponding to the target voice data are obtained; invoking a voice encoding unit of the voice recognition model to encode target voice data to obtain initial voice characteristics, invoking a text encoding unit of the voice recognition model to encode associated text data to obtain initial text characteristics, and further invoking a first fusion unit of the voice recognition model to fuse the initial text characteristics to the initial voice characteristics to obtain target voice characteristics.
Specifically, the first fusing unit may fuse the initial text feature to the initial speech feature based on different fusion strategies. Fusion policies include, but are not limited to, feature level fusion policies, attention-based mechanism fusion policies. Taking attention mechanism based fusion policy as an example, the first fusion unit may be a speech-Context (AC) cross attention unit for fusing the initial text feature to the initial speech feature based on the cross attention mechanism to obtain the target speech feature. The target speech feature is a fusion speech feature.
In practical application, the first fusion unit of the speech recognition model is called to fuse the initial text feature to the initial speech feature, and various modes for obtaining the target speech feature are selected according to practical situations, which is not limited in the embodiment of the present specification. In one possible implementation manner of the present disclosure, a first fusion unit of the speech recognition model may be invoked to directly add or multiply the initial speech feature and the initial text feature to obtain the target speech feature. For example, the mel-frequency cepstral coefficient of the speech and the bag-of-word feature of the text may be added to obtain the target speech feature.
In another possible implementation manner of the present disclosure, a first fusion unit of a speech recognition model may be invoked, an initial text feature is fused to an initial speech feature based on a cross-attention mechanism, and a target speech feature is obtained, that is, the first fusion unit of the speech recognition model is invoked, and the initial text feature is fused to the initial speech feature, so as to obtain the target speech feature, and may include the following steps:
and calling a first fusion unit of the voice recognition model, taking the initial voice feature as a target query vector, taking the initial text feature as a target key vector and a target value vector, and calculating by using a cross attention mechanism to obtain the target voice feature.
Specifically, the query vector is used to calculate the attention weight, the key vector is used to represent the importance of the input vector, and the value vector is the final output vector. Cross-attention (Cross-Attention) is an attention mechanism that deals with the relationship between two input sequences. The cross-attention mechanism can process two input sequences simultaneously and can update the hidden states of both sequences simultaneously.
When the target voice feature is obtained by calculating by using the cross attention mechanism, the dot product of the target query vector and the target key vector can be calculated to obtain a target attention weight vector, and the target attention weight vector and the target value vector are further weighted and summed to obtain the target voice feature.
By applying the scheme of the embodiment of the specification, a first fusion unit of a voice recognition model is called, initial voice features are used as target query vectors, initial text features are used as target key vectors and target value vectors, and the target voice features are calculated by using a cross attention mechanism. Through the calculation of the cross attention mechanism, the initial text features can be fused to the initial voice features, and rich contextual text information is fused to the target voice features, so that the voice recognition performance is improved.
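As a non-authoritative sketch of the fusion in step 306, the following code treats the initial speech feature as the query and the initial text feature as key and value; the softmax normalization and the 1/sqrt(d) scaling are standard attention details assumed here, and the feature shapes are illustrative:

```python
# Sketch of the first fusion unit: speech features (query) attend over text
# features (key/value). Softmax normalization and 1/sqrt(d) scaling are
# standard attention details assumed here; shapes are illustrative.
import numpy as np

def cross_attention(query: np.ndarray, key: np.ndarray, value: np.ndarray) -> np.ndarray:
    # query: (T_speech, d), key/value: (T_text, d)
    scores = query @ key.T / np.sqrt(query.shape[-1])            # attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over text positions
    return weights @ value                                       # weighted sum of value vectors

initial_speech = np.random.randn(50, 256)   # T_speech x d
initial_text = np.random.randn(20, 256)     # T_text x d
target_speech = cross_attention(initial_speech, initial_text, initial_text)
print(target_speech.shape)                  # (50, 256)
```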
Step 308: and calling a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data.
In one or more embodiments of the present disclosure, target voice data and associated text data corresponding to the target voice data are obtained; invoking a voice encoding unit of the voice recognition model to encode target voice data to obtain initial voice characteristics, invoking a text encoding unit of the voice recognition model to encode associated text data to obtain initial text characteristics; and invoking a first fusion unit of the voice recognition model to fuse the initial text feature to the initial voice feature, and further invoking a decoding unit of the voice recognition model to decode the target voice feature to obtain a voice recognition result of the target voice data after obtaining the target voice feature.
Specifically, the voice recognition result is text data corresponding to the target voice data, and the text data comprises voice content in the target voice data.
In practical application, the decoding unit of the speech recognition model is called, and various modes for decoding the target speech features to obtain the speech recognition result of the target speech data are selected according to practical situations, which is not limited in the specification. In one possible implementation manner of the present disclosure, the target speech feature may be directly input into the decoding unit, so as to obtain a speech recognition result. In another possible implementation manner of the present disclosure, the target speech feature may be input to the time sequence classification unit to obtain the reference matrix, and the reference matrix and the target speech feature may be further input to the decoding unit to obtain the speech recognition result.
By applying the scheme of the embodiment of the specification, the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training, so that the voice recognition model has higher keyword recognition capability, and the target voice features are obtained by fusing initial text features to the initial voice features, so that abundant contextual text information is fused into the target voice features, and the voice recognition performance is improved.
In an optional embodiment of the present disclosure, before invoking the decoding unit of the speech recognition model to decode the target speech feature to obtain the speech recognition result of the target speech data, the method may further include the following steps:
invoking a time sequence classification unit of the voice recognition model, and classifying the target voice features to obtain a reference matrix corresponding to the target voice data, wherein the reference matrix characterizes the decoding path distribution probability of the target voice features;
invoking a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data, wherein the method comprises the following steps:
and calling a decoding unit of the voice recognition model, and decoding the target voice features based on the reference matrix to obtain a voice recognition result of the target voice data.
Specifically, when the time sequence classification unit of the voice recognition model is called to classify the target voice features to obtain the reference matrix corresponding to the target voice data, the target voice features can be directly input into the time sequence classification unit to obtain the reference matrix.
Further, after the reference matrix is obtained, the reference matrix and the target speech feature may be input to a decoding unit of the speech recognition model, and the reference matrix and the target speech feature may be subjected to weighted decoding in the decoding unit to generate a speech recognition result of the target speech data.
By applying the scheme of the embodiment of the specification, a time sequence classification unit of a voice recognition model is called, and target voice characteristics are classified to obtain a reference matrix corresponding to target voice data; and calling a decoding unit of the voice recognition model, and decoding the target voice features based on the reference matrix to obtain a voice recognition result of the target voice data. The voice time sequence unit assists the voice recognition model to better model voice data, so that the accuracy of voice recognition is ensured.
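The following sketch illustrates, under assumptions, how the reference matrix (the per-frame decoding-path distribution produced by the time sequence classification unit) and the decoder scores could be combined in weighted decoding; the weight value is an assumption, since the text only states that the two are decoded jointly:

```python
# Sketch of the "reference matrix" and weighted decoding. ctc_weight is an
# illustrative assumption; the text only states that the reference matrix and
# the target speech features are weighted and decoded together.
import numpy as np

def ctc_reference_matrix(frame_logits: np.ndarray) -> np.ndarray:
    # Per-frame probability distribution over output tokens (decoding paths).
    e = np.exp(frame_logits - frame_logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_score(ctc_logprob: float, decoder_logprob: float, ctc_weight: float = 0.3) -> float:
    # Weighted combination of the two scores for one candidate hypothesis.
    return ctc_weight * ctc_logprob + (1.0 - ctc_weight) * decoder_logprob

frame_logits = np.random.randn(100, 5000)            # T frames x vocabulary size
reference_matrix = ctc_reference_matrix(frame_logits)
print(reference_matrix.shape, joint_score(-12.3, -10.1))
```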
In an optional embodiment of the present disclosure, the speech recognition model may be trained before the speech coding unit of the speech recognition model is invoked to encode the target speech data to obtain the initial speech feature. That is, before the speech coding unit of the speech recognition model is invoked to encode the target speech data to obtain the initial speech feature, the method may further include the following steps:
acquiring a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label, and the sample text label comprises a sample identification label and a sample keyword label;
inputting a plurality of sample voice data and sample associated text data into a voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data;
Calculating speech recognition loss according to the sample recognition tag and the prediction recognition result;
calculating a keyword prediction loss according to the sample keyword label and the predicted keyword;
and adjusting model parameters of the voice recognition model according to the voice recognition loss and the keyword prediction loss to obtain the trained voice recognition model.
Specifically, the sample text label is a real label corresponding to the sample voice data, and can be also understood as a model prediction target.
It should be noted that, the implementation manner of "obtaining a plurality of sample voice data and sample associated text data corresponding to each sample voice data" is the same as the manner of "obtaining target voice data and associated text data corresponding to target voice data" described above, and the embodiments of this specification will not be repeated.
Further, model parameters of the voice recognition model are adjusted according to the voice recognition loss and the keyword prediction loss, when the trained voice recognition model is obtained, the voice recognition loss and the keyword prediction loss can be weighted to obtain total loss, and the model parameters of the voice recognition model are adjusted based on the total loss to obtain the trained voice recognition model.
In practical application, the speech recognition model further comprises a time sequence classification unit, and the sample text label further comprises a sample reference matrix label. The plurality of sample voice data and the sample associated text data are input into a voice recognition model, and a prediction reference matrix can be obtained in addition to a prediction recognition result and a prediction keyword corresponding to each sample voice data. And calculating the reference matrix prediction loss according to the sample reference matrix label and the prediction reference matrix, and adjusting model parameters of the speech recognition model according to the speech recognition loss, the keyword prediction loss and the reference matrix prediction loss to obtain the trained speech recognition model.
By applying the scheme of the embodiment of the specification, the model parameters of the voice recognition model are adjusted according to the voice recognition loss and the keyword prediction loss, so that the trained voice recognition model is obtained, and the voice recognition model has higher voice recognition capability and keyword recognition capability.
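A minimal sketch of the joint training objective described above, assuming PyTorch; the loss weights are illustrative assumptions, and the reference matrix prediction loss is shown as an optional pre-computed CTC-style term:

```python
# Sketch of the joint objective: speech recognition (CE) loss, keyword
# prediction (BCE) loss and, optionally, a reference matrix prediction loss,
# combined with weights. The weight values are assumptions, not given values.
import torch
import torch.nn.functional as F

def total_loss(dec_logits, text_labels, kw_logits, kw_labels,
               ctc_loss=None, w_asr=1.0, w_kw=1.0, w_ctc=0.3):
    asr_loss = F.cross_entropy(dec_logits.transpose(1, 2), text_labels)   # speech recognition loss
    kw_loss = F.binary_cross_entropy_with_logits(kw_logits, kw_labels)    # keyword prediction loss
    loss = w_asr * asr_loss + w_kw * kw_loss
    if ctc_loss is not None:                                              # reference matrix prediction loss
        loss = loss + w_ctc * ctc_loss
    return loss

dec_logits = torch.randn(2, 10, 500)                    # batch x steps x vocab
text_labels = torch.randint(0, 500, (2, 10))
kw_logits = torch.randn(2, 500)                         # keyword presence logits
kw_labels = torch.randint(0, 2, (2, 500)).float()
print(total_loss(dec_logits, text_labels, kw_logits, kw_labels).item())
```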
In an optional embodiment of the present disclosure, before inputting the plurality of sample voice data and the sample related text data into the voice recognition model to obtain the predicted recognition result and the predicted keyword corresponding to each sample voice data, the method may further include the following steps:
extracting keywords from the sample associated text data to obtain sample associated keywords;
inputting a plurality of sample voice data and sample associated text data into a voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data, the method may include the following steps:
and inputting the plurality of sample voice data and the sample associated keywords into a voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data.
It should be noted that there are various methods for extracting keywords from the sample associated text data to obtain the sample associated keywords, which are selected according to the actual situation and are not limited in the embodiments of the present disclosure. In one possible implementation manner of the present specification, conventional unsupervised keyword extraction algorithms, such as TF-IDF and TextRank, may be used to extract the sample associated keywords from the sample associated text data. In another possible implementation manner of the present disclosure, a keyword extraction model may be trained with a supervised keyword extraction algorithm on an already-labeled training corpus, and the sample associated keywords may be extracted from the sample associated text data by the keyword extraction model.
Further, when a plurality of sample voice data and sample associated keywords are input into the voice recognition model, the sample associated keywords can be connected with the preset separation identifier, and the connected sample associated keywords are input into the voice recognition model. Further, the sample associated keywords can be shuffled to obtain out-of-order sample associated keywords, and the out-of-order sample associated keywords are then connected with the preset separation identifier.
By applying the scheme of the embodiment of the specification, extracting keywords from the sample associated text data to obtain sample associated keywords; and inputting a plurality of sample voice data and sample related keywords into a voice recognition model to obtain a predicted recognition result and predicted keywords corresponding to each sample voice data, thereby improving the keyword prediction efficiency.
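For illustration, the following sketch performs a simple TF-IDF-style scoring, one of the conventional unsupervised algorithms mentioned above; the whitespace tokenization and the top-k cut-off are assumptions:

```python
# Sketch of unsupervised keyword extraction (a simple TF-IDF scoring, one of
# the conventional algorithms mentioned above). Tokenization and the top-k
# cut-off are illustrative assumptions.
import math
from collections import Counter

def extract_keywords(documents: list[str], doc_index: int, top_k: int = 5) -> list[str]:
    tokenized = [doc.lower().split() for doc in documents]
    df = Counter(word for doc in tokenized for word in set(doc))   # document frequency
    tf = Counter(tokenized[doc_index])                             # term frequency in the target doc
    n_docs = len(documents)
    scores = {w: c * math.log(n_docs / (1 + df[w])) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

docs = ["long context biasing network for speech recognition",
        "the meeting discussed quarterly revenue"]
print(extract_keywords(docs, 0))
```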
In practical applications, with data of limited scale, the number and length of the keywords may vary. To avoid the performance of the speech recognition model being affected by diverse keyword lists, and to avoid the speech recognition model depending excessively on a limited keyword list, the sample associated keywords may be updated using the simulation strategies described in the following embodiments.
In an optional embodiment of the present disclosure, the sample associated keywords may be updated based on a keyword simulation strategy. That is, after the keywords are extracted from the sample associated text data to obtain the sample associated keywords, the method may further include the following steps:
word segmentation is carried out on the sample associated keywords, and a plurality of sub-word units of the sample associated keywords are determined;
and updating the sample associated keywords according to the preset separation mark and the plurality of sub-word units to obtain updated sample associated keywords.
Specifically, the subword units (BPE, byte Pair Encoding) are word segments of sample associated keywords. The method for word segmentation of the sample associated keywords is various, and is specifically selected according to practical situations, which is not limited in any way in the embodiment of the present specification. In one possible implementation manner of the present disclosure, a predefined sub-word dictionary may be utilized to match a sample associated keyword with a sub-word in the sub-word dictionary, where the matched sub-word is a sub-word unit. In another possible implementation manner of the present specification, a plurality of subword units of the sample associated keywords may be predicted by using a subword prediction model.
It should be noted that after determining the multiple sub-word units of the sample associated keyword, the complete sub-word unit of the sample associated keyword may be randomly selected and combined with the preset separation identifier to be used as the updated sample associated keyword.
Illustratively, assume that the sample association keyword is "intricate, chairs, source", and the updated sample association keyword is "_inter < blank > _pairs < blank > _source".
By applying the scheme of the embodiment of the specification, the sample associated keywords are segmented, and a plurality of sub-word units of the sample associated keywords are determined; and updating the sample associated keywords according to the preset separation mark and the plurality of sub-word units to obtain updated sample associated keywords. The voice recognition model can be easily popularized to various context keywords, so that more practical and stable voice recognition is realized, and the robustness and generalization of the model are enhanced through dynamically simulating the keywords.
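A sketch of the keyword-level simulation strategy under assumptions: each sample associated keyword is kept as its complete sub-word sequence and a random subset of keywords is joined with the separation identifier. The toy bpe_split function stands in for a real BPE tokenizer and is not the segmentation used by the model:

```python
# Sketch of the keyword-level simulation strategy: each keyword is kept as its
# complete sub-word sequence, and a random subset of keywords is joined with
# the preset separation identifier. bpe_split() is a toy stand-in for BPE.
import random

SEP = "<blank>"

def bpe_split(word: str, piece_len: int = 3) -> list[str]:
    # Toy segmentation into fixed-size pieces; "_" marks a word-initial piece.
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return ["_" + pieces[0]] + pieces[1:]

def simulate_keywords(keywords: list[str], keep_prob: float = 0.7) -> str:
    kept = [kw for kw in keywords if random.random() < keep_prob] or keywords[:1]
    return SEP.join("".join(bpe_split(kw)) for kw in kept)

print(simulate_keywords(["intricate", "chairs", "source"]))
```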
In an optional embodiment of the present disclosure, the sample associated keywords may be updated based on a sub-word unit simulation strategy for the keywords. That is, after the keywords are extracted from the sample associated text data to obtain the sample associated keywords, the method may further include the following steps:
word segmentation is carried out on the sample associated keywords, and a plurality of sub-word units of the sample associated keywords are determined;
randomly selecting at least one target subword unit from the plurality of subword units;
And updating the sample associated keywords according to the preset separation mark and at least one target sub-word unit to obtain updated sample associated keywords.
It should be noted that, after determining the multiple sub-word units of the sample associated keyword, at least one target sub-word unit may be randomly selected from the complete sub-word units of the sample associated keyword and combined with the preset separation identifier to be used as the updated sample associated keyword. For example, a target sub-word unit selected from the keyword "intricate" may be "_in".
Illustratively, assume that the sample association keyword is "intricate, going, portables, batter", and the updated sample association keyword is "< blank > _going < blank > _able < blank > _ter".
By applying the scheme of the embodiment of the specification, the sample associated keywords are segmented and a plurality of sub-word units of the sample associated keywords are determined; at least one target sub-word unit is randomly selected from the plurality of sub-word units; and the sample associated keywords are updated according to the preset separation identifier and the at least one target sub-word unit, so as to obtain updated sample associated keywords. The voice recognition model can thus be easily generalized to various context keywords, realizing more practical and stable voice recognition. Moreover, since the speech recognition model is a fine-grained model built on sub-word units, a keyword A can be used to help predict a keyword B that has the same prefix sub-word units as A; for example, the keyword "KATHERINE", which shares the prefix "KATH", can help predict the keyword "KATHY".
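A companion sketch of the sub-word-unit simulation strategy, with the same toy segmentation assumption: individual sub-word units are sampled at random from the segmented keywords and joined with the separation identifier:

```python
# Sketch of the sub-word-unit simulation strategy: individual sub-word units
# are sampled at random from the segmented keywords and joined with the
# separation identifier. bpe_split() is the same toy stand-in as above.
import random

SEP = "<blank>"

def bpe_split(word: str, piece_len: int = 3) -> list[str]:
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return ["_" + pieces[0]] + pieces[1:]

def simulate_subword_units(keywords: list[str], pick_prob: float = 0.3) -> str:
    units = [u for kw in keywords for u in bpe_split(kw)]
    picked = [u for u in units if random.random() < pick_prob] or units[:1]
    return SEP.join(picked)

print(simulate_subword_units(["intricate", "going", "portables", "batter"]))
```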
In an optional embodiment of the present disclosure, the inputting a plurality of sample voice data and sample related text data into a voice recognition model to obtain a predicted recognition result and a predicted keyword corresponding to each sample voice data may include the following steps:
for first sample voice data, calling a voice encoding unit, encoding the first sample voice data to obtain first sample voice characteristics, and calling a text encoding unit, and encoding first sample associated text data corresponding to the first sample voice data to obtain first sample text characteristics, wherein the first sample voice data is any one of a plurality of sample voice data;
invoking a first fusion unit to fuse the first sample text feature to the first sample voice feature to obtain a first fused voice feature, and invoking a second fusion unit to fuse the first sample voice feature to the first sample text feature to obtain a first fused text feature;
and determining a first prediction recognition result and a first prediction keyword corresponding to the first sample voice data according to the first fusion voice characteristic and the first fusion text characteristic.
It should be noted that the second fusion unit may be a fusion unit independent of the speech recognition model, or may be disposed in the speech recognition model. The second fusing unit may fuse the first sample speech feature to the first sample text feature based on different fusion strategies to obtain a first fused text feature. Fusion policies include, but are not limited to, feature level fusion policies, attention-based mechanism fusion policies. Taking attention mechanism based fusion policy as an example, the second fusion unit may be a Context-aware (CA) cross-attention unit for fusing the first sample speech feature to the first sample text feature based on a cross-attention mechanism to obtain a first fused text feature.
In practical application, the implementation of "calling a voice coding unit to code the first sample voice data to obtain the first sample voice feature, and calling a text coding unit to code the first sample associated text data corresponding to the first sample voice data to obtain the first sample text feature" is the same as the above-described manner of "calling the voice coding unit of the voice recognition model to code the target voice data to obtain the initial voice feature, and calling the text coding unit of the voice recognition model to code the associated text data to obtain the initial text feature"; the implementation of "invoking the first fusion unit to fuse the first sample text feature to the first sample voice feature to obtain the first fused voice feature, and invoking the second fusion unit to fuse the first sample voice feature to the first sample text feature to obtain the first fused text feature" is the same as the above-described implementation of "invoking the first fusion unit of the voice recognition model to fuse the initial text feature to the initial voice feature to obtain the target voice feature", and the embodiments of this specification will not repeat them.
It is worth noting that the first sample speech feature may be generated by the following formula (1), and the first sample text feature may be generated by the following formula (2):
H_a = Encoder_a(X)    (1)
H_c = Encoder_c(C)    (2)
wherein Encoder_a and Encoder_c represent the encoding operations, H_a represents the first sample speech feature, H_c represents the first sample text feature, X represents the first sample speech data, and C represents the first sample associated text data.
By applying the scheme of the embodiment of the specification, the voice coding unit, the text coding unit, the first fusion unit and the second fusion unit are utilized to determine the first sample voice feature, the first sample text feature, the first fusion voice feature and the first fusion text feature, and further, according to the first fusion voice feature and the first fusion text feature, a first prediction recognition result and a first prediction keyword corresponding to the first sample voice data are determined, and the voice recognition prediction performance and the keyword prediction performance are improved through feature fusion.
In an optional embodiment of the present disclosure, the invoking the first fusion unit to fuse the first sample text feature to the first sample speech feature to obtain a first fused speech feature may include the following steps:
invoking a first fusion unit, taking the first sample voice feature as a target sample query vector, taking the first sample text feature as a target sample key vector and a target sample value vector, and calculating by using a cross attention mechanism to obtain a first fusion voice feature;
Invoking a second fusion unit to fuse the first sample speech feature to the first sample text feature to obtain a first fused text feature, comprising:
and calling a second fusion unit, taking the first sample text feature as a reference sample query vector, taking the first sample voice feature as a reference sample key vector and a reference sample value vector, and calculating by using a cross attention mechanism to obtain the first fusion text feature.
It should be noted that, the implementation manner of "taking the first sample voice feature as the target sample query vector, taking the first sample text feature as the target sample key vector and the target sample value vector, and calculating to obtain the first fused voice feature by using the cross attention mechanism" and "taking the first sample text feature as the reference sample query vector, taking the first sample voice feature as the reference sample key vector and the reference sample value vector, and calculating to obtain the first fused text feature by using the cross attention mechanism" is the same as the implementation manner of "taking the initial voice feature as the target query vector, taking the initial text feature as the target key vector and the target value vector, and calculating to obtain the target voice feature by using the cross attention mechanism" described above, and the embodiment of the present specification will not be repeated.
In practical application, the first fused speech feature may be generated by the following formula (3), and the first fused text feature may be generated by the following formula (4):
H_ac = CrossAttention_ac(H_a, H_c, H_c)   (3)

H_ca = CrossAttention_ca(H_c, H_a, H_a)   (4)

wherein CrossAttention_ac represents the cross-attention mechanism of the first fusion unit, CrossAttention_ca represents the cross-attention mechanism of the second fusion unit, H_ac represents the first fused speech feature, and H_ca represents the first fused text feature.
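The cross-attention computations of equations (3) and (4) can be sketched with standard multi-head attention, where the query of the first fusion unit is the speech feature and the query of the second fusion unit is the text feature. The head count and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 256
cross_attn_ac = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)  # first fusion unit
cross_attn_ca = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)  # second fusion unit

h_a = torch.randn(1, 120, d_model)  # sample speech feature H_a
h_c = torch.randn(1, 16, d_model)   # sample text feature H_c

# Equation (3): speech feature as query, text feature as key and value
h_ac, _ = cross_attn_ac(query=h_a, key=h_c, value=h_c)
# Equation (4): text feature as query, speech feature as key and value
h_ca, _ = cross_attn_ca(query=h_c, key=h_a, value=h_a)
print(h_ac.shape, h_ca.shape)  # torch.Size([1, 120, 256]) torch.Size([1, 16, 256])
```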
By applying the scheme of the embodiment of the specification, through the calculation of the cross attention mechanism, the first sample text feature can be fused into the first sample voice feature, and the first sample voice feature can be fused into the first sample text feature, so that the voice recognition prediction performance and the keyword prediction performance are improved.
In an optional embodiment of the present disclosure, the speech recognition model further includes a keyword prediction unit; the determining, according to the first fused speech feature and the first fused text feature, the first predicted recognition result and the first predicted keyword corresponding to the first sample speech data may include the following steps:
inputting the first fusion voice characteristic into a decoding unit to determine a first prediction recognition result;
and inputting the first fusion text characteristic into a keyword prediction unit to determine a first predicted keyword.
Specifically, the decoding unit is used for decoding the fused voice features to obtain a prediction recognition result. A keyword prediction (biasing prediction) unit is used to explicitly predict keywords in speech.
In practical application, the first predictive recognition result may be generated by the following formula (5):
Y = Decoder(H_ac)   (5)

wherein Y represents the first prediction recognition result, Decoder represents the decoding operation, and H_ac represents the first fused speech feature.
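For equation (5), a transformer decoder that cross-attends to the fused speech feature H_ac is one possible realization; the sketch below is such an assumption and does not fix the decoder structure of the embodiment.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 5000
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
token_emb = nn.Embedding(vocab_size, d_model)
out_proj = nn.Linear(d_model, vocab_size)

h_ac = torch.randn(1, 120, d_model)                  # first fused speech feature H_ac
prev_tokens = torch.randint(0, vocab_size, (1, 10))  # tokens decoded so far

# Y = Decoder(H_ac): the decoder cross-attends to H_ac and emits token scores
logits = out_proj(decoder(token_emb(prev_tokens), memory=h_ac))
print(logits.shape)  # torch.Size([1, 10, 5000])
```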
By applying the scheme of the embodiment of the specification, the first fusion voice characteristic is input into a decoding unit to determine a first prediction recognition result; and inputting the first fusion text characteristic into a keyword prediction unit to determine a first predicted keyword. By predicting the keywords in the training process, the learning effect and recognition capability of the voice recognition model on the keywords are improved.
In an alternative embodiment of the present specification, the keyword prediction unit includes a self-attention layer, a feedforward layer, a convolution layer, and an output layer; the inputting the first fused text feature into the keyword prediction unit to determine the first predicted keyword may include the following steps:
inputting the first fusion text feature into a self-attention layer and a feedforward layer to perform global feature processing to obtain a global fusion text feature;
Inputting the global fusion text features into a convolution layer to perform local feature processing to obtain local fusion text features;
and inputting the local fusion text features into an output layer to obtain a first prediction keyword.
Specifically, the output layer includes a linear layer and a sigmoid activation function.
It should be noted that both global and local interactions are important in the keyword prediction process. If most of the sub-word units within a keyword are keyword units, local modeling can effectively extract fine-grained local context patterns and determine all sub-word units of the entire keyword as sample associated keywords. For example, "_cross-criterion" is a sub-word unit of the keyword "cross-minimum", and "_cross#ial" helps to determine the sub-word unit "ist". In addition, keywords also carry context that can be exploited by global modeling; for example, "cross-minister" might be used to describe "places", and if the former is a keyword, the latter is likely to be a keyword as well. To integrate these considerations, the keyword prediction unit first establishes the global dependency relationship of the context representation by using the self-attention mechanism in the self-attention layer together with the feedforward layer, so as to obtain the global fusion text feature. The global fusion text feature is then passed to the one-dimensional convolution in the convolution layer, which models the local dependency relationship to obtain the local fusion text feature; the size of the convolution window is set according to the actual situation. Since the number of sub-word units of most keywords does not exceed 5, the convolution window size may be 2, that is, looking two units forward and two units backward. Finally, the local fusion text feature is input into the output layer to obtain the first predicted keyword.
In practical application, the global fusion text feature can be generated through the following formula (6) and the following formula (7), the local fusion text feature can be generated through the following formula (8), and the first prediction keyword can be generated through the following formula (9):
H_att = SelfAttention(H_ca)   (6)

H_ffn = FeedForward(H_att)   (7)

H_cov = Conv(H_ffn)   (8)

α = Sigmoid(Linear(H_cov))   (9)

wherein H_att represents the output of the self-attention layer, H_ffn represents the global fusion text feature output by the feedforward layer, H_cov represents the local fusion text feature output by the convolution layer, SelfAttention represents the operation of the self-attention layer, FeedForward represents the operation of the feedforward layer, Conv represents the operation of the convolution layer, H_ca represents the first fused text feature, α represents the first predicted keyword, Sigmoid represents the activation function, and Linear represents the linear layer operation.
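Equations (6) to (9) can be sketched as a small PyTorch module: self-attention and a feedforward layer for global modeling, a one-dimensional convolution for local modeling, and then a linear layer with a sigmoid activation. The convolution kernel of 5 (two units forward and two backward) follows the window size mentioned above; the remaining sizes are assumptions.

```python
import torch
import torch.nn as nn


class KeywordPredictionUnit(nn.Module):
    """Sketch of the keyword prediction unit of equations (6)-(9)."""

    def __init__(self, d_model=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=5, padding=2)  # window of 2 per side
        self.output = nn.Linear(d_model, 1)  # linear layer + sigmoid form the output layer

    def forward(self, h_ca):
        h_att, _ = self.self_attn(h_ca, h_ca, h_ca)               # equation (6)
        h_ffn = self.feed_forward(h_att)                          # equation (7): global feature
        h_cov = self.conv(h_ffn.transpose(1, 2)).transpose(1, 2)  # equation (8): local feature
        return torch.sigmoid(self.output(h_cov)).squeeze(-1)      # equation (9): alpha


unit = KeywordPredictionUnit()
alpha = unit(torch.randn(1, 16, 256))  # first fused text feature H_ca
print(alpha.shape)  # torch.Size([1, 16]): per-sub-word keyword probability
```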
By applying the scheme of the embodiment of the specification, the first fusion text feature is input into the self-attention layer and the feedforward layer to perform global feature processing, so as to obtain global fusion text features; inputting the global fusion text features into a convolution layer to perform local feature processing to obtain local fusion text features; and the local fusion text features are input into an output layer to obtain a first predicted keyword, so that the learning effect and recognition capability of the speech recognition model on the keyword are improved.
Referring to fig. 4, fig. 4 shows a flowchart of a method for training a speech recognition model according to an embodiment of the present disclosure, where the method for training a speech recognition model is applied to cloud-side equipment, and specifically includes the following steps:
Step 402: and acquiring a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label, and the sample text label comprises a sample identification label and a sample keyword label.
Step 404: and calling a voice coding unit of the voice recognition model, coding the sample voice data to obtain sample voice characteristics, and calling a text coding unit of the voice recognition model, and coding the sample associated text data to obtain sample text characteristics.
Step 406: and calling a first fusion unit of the voice recognition model to fuse the sample text features to the sample voice features to obtain sample fused voice features, and calling a second fusion unit to fuse the sample voice features to the sample text features to obtain sample fused text features.
Step 408: and determining a prediction recognition result and a prediction keyword corresponding to the sample voice data according to the sample fusion voice characteristics and the sample fusion text characteristics.
Step 410: and adjusting model parameters of the voice recognition model according to the sample recognition label, the sample keyword label, the prediction recognition result and the prediction keyword to obtain the trained voice recognition model.
It should be noted that, the implementation manners of steps 402 to 410 are the same as the training manner of the speech recognition model related to the above-mentioned speech recognition method, and the description of the embodiment of the present disclosure is omitted.
In practical application, after the model parameters of the speech recognition model are adjusted according to the sample text labels carried by the sample speech data, the prediction recognition results and the prediction keywords, and the trained speech recognition model is obtained, the model parameters of the trained speech recognition model can be sent to the terminal-side device, so that the terminal-side device builds the speech recognition model locally based on the received model parameters and then performs speech recognition by using the built speech recognition model.
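A minimal sketch of this parameter hand-off, assuming a PyTorch model: the cloud-side device serializes the trained parameters, and the terminal-side device rebuilds the same architecture and loads them. The placeholder class and file name are assumptions, and how the parameters are transmitted is outside the example.

```python
import torch
import torch.nn as nn


class SpeechRecognitionModel(nn.Module):
    """Placeholder standing in for the full speech recognition model."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, 256)

    def forward(self, x):
        return self.proj(x)


# Cloud-side device: export the trained model parameters.
cloud_model = SpeechRecognitionModel()
torch.save(cloud_model.state_dict(), "speech_recognition_model.pt")

# Terminal-side device: rebuild the model locally from the received parameters.
local_model = SpeechRecognitionModel()
local_model.load_state_dict(torch.load("speech_recognition_model.pt", map_location="cpu"))
local_model.eval()  # the locally built model can now be used for speech recognition
```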
By applying the scheme of the embodiment of the specification, the recognition effect of the voice recognition model on the keywords is improved by the mode based on the voice and text joint modeling, the keywords contained in the input text are further predicted in the model training process, the learning effect of the voice recognition model on the keywords is improved, and the model performance of the voice recognition model is improved.
It should be noted that the speech recognition model provided in the embodiments of the present disclosure was evaluated on multiple corpora. On a large-scale audio and video corpus containing slides, the relative Word Error Rate (WER), non-keyword error rate (U-WER) and keyword error rate (B-WER) of the speech recognition model are reduced by 8.4%, 7.7% and 16.3%, respectively, showing improved performance on both non-keywords and keywords. In addition, experimental results on an English corpus show that the relative word error rate, non-keyword error rate and keyword error rate of the speech recognition model are reduced by 23.8%, 19.2% and 35.4%, respectively.
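For reference, keyword (B-WER) and non-keyword (U-WER) error rates are typically computed by aligning the hypothesis with the reference and splitting the reference-word errors by whether the word is in the keyword list. The simplified sketch below makes two assumptions: insertions are not attributed to either class, and keywords are matched by exact lower-case word comparison.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns (reference word, is_error) pairs."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    i, j, labelled = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            labelled.append((ref[i - 1], ref[i - 1] != hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            labelled.append((ref[i - 1], True)); i -= 1   # deletion
        else:
            j -= 1                                        # insertion: no reference word
    return list(reversed(labelled))


def keyword_split_wer(ref, hyp, keywords):
    labelled = align(ref.lower().split(), hyp.lower().split())
    kw = [err for w, err in labelled if w in keywords]
    other = [err for w, err in labelled if w not in keywords]
    b_wer = sum(kw) / max(len(kw), 1)        # keyword error rate (B-WER)
    u_wer = sum(other) / max(len(other), 1)  # non-keyword error rate (U-WER)
    return b_wer, u_wer


print(keyword_split_wer("the meeting with katherine starts now",
                        "the meeting with kathy starts now", {"katherine"}))
# (1.0, 0.0): the keyword was misrecognized, all non-keywords were correct
```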
The following describes a voice recognition method provided in the present specification by taking an application of the voice recognition method in an intelligent conference scenario as an example with reference to fig. 5. Fig. 5 shows a flowchart of a processing procedure of a voice recognition method according to an embodiment of the present disclosure, which specifically includes:
slide acquisition: acquiring a slide from an online conference, wherein the slide comprises associated text data which are synchronized with target voice data in real time;
text data detection and recognition: detecting and identifying associated text data in the slide using optical character recognition techniques;
keyword extraction: extracting keywords from the associated text data to obtain a plurality of target keywords (keyword list);
and (3) voice recognition: and carrying out voice recognition on the plurality of target keywords and the target voice data to obtain a voice recognition result.
By applying the scheme of the embodiment of the specification, through text data detection and recognition, keyword extraction and a multi-modal speech-text voice recognition model, the text data is used to improve voice recognition performance, so that the recognition effect of the voice recognition model on keywords is remarkably improved; and because the keywords provide rich context information, the recognition of other non-keywords is also clearly improved.
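The four steps above can be illustrated with the following schematic pipeline. The OCR step and the recognition step are stubs: a real system would call an optical character recognition engine and the speech recognition model described in this specification, and the keyword-extraction heuristic is only an assumption for the example.

```python
import re


def run_ocr(slide_image_path):
    # Stand-in for text data detection and recognition on a slide image;
    # a real pipeline would invoke an optical character recognition engine here.
    return "Quarterly Review: KATHERINE presents the Hangzhou roadmap"


def extract_keywords(text):
    # Toy keyword extraction: keep capitalized tokens longer than 3 characters as
    # candidate names / technical nouns; real systems would use NER or phrase mining.
    return [w for w in re.findall(r"[A-Za-z]+", text) if w[0].isupper() and len(w) > 3]


def recognize(audio_path, keywords):
    # Placeholder for the speech-text model: the target keywords are concatenated
    # with a separation mark and fed to the text encoding unit as associated text data.
    associated_text = "<sep>" + "<sep>".join(keywords) if keywords else "<sep>"
    return "[transcript of %s biased by %s]" % (audio_path, associated_text)


keywords = extract_keywords(run_ocr("slide_01.png"))  # slide acquisition + OCR + extraction
print(recognize("meeting_audio.wav", keywords))       # speech recognition with target keywords
```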
In the embodiment of the specification, firstly, to address the problem that most traditional voice recognition schemes use the information of the speaker's lip movements in a video while making little use of text information, the voice recognition method provides a scheme for improving audio-video voice recognition performance by using the text information in the video: spoken content is transcribed by using the video, text and audio information together, and the long text available in the video is effectively used as the input of the text encoding unit. Secondly, the voice recognition method provides a hot-word scheme based on joint voice-text modeling, which improves the recognition effect of the model on keywords. In addition, the voice recognition method provides a keyword prediction unit that performs keyword prediction in the process of training the voice recognition model, so as to improve the learning effect and recognition capability of the model on keywords.
Referring to fig. 6, fig. 6 shows a flowchart of a processing procedure of a method for training a speech recognition model according to an embodiment of the present disclosure, and specifically, the speech recognition model includes a speech encoding unit, a text encoding unit, a first fusion unit, a second fusion unit, a timing classification unit, a keyword prediction unit, and a decoding unit; the keyword prediction unit includes a self-attention layer, a feedforward layer, a convolution layer, and an output layer including a linear layer and a sigmoid activation function. The processing procedure of the speech recognition model training method comprises the following steps:
Speech encoding unit and text encoding unit: acquiring a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label, and the sample text label comprises a sample identification label and a sample keyword label; for the first sample voice data, inputting the first sample voice data into the voice encoding unit to obtain the first sample voice feature; extracting keywords from the first sample associated text data corresponding to the first sample voice data to obtain a plurality of first sample associated keywords; connecting the plurality of first sample associated keywords by using a preset separation mark to obtain updated first sample associated keywords; inputting the updated first sample associated keywords into the text encoding unit to obtain the first sample text feature; if the first sample associated keywords are null, inputting the preset separation mark into the text encoding unit;
a first fusion unit: inputting the first sample voice feature and the first sample text feature into a first fusion unit, wherein the first fusion unit takes the first sample voice feature as a target sample query vector, takes the first sample text feature as a target sample key vector and a target sample value vector, and calculates the first fusion voice feature by using a cross attention mechanism;
A second fusion unit: inputting the first sample voice feature and the first sample text feature into a second fusion unit, wherein the first sample text feature is used as a reference sample query vector, the first sample voice feature is used as a reference sample key vector and a reference sample value vector in the second fusion unit, and the first fusion text feature is calculated by using a cross attention mechanism;
a time sequence classification unit: inputting the first fusion voice characteristic into a time sequence classification unit to obtain a first prediction reference matrix;
keyword prediction unit: inputting the first fusion text feature into a self-attention layer and a feedforward layer to perform global feature processing to obtain a global fusion text feature; inputting the global fusion text features into a convolution layer to perform local feature processing to obtain local fusion text features; inputting the local fusion text features into an output layer to obtain a first prediction keyword;
decoding unit: inputting the first prediction reference matrix and the first fusion voice characteristic into a decoding unit to obtain a first prediction recognition result;
loss calculation: calculating speech recognition loss according to a first sample recognition tag carried by first sample speech data and a first prediction recognition result; calculating a keyword prediction loss according to a first sample keyword tag and a first prediction keyword carried by the first sample voice data; calculating a reference matrix prediction loss according to a first sample reference matrix tag and a first prediction reference matrix carried by first sample voice data;
Model parameter adjustment: and adjusting model parameters of the voice recognition model according to the voice recognition loss, the keyword prediction loss and the reference matrix prediction loss to obtain the trained voice recognition model.
By applying the scheme of the embodiment of the specification, the keywords appearing in the speech are explicitly predicted after the text encoding unit through the keyword prediction loss (a binary cross-entropy loss), which allows the text encoding unit to effectively learn the relations among keywords from long-context text content and improves the prediction accuracy of keywords. Since the keyword prediction loss is a classification loss that only judges whether the corresponding word is a keyword, the task is simpler than full recognition, so the model can more easily learn which words are keywords, which further improves the prediction accuracy of the model on keywords and accelerates the convergence rate of the model.
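The three training losses can be sketched as follows: a cross-entropy speech recognition loss on the decoder output, a CTC-style loss for the reference matrix prediction of the timing classification unit, and a binary cross-entropy keyword prediction loss on the keyword probabilities. The tensors are random stand-ins and the loss weights are assumptions; the embodiment only states that the three losses are combined to adjust the model parameters.

```python
import torch
import torch.nn as nn

B, T, U, V, K = 2, 120, 10, 5000, 16  # batch, frames, target length, vocab, keyword units

ce_loss = nn.CrossEntropyLoss()                      # speech recognition loss
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)   # reference matrix prediction loss
bce_loss = nn.BCELoss()                              # keyword prediction loss

dec_logits = torch.randn(B, U, V, requires_grad=True)                      # prediction recognition result
targets = torch.randint(1, V, (B, U))                                      # sample recognition label
ctc_log_probs = torch.randn(B, T, V, requires_grad=True).log_softmax(-1)   # prediction reference matrix
alpha = torch.rand(B, K, requires_grad=True)                               # predicted keyword probabilities
kw_labels = torch.randint(0, 2, (B, K)).float()                            # sample keyword label

loss_asr = ce_loss(dec_logits.reshape(-1, V), targets.reshape(-1))
loss_ctc = ctc_loss(ctc_log_probs.transpose(0, 1), targets,
                    torch.full((B,), T, dtype=torch.long),
                    torch.full((B,), U, dtype=torch.long))
loss_kw = bce_loss(alpha, kw_labels)

total_loss = loss_asr + 0.3 * loss_ctc + 0.5 * loss_kw  # weights are illustrative assumptions
total_loss.backward()  # gradients would drive the model parameter adjustment
print(float(total_loss))
```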
Referring to fig. 7, fig. 7 is an interface schematic diagram of a speech recognition interface according to an embodiment of the present disclosure. The voice recognition interface is divided into a request input interface and a result display interface. The request input interface includes a request input box, a "determine" control, and a "cancel" control. The result display interface comprises a result display frame.
The method comprises the steps that a user inputs a voice recognition request through a request input box displayed by a client, wherein the voice recognition request carries target voice data and associated text data corresponding to the target voice data, a 'determination' control is clicked, a server receives the target voice data sent by the client and the associated text data corresponding to the target voice data, a voice coding unit of a voice recognition model is called, initial voice characteristics are obtained by coding the target voice data, a text coding unit of the voice recognition model is called, the associated text data is coded to obtain initial text characteristics, and the voice recognition model is obtained by training based on voice recognition loss and keyword prediction loss; invoking a first fusion unit of the voice recognition model, and fusing the initial text features to the initial voice features to obtain target voice features; and calling a decoding unit of the voice recognition model, decoding the target voice characteristics to obtain a voice recognition result of the target voice data, and sending the voice recognition result to the client. And the client displays the voice recognition result in a result display frame.
In practical applications, the manner in which the user operates the control includes any manner such as clicking, double clicking, touch control, mouse hovering, sliding, long pressing, voice control or shaking, and the like, and the selection is specifically performed according to the practical situation, which is not limited in any way in the embodiments of the present disclosure.
Corresponding to the above-mentioned voice recognition method embodiment, the present disclosure further provides a voice recognition device embodiment, and fig. 8 shows a schematic structural diagram of a voice recognition device provided in one embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
a first obtaining module 802 configured to obtain target voice data and associated text data corresponding to the target voice data;
the first encoding module 804 is configured to invoke a speech encoding unit of a speech recognition model, encode target speech data to obtain initial speech features, invoke a text encoding unit of the speech recognition model, and encode associated text data to obtain initial text features, wherein the speech recognition model is obtained based on speech recognition loss and keyword prediction loss training;
the first fusion module 806 is configured to invoke a first fusion unit of the speech recognition model to fuse the initial text feature to the initial speech feature, so as to obtain a target speech feature;
the decoding module 808 is configured to invoke a decoding unit of the speech recognition model to decode the target speech feature to obtain a speech recognition result of the target speech data.
Optionally, the first fusion module 806 is further configured to invoke a first fusion unit of the speech recognition model, and calculate the target speech feature using the cross-attention mechanism with the initial speech feature as the target query vector, the initial text feature as the target key vector, and the target value vector.
Optionally, the apparatus further comprises: the classification module is configured to call a time sequence classification unit of the voice recognition model, classify the target voice features to obtain a reference matrix corresponding to the target voice data, wherein the reference matrix characterizes the decoding path distribution probability of the target voice features; the decoding module 808 is further configured to invoke a decoding unit of the speech recognition model to decode the target speech feature based on the reference matrix to obtain a speech recognition result of the target speech data.
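As a rough illustration of how a reference matrix can describe decoding paths, the sketch below greedily reads one path from it in a CTC style, collapsing repeats and dropping the blank symbol. The shapes, the blank index and the greedy read-out are assumptions; the actual decoding unit also conditions on the target speech features.

```python
import torch

# Toy reference matrix: per-frame distribution over decoding symbols (vocab plus blank=0).
ref_matrix = torch.randn(120, 5001).softmax(-1)  # T frames x (vocab + blank), assumed shape

path = ref_matrix.argmax(-1).tolist()  # one greedy decoding path
tokens, prev = [], None
for p in path:
    if p != prev and p != 0:  # drop repeated symbols and the blank symbol
        tokens.append(p)
    prev = p
print(tokens[:10])  # token ids along the recovered decoding path
```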
Optionally, the apparatus further comprises: the computing module is configured to acquire a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label, and the sample text label comprises a sample identification label and a sample keyword label; inputting a plurality of sample voice data and sample associated text data into a voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data; calculating speech recognition loss according to the sample recognition tag and the prediction recognition result; calculating a keyword prediction loss according to the sample keyword label and the predicted keyword; and adjusting model parameters of the voice recognition model according to the voice recognition loss and the keyword prediction loss to obtain the trained voice recognition model.
Optionally, the apparatus further comprises: the extraction module is configured to extract keywords from the sample associated text data to obtain sample associated keywords; the computing module is further configured to input the plurality of sample voice data and the sample association keywords into the voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data.
Optionally, the apparatus further comprises: the first word segmentation module is configured to segment the sample associated keywords and determine a plurality of sub-word units of the sample associated keywords; and updating the sample associated keywords according to the preset separation mark and the plurality of sub-word units to obtain updated sample associated keywords.
Optionally, the apparatus further comprises: the second word segmentation module is configured to segment the sample associated keywords and determine a plurality of sub-word units of the sample associated keywords; randomly selecting at least one target subword unit from the plurality of subword units; and updating the sample associated keywords according to the preset separation mark and at least one target sub-word unit to obtain updated sample associated keywords.
Optionally, the computing module is further configured to call the voice encoding unit for the first sample voice data, encode the first sample voice data to obtain a first sample voice feature, and call the text encoding unit to encode the first sample associated text data corresponding to the first sample voice data to obtain a first sample text feature, where the first sample voice data is any one of the plurality of sample voice data; invoking the first fusion unit to fuse the first sample text feature to the first sample voice feature to obtain a first fused voice feature, and invoking the second fusion unit to fuse the first sample voice feature to the first sample text feature to obtain a first fused text feature; and determining a first prediction recognition result and a first prediction keyword corresponding to the first sample voice data according to the first fused voice feature and the first fused text feature.
Optionally, the computing module is further configured to invoke the first fusion unit, and compute the first fusion voice feature by using the cross attention mechanism with the first sample voice feature as a target sample query vector and the first sample text feature as a target sample key vector and a target sample value vector; and call the second fusion unit, taking the first sample text feature as a reference sample query vector, taking the first sample voice feature as a reference sample key vector and a reference sample value vector, and calculating by using the cross attention mechanism to obtain the first fusion text feature.
Optionally, the speech recognition model further comprises a keyword prediction unit; the computing module is further configured to input the first fusion voice characteristic into the decoding unit to determine a first prediction recognition result; and inputting the first fusion text characteristic into a keyword prediction unit to determine a first predicted keyword.
Optionally, the keyword prediction unit includes a self-attention layer, a feedforward layer, a convolution layer, and an output layer; the computing module is further configured to input the first fusion text feature into the self-attention layer and the feedforward layer to perform global feature processing to obtain a global fusion text feature; inputting the global fusion text features into a convolution layer to perform local feature processing to obtain local fusion text features; and inputting the local fusion text features into an output layer to obtain a first prediction keyword.
According to the scheme applied to the embodiment of the specification, the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training, so that the voice recognition model has high keyword recognition capability, and the target voice features are obtained by fusing initial text features to the initial voice features, so that abundant contextual text information is fused into the target voice features, and the voice recognition performance is improved.
The above is an exemplary embodiment of a speech recognition apparatus of the present embodiment. It should be noted that, the technical solution of the voice recognition device and the technical solution of the voice recognition method belong to the same concept, and details of the technical solution of the voice recognition device, which are not described in detail, can be referred to the description of the technical solution of the voice recognition method.
Corresponding to the above embodiment of the method for training a speech recognition model, the present disclosure further provides an embodiment of a device for training a speech recognition model, and fig. 9 shows a schematic structural diagram of a device for training a speech recognition model according to one embodiment of the present disclosure.
As shown in fig. 9, a speech recognition model training apparatus is applied to cloud-side equipment, the apparatus including:
A second obtaining module 902, configured to obtain a plurality of sample voice data and sample associated text data corresponding to each sample voice data, where the sample voice data carries a sample text label, and the sample text label includes a sample identification label and a sample keyword label;
the second encoding module 904 is configured to invoke a voice encoding unit of the voice recognition model, encode the sample voice data to obtain sample voice features, invoke a text encoding unit of the voice recognition model, and encode the sample associated text data to obtain sample text features;
the second fusion module 906 is configured to invoke the first fusion unit of the speech recognition model, fuse the sample text features to the sample speech features to obtain sample fused speech features, and invoke the second fusion unit, fuse the sample speech features to the sample text features to obtain sample fused text features;
a determining module 908, configured to determine a predicted recognition result and a predicted keyword corresponding to the sample voice data according to the sample fusion voice feature and the sample fusion text feature;
an adjustment module 910 configured to adjust model parameters of the speech recognition model according to the sample recognition tag, the sample keyword tag, the predicted recognition result, and the predicted keyword to obtain a trained speech recognition model.
By means of the method and the device, the recognition effect of the voice recognition model on the keywords is improved through the mode based on voice and text joint modeling, the keywords contained in the input text are further predicted in the model training process, the learning effect of the voice recognition model on the keywords is improved, and the model performance of the voice recognition model is improved.
The foregoing is a schematic scheme of a speech recognition model training apparatus of this embodiment. It should be noted that, the technical solution of the speech recognition model training device and the technical solution of the above-mentioned speech recognition model training method belong to the same concept, and details of the technical solution of the speech recognition model training device which are not described in detail can be referred to the description of the technical solution of the above-mentioned speech recognition model training method.
FIG. 10 illustrates a block diagram of a computing device provided in one embodiment of the present description. The components of the computing device 1000 include, but are not limited to, a memory 1010 and a processor 1020. Processor 1020 is coupled to memory 1010 via bus 1030 and database 1050 is used to store data.
Computing device 1000 also includes access device 1040, which enables computing device 1000 to communicate via one or more networks 1060. Examples of such networks include public switched telephone networks (PSTN, Public Switched Telephone Network), local area networks (LAN, Local Area Network), wide area networks (WAN, Wide Area Network), personal area networks (PAN, Personal Area Network), or combinations of communication networks such as the Internet. The access device 1040 may include one or more of any type of network interface, wired or wireless, for example a network interface card (NIC, Network Interface Card), such as an IEEE 802.11 wireless local area network (WLAN, Wireless Local Area Networks) wireless interface, a worldwide interoperability for microwave access (Wi-MAX, World Interoperability for Microwave Access) interface, an Ethernet interface, a universal serial bus (USB, Universal Serial Bus) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC, Near Field Communication) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 1000, as well as other components not shown in FIG. 10, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 10 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1000 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, personal Computer). Computing device 1000 may also be a mobile or stationary server.
Wherein the processor 1020 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the speech recognition method or the speech recognition model training method described above.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device belongs to the same concept as the technical solutions of the above-mentioned voice recognition method and the voice recognition model training method, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solutions of the above-mentioned voice recognition method or the voice recognition model training method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described speech recognition method or speech recognition model training method.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solutions of the above-mentioned voice recognition method and voice recognition model training method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solutions of the above-mentioned voice recognition method or voice recognition model training method.
An embodiment of the present disclosure further provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the above-described speech recognition method or speech recognition model training method.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solutions of the above-mentioned voice recognition method and voice recognition model training method belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solutions of the above-mentioned voice recognition method or voice recognition model training method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code that may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be increased or decreased appropriately according to the requirements of the patent practice, for example, in some areas, according to the patent practice, the computer readable medium does not include an electric carrier signal and a telecommunication signal.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims (14)

1. A method of speech recognition, comprising:
acquiring target voice data and associated text data corresponding to the target voice data;
invoking a voice coding unit of a voice recognition model to code the target voice data to obtain initial voice characteristics, invoking a text coding unit of the voice recognition model to code the associated text data to obtain initial text characteristics, wherein the voice recognition model is obtained based on voice recognition loss and keyword prediction loss training;
invoking a first fusion unit of the voice recognition model to fuse the initial text feature to the initial voice feature to obtain a target voice feature;
and invoking a decoding unit of the voice recognition model to decode the target voice characteristics to obtain a voice recognition result of the target voice data.
2. The method of claim 1, the invoking the first fusion unit of the speech recognition model to fuse the initial text feature to the initial speech feature to obtain a target speech feature, comprising:
and calling a first fusion unit of the voice recognition model, taking the initial voice feature as a target query vector, taking the initial text feature as a target key vector and a target value vector, and calculating by using a cross attention mechanism to obtain the target voice feature.
3. The method of claim 1, wherein before the invoking the decoding unit of the speech recognition model to decode the target speech feature to obtain a speech recognition result of the target speech data, the method further comprises:
invoking a time sequence classification unit of the voice recognition model, and classifying the target voice features to obtain a reference matrix corresponding to the target voice data, wherein the reference matrix characterizes the decoding path distribution probability of the target voice features;
the decoding unit for calling the voice recognition model decodes the target voice feature to obtain a voice recognition result of the target voice data, and the decoding unit comprises:
and invoking a decoding unit of the voice recognition model, and decoding the target voice features based on the reference matrix to obtain a voice recognition result of the target voice data.
4. The method of claim 1, wherein before the invoking the speech encoding unit of the speech recognition model to encode the target speech data to obtain the initial speech feature, the method further comprises:
acquiring a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label which comprises a sample identification label and a sample keyword label;
Inputting the plurality of sample voice data and the sample associated text data into a voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data;
calculating speech recognition loss according to the sample recognition tag and the prediction recognition result;
calculating a keyword prediction loss according to the sample keyword label and the prediction keyword;
and adjusting model parameters of the voice recognition model according to the voice recognition loss and the keyword prediction loss to obtain a trained voice recognition model.
5. The method according to claim 4, wherein before inputting the plurality of sample voice data and the sample associated text data into a voice recognition model to obtain the predicted recognition result and the predicted keyword corresponding to each sample voice data, the method further comprises:
extracting keywords from the sample associated text data to obtain sample associated keywords;
the inputting the plurality of sample voice data and the sample associated text data into a voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data comprises:
and inputting the plurality of sample voice data and the sample related keywords into a voice recognition model to obtain a prediction recognition result and a prediction keyword corresponding to each sample voice data.
6. The method of claim 5, wherein after the extracting keywords from the sample associated text data to obtain the sample associated keywords, the method further comprises:
performing word segmentation on the sample associated keywords, and determining a plurality of sub-word units of the sample associated keywords;
and updating the sample associated keywords according to a preset separation mark and the plurality of sub-word units to obtain updated sample associated keywords.
7. The method of claim 5, wherein after the extracting keywords from the sample associated text data to obtain the sample associated keywords, the method further comprises:
performing word segmentation on the sample associated keywords, and determining a plurality of sub-word units of the sample associated keywords;
randomly selecting at least one target subword unit from the plurality of subword units;
and updating the sample associated keywords according to a preset separation mark and the at least one target sub-word unit to obtain updated sample associated keywords.
8. The method according to claim 4, wherein inputting the plurality of sample voice data and the sample associated text data into a voice recognition model to obtain a predicted recognition result and a predicted keyword corresponding to each sample voice data, comprises:
Invoking the voice encoding unit for first sample voice data, encoding the first sample voice data to obtain a first sample voice feature, and invoking the text encoding unit for encoding first sample associated text data corresponding to the first sample voice data to obtain a first sample text feature, wherein the first sample voice data is any one of the plurality of sample voice data;
invoking the first fusion unit to fuse the first sample text feature to the first sample voice feature to obtain a first fused voice feature, and invoking a second fusion unit to fuse the first sample voice feature to the first sample text feature to obtain a first fused text feature;
and determining a first prediction recognition result and a first prediction keyword corresponding to the first sample voice data according to the first fusion voice feature and the first fusion text feature.
9. The method of claim 8, the invoking the first fusion unit to fuse the first sample text feature to the first sample voice feature to obtain a first fused voice feature, comprising:
invoking the first fusion unit, taking the first sample voice feature as a target sample query vector, taking the first sample text feature as a target sample key vector and a target sample value vector, and calculating by using a cross attention mechanism to obtain a first fusion voice feature;
The calling a second fusing unit to fuse the first sample voice feature to the first sample text feature to obtain a first fused text feature, including:
and calling a second fusion unit, taking the first sample text feature as a reference sample query vector, taking the first sample voice feature as a reference sample key vector and a reference sample value vector, and calculating by using a cross attention mechanism to obtain the first fusion text feature.
10. The method of claim 8, the speech recognition model further comprising a keyword prediction unit;
the determining, according to the first fused speech feature and the first fused text feature, a first prediction recognition result and a first prediction keyword corresponding to the first sample speech data includes:
inputting the first fusion voice characteristic into the decoding unit, and determining a first prediction recognition result;
and inputting the first fusion text characteristic into the keyword prediction unit to determine a first predicted keyword.
11. The method of claim 10, the keyword prediction unit comprising a self-attention layer, a feed-forward layer, a convolution layer, and an output layer;
the step of inputting the first fused text feature into the keyword prediction unit to determine a first predicted keyword includes:
Inputting the first fusion text feature into the self-attention layer and the feedforward layer to perform global feature processing to obtain a global fusion text feature;
inputting the global fusion text features into the convolution layer to perform local feature processing to obtain local fusion text features;
and inputting the local fusion text features into the output layer to obtain a first prediction keyword.
12. A speech recognition model training method is applied to cloud side equipment and comprises the following steps:
acquiring a plurality of sample voice data and sample associated text data corresponding to each sample voice data, wherein the sample voice data carries a sample text label which comprises a sample identification label and a sample keyword label;
invoking a voice encoding unit of a voice recognition model, encoding the sample voice data to obtain sample voice characteristics, invoking a text encoding unit of the voice recognition model, and encoding the sample associated text data to obtain sample text characteristics;
invoking a first fusion unit of the voice recognition model, fusing the sample text features to the sample voice features to obtain sample fusion voice features, and invoking a second fusion unit, fusing the sample voice features to the sample text features to obtain sample fusion text features;
According to the sample fusion voice characteristics and the sample fusion text characteristics, determining a prediction recognition result and a prediction keyword corresponding to the sample voice data;
and adjusting model parameters of the voice recognition model according to the sample recognition label, the sample keyword label, the prediction recognition result and the prediction keyword to obtain a trained voice recognition model.
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer executable instructions, the processor being configured to execute the computer executable instructions, which when executed by the processor, implement the steps of the method of any one of claims 1 to 11 or claim 12.
14. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the method of any one of claims 1 to 11 or claim 12.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination