CN113889089A

CN113889089A - Method and device for acquiring voice recognition model, electronic equipment and storage medium

Info

Publication number: CN113889089A
Application number: CN202111150454.3A
Authority: CN
Inventors: 赵情恩
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2022-01-04

Abstract

The disclosure provides a method and a device for acquiring a speech recognition model, an electronic device, and a storage medium, and relates to the fields of natural speech understanding, speech technology, intelligent customer service, and speech transcription. The specific implementation scheme is as follows: acquiring multiple groups of label data, wherein each group of label data comprises audio sample data of sample objects and a sample object set obtained by performing feature vector extraction on the audio sample data, and the audio sample data comprises the dialogue content of a plurality of sample objects; and training a neural network model through machine learning using the multiple groups of label data to obtain a speech recognition model. The method and the device solve the technical problem that speech recognition models in the related art have a poor speech recognition effect.

Description

Method and device for acquiring voice recognition model, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of natural speech understanding, speech technology, intelligent customer service, and speech transcription, and more specifically to a method and an apparatus for acquiring a speech recognition model, an electronic device, and a storage medium.

Background

Most common speech recognition methods in the related art first separate the speakers in the audio and then transcribe the separated audio to obtain the text attributed to each speaker.

However, when speakers overlap, existing speaker recognition and speech transcription systems achieve unsatisfactory speech separation and recognition accuracy. In addition, the number of speakers must be set in advance to determine the number of branches of the network, so the speech recognition effect is poor when the number of speakers changes.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The disclosure provides an acquisition method for a voice recognition model, a voice recognition method, a voice recognition device, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided a method for acquiring a speech recognition model, including: acquiring multiple groups of label data, wherein each group of label data comprises audio sample data of sample objects and a sample object set obtained by performing feature vector extraction on the audio sample data, and the audio sample data comprises the dialogue content of a plurality of sample objects; and training a neural network model through machine learning using the multiple groups of label data to obtain a speech recognition model.

Optionally, the method further includes: extracting the feature vector of the audio sample data by adopting a multilayer time delay neural network, and extracting to obtain a first feature vector of the sample object through multiple training iterations; selecting a preset number of the first feature vectors; and generating the sample object set according to the selected first feature vector.

Optionally, the speech recognition model is obtained by training in the following way: extracting a plurality of second feature vectors in the audio sample data by adopting an object feature encoder in the neural network model; extracting a plurality of third feature vectors in the audio sample data by adopting a content feature encoder; and training the neural network model based on the second feature vector and the third feature vector to obtain the voice recognition model.

Optionally, the extracting, by using the object feature encoder in the neural network model, a plurality of second feature vectors in the audio sample data includes: performing framing processing on the audio sample data in the neural network model to obtain a plurality of audio frames; extracting a normal distribution feature of each of the plurality of audio frames, wherein the normal distribution feature includes: static features, first order difference features, second order difference features; and inputting the normal distribution characteristics of the plurality of audio frames to the object characteristic encoder to obtain a plurality of second characteristic vectors.

Optionally, the training the neural network model based on the second feature vector and the third feature vector to obtain the speech recognition model includes: calculating a first importance coefficient corresponding to each second feature vector and a second importance coefficient corresponding to each third feature vector by using an attention module in the neural network model; calculating a fourth feature vector based on the second feature vector and the first importance coefficient, and calculating a fifth feature vector based on the third feature vector and the second importance coefficient; and training the neural network model based on the fourth feature vector and the fifth feature vector to obtain the voice recognition model.

Optionally, the method further includes: processing the first decoded text and the fourth feature vector by adopting a target query model in the neural network model to obtain a sixth feature vector of the sample object; calculating correlation degree values between the sixth feature vector and a plurality of sample objects in the sample object set by using the attention module; determining a seventh feature vector of the sample object set based on the correlation metric value.

Optionally, the training the neural network model based on the fourth feature vector and the fifth feature vector to obtain the speech recognition model includes: acquiring an eighth feature vector output by a semantic decoder in the neural network model after processing the first decoded text; decoding the fifth feature vector, the seventh feature vector and the eighth feature vector by using a content decoder in the neural network model to obtain a second decoded text, wherein the second decoded text is the decoded text at the moment following the first decoded text; and calculating the cross entropy loss between the first decoded text and the second decoded text by adopting a minimum classification error algorithm, so as to update the network parameters of the neural network model and obtain the speech recognition model.

According to an aspect of the present disclosure, there is provided a speech recognition method including: acquiring audio data to be recognized, wherein the audio data to be recognized comprises conversation contents of a plurality of target objects; inputting the audio data to be recognized into a speech recognition model, wherein the speech recognition model is obtained by training a neural network model through machine learning by using a plurality of groups of tag data, and each group of data in the plurality of groups of tag data comprises: audio sample data of the sample object, and a sample object set obtained by extracting and processing the characteristic vector of the audio sample data; and receiving a voice recognition processing result returned by the voice recognition model, wherein the voice recognition processing result distinguishes the audio content of each target object and the character information corresponding to the audio content.

Optionally, the acquiring the audio data to be identified includes: acquiring initial audio data; preprocessing the initial audio data to obtain the audio data to be identified, wherein the preprocessing comprises at least one of the following steps: removing silence, data enhancement, changing audio rate, time warping, frequency masking, text corpus processing.
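Two of the preprocessing steps listed above, removing silence and frequency masking, can be illustrated with a minimal Python sketch. The disclosure does not specify how either step is carried out; the energy threshold, the function names `remove_silence` and `frequency_mask`, and the toy inputs are assumptions made only for illustration.

```python
def remove_silence(frames, energy_threshold):
    """Keep only frames whose mean squared amplitude reaches the threshold.

    Assumption: silence is detected by a simple per-frame energy test.
    """
    kept = []
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            kept.append(frame)
    return kept


def frequency_mask(spectrogram, start_bin, width):
    """Return a copy of the spectrogram with `width` consecutive frequency
    bins, starting at `start_bin`, set to zero in every frame."""
    masked = [list(frame) for frame in spectrogram]
    for frame in masked:
        for b in range(start_bin, min(start_bin + width, len(frame))):
            frame[b] = 0.0
    return masked


# Hypothetical toy inputs.
frames = [[0.0, 0.0], [1.0, -1.0], [0.01, 0.0]]
voiced = remove_silence(frames, energy_threshold=0.1)

spec = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
masked = frequency_mask(spec, start_bin=1, width=2)
```

In this sketch the mask position is passed in explicitly; a data-augmentation pipeline would typically draw it at random for each utterance.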

According to another aspect of the present disclosure, there is provided an apparatus for acquiring a speech recognition model, including: an obtaining unit, configured to obtain multiple sets of tag data, where each set of tag data includes audio sample data of sample objects and a sample object set obtained by performing feature vector extraction on the audio sample data, and the audio sample data includes the dialogue content of a plurality of sample objects; and a training unit, configured to train a neural network model through machine learning using the multiple sets of tag data to obtain a speech recognition model.

According to another aspect of the present disclosure, there is provided a voice recognition apparatus including: an acquisition module, configured to acquire audio data to be recognized, where the audio data to be recognized includes conversation contents of a plurality of target objects; a transmission module, configured to input the audio data to be recognized into a speech recognition model, where the speech recognition model is obtained by training a neural network model through machine learning using multiple sets of tag data, and each set of tag data includes audio sample data of sample objects and a sample object set obtained by performing feature vector extraction on the audio sample data; and a receiving module, configured to receive a voice recognition processing result returned by the speech recognition model, where the voice recognition processing result distinguishes the audio content of each target object and the text information corresponding to the audio content.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executable by the at least one processor to enable the at least one processor to perform any one of the above-described speech recognition model acquisition methods or the above-described speech recognition method.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute any one of the above-described methods for acquiring a speech recognition model or the above-described speech recognition method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements any one of the above-mentioned methods of obtaining a speech recognition model, or the above-mentioned speech recognition method.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flowchart illustrating steps of a method for obtaining a speech recognition model according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a multi-speaker separated speech recognition system according to a first embodiment of the present disclosure;

FIG. 3 is a schematic flow chart of steps of a speech recognition method according to a first embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of an apparatus for acquiring a speech recognition model according to a second embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of a speech recognition apparatus according to a second embodiment of the present disclosure;

FIG. 6 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

In accordance with an embodiment of the present disclosure, an embodiment of a method for obtaining a speech recognition model is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.

Fig. 1 is a schematic flowchart of steps of a method for acquiring a speech recognition model according to a first embodiment of the present disclosure, as shown in fig. 1, the method includes the following steps:

step S102, obtaining multiple groups of label data, wherein each group of label data comprises audio sample data of sample objects and a sample object set obtained by performing feature vector extraction on the audio sample data, and the audio sample data comprises the dialogue content of a plurality of sample objects;

and step S104, training a neural network model through machine learning by using a plurality of groups of label data to obtain a voice recognition model.

In the embodiment of the present disclosure, a sample object set is obtained by obtaining a plurality of sets of tag data, extracting audio sample data of a sample object in the plurality of sets of tag data, and performing feature vector extraction processing on the audio sample data; and training a neural network model through machine learning by using a plurality of groups of label data to obtain a voice recognition model.

The sample object set, i.e., the speaker vector set, is obtained by performing feature vector extraction processing on audio sample data corresponding to a plurality of sample objects.

In the embodiment of the present disclosure, multiple groups of label data are acquired, wherein each group of label data comprises audio sample data of sample objects and a sample object set obtained by performing feature vector extraction on the audio sample data, and the audio sample data comprises the dialogue content of a plurality of sample objects; a neural network model is then trained through machine learning using the multiple groups of label data to obtain a speech recognition model.

With this approach, the accuracy of speaker feature extraction is further improved. In addition, the whole system has an end-to-end structure, so the training and testing targets are consistent and errors that a mismatch might cause are reduced.

The audio sample data in the multiple groups of label data used to train the speech recognition model comprises the dialogue content of a plurality of sample objects. Therefore, once the speech recognition model is obtained, it can accurately extract the feature vector data of different speakers in the audio data to be recognized, so as to distinguish the audio content of different speakers and the text information corresponding to that audio content. This achieves the technical effect of using the feature vector data to assist both audio recognition and speaker recognition, improves the speech recognition accuracy and speech separation accuracy of the model, and solves the technical problem that speech recognition models in the related art have a poor speech recognition effect.

As an alternative embodiment, the speech recognition model is a multi-speaker separated speech recognition model, and mainly includes two parts: an end-to-end module for speaker separation in conjunction with semantic information, and an end-to-end speech recognition module that combines voiceprint features and semantic features.

Optionally, FIG. 2 shows a schematic diagram of the structure of the multi-speaker separated speech recognition system, in which: X represents the audio to be recognized that is input to the network; D represents a set of pre-trained speaker vectors, each of which represents one speaker; d_n represents the corresponding speaker vector; β_{n,k} represents the coefficient of similarity between the feature q_n and the k-th feature in the speaker set D; α_n represents the importance of each feature vector; H^enc represents the content feature vectors; H^spk represents the speaker feature vectors; Y represents the decoded output text; y_n represents the currently decoded text result; y_{n-1} represents the text result decoded at the previous moment; c_n represents the content context feature vector; p_n represents the high-level feature vector of the speaker; u_n represents the semantic feature vector; and q_n represents the speaker query vector.

Optionally, in the end-to-end module for speaker separation with joint semantic information, the speaker feature encoder SpeakerEncoder consists of a 5-layer time delay neural network (TDNN), the speaker query model SpeakerQueryRNN consists of a 1-layer long short-term memory (LSTM) network, and the speaker set attention model InventoryAttention consists of a single-head attention layer.

Optionally, the end-to-end module accurately characterizes the speakers in the audio. The audio passes through the speaker feature encoder to obtain a plurality of speaker feature vectors H^spk; an attention module then calculates the importance of each vector, and the weighted vectors are summed to obtain a vector p_n. The vector p_n is combined with the text information obtained from the speech recognition content decoding model DecoderOut and passed through the speaker query model to obtain a speaker query vector q_n. The attention module then combines q_n with the given speaker vector set D to calculate the correlation between q_n and each sample in D, and the final speaker feature vector d_n is synthesized according to these coefficients. The set D is trained in advance: a multilayer time delay neural network is trained over a plurality of iterations, and its intermediate hidden layer is then extracted as the speaker feature vector; the set consists of M representative speakers. With different coefficients, the whole set can synthesize feature vectors d_n for different speaker characteristics, so in actual use the number of speakers does not need to be set in advance to determine the number of branches of the network.

It should be noted that, when the attention module calculates the importance of each vector, it computes, according to the attention mechanism, an importance coefficient for each vector, that is, the influence of the vector on the feature, expressed as a floating-point coefficient such as 0.1. Each coefficient is multiplied by the corresponding feature vector in H^spk, and the products are summed to obtain p_n. When synthesizing the final speaker feature vector d_n, the attention mechanism is used to calculate coefficients between the speaker query vector q_n and the vectors in the set D, and each coefficient is multiplied by the corresponding vector in D to obtain the speaker feature vector d_n.
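The weighted summation that yields p_n can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the disclosure does not say how the importance coefficients are scored, so the dot-product scoring, the softmax normalization, the function names, and the toy vectors below are all illustrative choices rather than the patented method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, vectors):
    """Return (pooled, coeffs): each vector is weighted by the softmax of its
    dot product with the query (assumed scoring), then the weighted vectors
    are summed."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    coeffs = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(c * vec[i] for c, vec in zip(coeffs, vectors))
              for i in range(dim)]
    return pooled, coeffs


# Hypothetical toy data: three 2-dimensional speaker feature vectors H^spk.
h_spk = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query gives every vector the same coefficient
p_n, alpha_n = attention_pool(query, h_spk)
```

The coefficients always sum to one, so p_n stays within the range spanned by the input vectors regardless of how many frames the encoder produces.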

Optionally, in the end-to-end speech recognition module combining voiceprint features and semantic features, the audio content feature encoder AsrEncoder consists of a 5-layer long short-term memory network BLSTM, the semantic decoding model DecoderRNN consists of a 2-layer LSTM, the content decoding model DecoderOut consists of a 1-layer LSTM, and the attention module consists of a multi-head attention layer.

Optionally, the core of the module is to accurately recognize the text content in the audio. The audio passes through the audio content feature encoder AsrEncoder to obtain a plurality of content feature vectors H^enc; an attention module then calculates the importance of each vector, and the weighted vectors are summed to obtain a vector c_n. The vector c_n is combined with the vector u_n output by the semantic decoding model and the speaker feature vector d_n to jointly decode the text information y_n. In the text information, the sc mark denotes a speaker change point, and eos denotes the end of a sentence. The loss function for training the whole network is MCE (Minimum cross entropy). The network is built with the long short-term memory network BLSTM, which can learn correlations between features over larger time spans; the attention module attends to the relative importance of the features, highlighting and enhancing the more important ones so that more discriminative features are extracted.

It should be noted that, after the attention module calculates the importance of each vector and the weighted vectors are summed to obtain the vector c_n, joint decoding means that the vector c_n, the vector u_n and the speaker vector d_n are spliced and weighted-summed, and the text information y_n is obtained by decoding them together.
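The data flow described for the two modules can be summarized for one decoding step n as the following set of relations. This is a notational summary only: the superscripts on α (to tell the speaker-side coefficients from the content-side ones), the frame index t, and the generic function Attention are assumptions introduced here, and the exact form of each function is not specified in the disclosure.

```latex
\begin{aligned}
p_n &= \sum_{t} \alpha^{spk}_{n,t}\, h^{spk}_{t}
      && \text{(speaker high-level feature vector)} \\
q_n &= \mathrm{SpeakerQueryRNN}(y_{n-1},\, p_n)
      && \text{(speaker query vector)} \\
\beta_{n,k} &= \mathrm{Attention}(q_n,\, D_k), \quad k = 1,\dots,M
      && \text{(similarity to the $k$-th entry of } D) \\
d_n &= \sum_{k=1}^{M} \beta_{n,k}\, D_k
      && \text{(synthesized speaker feature vector)} \\
c_n &= \sum_{t} \alpha^{enc}_{n,t}\, h^{enc}_{t}
      && \text{(content context feature vector)} \\
u_n &= \mathrm{DecoderRNN}(y_{n-1})
      && \text{(semantic feature vector)} \\
y_n &= \mathrm{DecoderOut}(c_n,\, d_n,\, u_n)
      && \text{(decoded text at step } n)
\end{aligned}
```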

In an optional embodiment, the method further includes:

step S202, extracting the feature vector of the audio sample data by adopting a multilayer time delay neural network, and extracting to obtain a first feature vector of the sample object through multiple training iterations;

step S204, selecting a preset number of the first feature vectors;

step S206, generating the sample object set according to the selected first feature vectors.

In the embodiment of the disclosure, a multilayer time delay neural network (TDNN) is used to perform feature vector extraction on the collected audio sample data, and the first feature vectors of the sample objects are obtained through multiple training iterations; a preset number of the first feature vectors are then selected, and the sample object set is generated from the selected first feature vectors.

It should be noted that the sample object set is trained in advance, and the set is composed of representative M speakers; the whole set can synthesize the feature vectors of different speaker characteristics through different coefficients, so that the number of speakers does not need to be set in advance in the actual use process to determine the branch number of the network.

In an alternative embodiment, the speech recognition model is trained as follows:

step S302, extracting a plurality of second feature vectors in the audio sample data by adopting an object feature encoder in the neural network model;

step S304, a content feature encoder is adopted to extract a plurality of third feature vectors in the audio sample data;

step S306, training the neural network model based on the second feature vector and the third feature vector to obtain the speech recognition model.

In the embodiment of the disclosure, the object feature encoder (speaker feature encoder) SpeakerEncoder and the content feature encoder AsrEncoder in the neural network model are used to extract, from the audio sample data, the second feature vectors, namely the speaker feature vectors H^spk, and the third feature vectors, namely the content feature vectors H^enc; the neural network model is then trained based on the second feature vectors and the third feature vectors to obtain the speech recognition model.

In an optional embodiment, the extracting, by using the object feature encoder in the neural network model, a plurality of second feature vectors in the audio sample data includes:

step S402, performing framing processing on the audio sample data in the neural network model to obtain a plurality of audio frames;

step S404, extracting a normal distribution feature of each of the plurality of audio frames, where the normal distribution feature includes: static features, first order difference features, second order difference features;

step S406, inputting the normal distribution features of the plurality of audio frames to the object feature encoder to obtain a plurality of second feature vectors.

In the embodiment of the disclosure, the audio is framed, a normal distribution feature is extracted from each audio frame, and the normal distribution features of the plurality of audio frames are input into the object feature encoder to obtain a plurality of second feature vectors, namely the speaker feature vectors H^spk.
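The framing step can be sketched as follows. The disclosure gives no frame length or hop size, so `frame_len`, `hop_len`, and the policy of dropping a trailing partial frame are assumptions made for illustration.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split a 1-D sample sequence into overlapping frames.

    A trailing partial frame is dropped (an assumed policy; padding it
    would be an equally reasonable choice).
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# Hypothetical toy signal of 10 samples, frames of 4 samples, hop of 2.
samples = list(range(10))
frames = frame_signal(samples, frame_len=4, hop_len=2)
```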

It should be noted that the audio is framed and 80-dimensional speech feature parameters such as MFCC, PLP, or Fbank are extracted from each frame; first-order and second-order differences are then computed and the features are normalized, that is, regularized to a normal distribution. As a result, 80 × 3-dimensional features are extracted from each frame of audio, which are respectively called the static, first-order difference, and second-order difference features.
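A minimal sketch of how 80-dimensional static features could be expanded to 80 × 3 dimensions is given below. The two-point central difference, the edge handling, and the per-dimension mean and variance normalization are assumptions; the disclosure does not state which difference formula or normalization is used, and the toy input stands in for real MFCC, PLP, or Fbank features.

```python
import math


def delta(seq):
    """First-order difference along time, using a two-point central
    difference with the edge frames repeated (assumed formula)."""
    n = len(seq)
    out = []
    for t in range(n):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def normalize(seq):
    """Per-dimension mean and variance normalization over time; dimensions
    with zero variance are only mean-centered."""
    n, dim = len(seq), len(seq[0])
    means = [sum(f[i] for f in seq) / n for i in range(dim)]
    stds = [math.sqrt(sum((f[i] - means[i]) ** 2 for f in seq) / n) or 1.0
            for i in range(dim)]
    return [[(f[i] - means[i]) / stds[i] for i in range(dim)] for f in seq]


def stack_features(static):
    """Concatenate static, first-order and second-order difference features
    per frame, then normalize."""
    d1 = delta(static)
    d2 = delta(d1)
    stacked = [s + a + b for s, a, b in zip(static, d1, d2)]
    return normalize(stacked)


# Hypothetical toy input: 5 frames of 80-dimensional static features.
static = [[float(t + i) for i in range(80)] for t in range(5)]
features = stack_features(static)
```

Each output frame has 240 values: indices 0 to 79 hold the static part, 80 to 159 the first-order differences, and 160 to 239 the second-order differences.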

In an optional embodiment, the training the neural network model based on the second feature vector and the third feature vector to obtain the speech recognition model includes:

step S502, calculating a first importance coefficient corresponding to each second feature vector and a second importance coefficient corresponding to each third feature vector by using an attention module in the neural network model;

step S504, calculating a fourth feature vector based on the second feature vector and the first importance coefficient, and calculating a fifth feature vector based on the third feature vector and the second importance coefficient;

step S506, training the neural network model based on the fourth feature vector and the fifth feature vector to obtain the speech recognition model.

In the embodiment of the disclosure, the attention module in the neural network model is used to calculate the first importance coefficient corresponding to each speaker feature vector and the second importance coefficient corresponding to each content feature vector; the fourth feature vector p_n, namely the high-level feature vector of the speaker, is calculated based on the speaker feature vectors and the first importance coefficients; the fifth feature vector c_n, namely the content context feature vector, is calculated based on the content feature vectors and the second importance coefficients; and the neural network model is trained based on the fourth feature vector and the fifth feature vector to obtain the speech recognition model.

The first importance coefficient corresponds to an importance coefficient of a speaker feature vector, and the second importance coefficient corresponds to an importance coefficient of a content feature vector.

In an optional embodiment, the method further includes:

step S602, processing the first decoded text and the fourth feature vector by using a target query model in the neural network model to obtain a sixth feature vector of the sample object;

step S604, calculating correlation degree values between the sixth feature vector and a plurality of sample objects in the sample object set by using the attention module;

step S606, determining a seventh feature vector of the sample object set based on the correlation degree values.

In the embodiment of the present disclosure, a target query model (speaker query model) in the neural network model is used to process the text result decoded at the previous moment and the high-level feature vector of the speaker, so as to obtain the sixth feature vector of the sample object, namely the speaker query vector q_n.

Optionally, the attention module is used to calculate correlation degree values between the speaker query vector and the plurality of sample objects in the sample object set: the attention mechanism is used to calculate coefficients between the speaker query vector q_n and the vectors in the set D, and each coefficient is multiplied by the corresponding vector in D to obtain the seventh feature vector, namely the speaker feature vector d_n.
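The synthesis of d_n from the set D can be sketched as below, mainly to show that the same computation applies to a set of any size M, which is why no per-speaker network branch is needed. Dot-product scoring with softmax normalization, the function name `inventory_attention`, and the toy vectors are assumptions; the disclosure only states that correlation coefficients are computed by an attention mechanism.

```python
import math


def inventory_attention(q_n, inventory):
    """Return (d_n, beta_n): a speaker vector synthesized from the set D,
    and the coefficient assigned to each entry of D.

    Assumption: coefficients are the softmax of dot products with q_n.
    """
    scores = [sum(a * b for a, b in zip(q_n, d_k)) for d_k in inventory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    beta_n = [e / total for e in exps]
    dim = len(inventory[0])
    d_n = [sum(b * d_k[i] for b, d_k in zip(beta_n, inventory))
           for i in range(dim)]
    return d_n, beta_n


# Hypothetical set D with M = 4 representative speakers (3-dimensional).
D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
q_n = [8.0, 0.0, 0.0]  # query closest to the first entry
d_n, beta_n = inventory_attention(q_n, D)
closest = max(range(len(D)), key=lambda k: beta_n[k])

# The same function works unchanged for a different set size (M = 2).
d_small, beta_small = inventory_attention(q_n, D[:2])
```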

In an optional embodiment, the training the neural network model based on the fourth feature vector and the fifth feature vector to obtain the speech recognition model includes:

step S702, acquiring an eighth feature vector output by a semantic decoder in the neural network model after processing the first decoded text;

step S704, decoding the fifth feature vector, the seventh feature vector, and the eighth feature vector by using a content decoder in the neural network model to obtain a second decoded text, where the second decoded text is a decoded text at a next time of the first decoded text;

step S706, calculating the cross entropy loss between the first decoded text and the second decoded text by using a minimum classification error algorithm, so as to update the network parameters of the neural network model, thereby obtaining the speech recognition model.

In the embodiment of the present disclosure, the semantic decoding model DecoderRNN in the neural network model processes the text y_{n-1} decoded at the previous moment and outputs the eighth feature vector, namely the semantic feature vector u_n; the content decoding model DecoderOut in the neural network model decodes the fifth feature vector c_n, the seventh feature vector d_n and the eighth feature vector u_n to obtain a second decoded text y_n, wherein the second decoded text is the decoded text at the moment following the text y_{n-1}; the minimum cross entropy loss is calculated through the minimum classification error algorithm MCE, the error is back-propagated, and the network parameters are updated to obtain the speech recognition model.
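The cross entropy computed at each decoding step can be illustrated with a short sketch. It shows the standard token-level cross entropy over a softmax distribution, averaged over steps; it does not reproduce the parameter update, and the vocabulary, logits, and target indices are hypothetical values chosen for illustration.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target token under softmax(logits),
    computed with the log-sum-exp trick for numerical stability."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


def sequence_loss(step_logits, target_ids):
    """Average cross entropy over all decoding steps."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_ids)]
    return sum(losses) / len(losses)


# Hypothetical 4-token vocabulary and two decoding steps.
vocab = ["hello", "world", "<sc>", "<eos>"]
step_logits = [[2.0, 0.0, 0.0, 0.0],   # step 1 favours "hello"
               [0.0, 0.0, 0.0, 0.0]]   # step 2 is uniform
targets = [0, 3]                        # reference: "hello", "<eos>"
loss = sequence_loss(step_logits, targets)
```

A uniform prediction over four tokens costs log 4 per step, and the loss falls as the logit of the reference token rises, which is the signal that back-propagation uses to update the network parameters.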

Fig. 3 is a schematic flowchart of the steps of a speech recognition method according to a first embodiment of the present disclosure. As shown in fig. 3, the method includes the following steps:

step S802, acquiring audio data to be recognized, wherein the audio data to be recognized comprises conversation contents of a plurality of target objects;

step S804, inputting the audio data to be recognized into a speech recognition model, where the speech recognition model is obtained by training a neural network model through machine learning using multiple sets of tag data, and each set of data in the multiple sets of tag data includes: audio sample data of the sample object, and a sample object set obtained by extracting and processing the characteristic vector of the audio sample data;

step S806, receiving a speech recognition processing result returned by the speech recognition model, wherein the speech recognition processing result distinguishes audio content of each target object and text information corresponding to the audio content.

In the embodiment of the present disclosure, a voice recognition processing result is obtained by acquiring audio data to be recognized and transmitting the audio data to be recognized to a voice recognition model and processing the audio data to be recognized by using the voice recognition model.

It should be noted that the audio data to be recognized includes the dialog contents of a plurality of target objects; the speech recognition model is obtained by training a neural network model through machine learning by using a plurality of groups of label data, and each group of data in the plurality of groups of label data comprises: audio sample data of the sample object, and a sample object set obtained by extracting and processing the characteristic vector of the audio sample data; the voice recognition processing result distinguishes the audio content of each target object and the character information corresponding to the audio content.
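The form of the recognition result is not specified in the disclosure. As one possible representation, the sketch below assumes the model returns time-stamped segments, each tagged with a target object, and groups the text per target object. The Segment type, its field names, and group_by_speaker are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized stretch of speech (hypothetical result format)."""
    speaker: str   # identifier of the target object
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # text information for this audio content


def group_by_speaker(segments):
    """Collect each target object's text in chronological order."""
    grouped = defaultdict(list)
    for seg in sorted(segments, key=lambda s: s.start):
        grouped[seg.speaker].append(seg.text)
    return {speaker: " ".join(texts) for speaker, texts in grouped.items()}


segments = [
    Segment("spk1", 0.0, 1.0, "hello"),
    Segment("spk2", 1.0, 2.0, "hi"),
    Segment("spk1", 2.0, 3.0, "there"),
]
result = group_by_speaker(segments)
```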

In an optional embodiment, the acquiring the audio data to be recognized includes:

step S902, obtaining initial audio data;

step S904, preprocessing the initial audio data to obtain the audio data to be recognized, where the preprocessing includes at least one of: removing silence, data enhancement, changing audio rate, time warping, frequency masking, text corpus processing.
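Frequency masking is listed among the preprocessing options without further detail. The sketch below shows one common way to do it: zeroing a randomly placed band of frequency bins in a time-by-frequency feature matrix. The matrix layout, the parameter max_width, and the function name are assumptions.

```python
import random


def frequency_mask(spec, max_width, rng):
    """Zero a random band of frequency bins in every frame.

    `spec` is a list of frames, each a list of per-bin values.
    Returns a new matrix plus the (start, width) of the masked band.
    """
    n_bins = len(spec[0])
    width = rng.randint(0, min(max_width, n_bins))
    start = rng.randint(0, n_bins - width)
    masked = [
        [0.0 if start <= b < start + width else v for b, v in enumerate(frame)]
        for frame in spec
    ]
    return masked, (start, width)


# Four frames with eight frequency bins each; a seeded generator keeps
# the augmentation reproducible.
spec = [[1.0] * 8 for _ in range(4)]
masked, (start, width) = frequency_mask(spec, 3, random.Random(0))
```

The input matrix is left unmodified, so the same recording can be augmented several times with different masks.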

In the embodiment of the disclosure, a certain amount of multi-person conversation audio data is collected as the input audio of the network module; this audio contains a small amount of overlapping speech (aliasing). Speaker labeling is performed on the audio, and the corresponding text content is obtained. The audio is then preprocessed, including removing silence and performing data enhancement, for example mixing in other voices at low volume or environmental noise, changing the audio rate, time warping, and frequency masking. Object labeling is performed on the initial audio data to obtain a text corpus of the initial audio data. The text corpus is preprocessed, for example by token cleaning and normalization, including removing special symbols such as @ and %, normalizing numeric and unit symbols such as 150, 2010, and kg, and performing word segmentation. The semantic decoding model is then pre-trained.
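The text corpus cleaning can be sketched as follows. The exact symbol list and the word segmentation method are not given in the disclosure, so the character set in SPECIAL and the whitespace split (a stand-in for a real word segmenter) are assumptions. Handling of numeric and unit symbols is left out because the target form is not specified.

```python
import re

# Hypothetical set of special symbols to strip; the source names only @ and %.
SPECIAL = re.compile(r"[@%#&*^~]+")


def clean_text(line):
    """Remove special symbols, then split into tokens on whitespace."""
    line = SPECIAL.sub(" ", line)
    return line.split()


tokens = clean_text("call @me 50% now")
```

For Chinese text, the whitespace split would need to be replaced by a dedicated word segmentation tool.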

Example 2

According to an embodiment of the present disclosure, there is further provided an apparatus embodiment for implementing the foregoing speech recognition method. Fig. 4 is a schematic structural diagram of an apparatus for acquiring a speech recognition model according to a second embodiment of the present disclosure. As shown in fig. 4, the apparatus for acquiring a speech recognition model includes: an acquisition unit 40 and a training unit 42, wherein:

an acquisition unit 40, configured to acquire multiple sets of label data, where each set of data in the multiple sets of label data includes: audio sample data of a sample object, and a sample object set obtained by performing feature vector extraction on the audio sample data, wherein the audio sample data comprises the dialogue contents of a plurality of sample objects;

and the training unit 42 is used for training the neural network model through machine learning by using the plurality of groups of label data to obtain a voice recognition model.

It should be noted that the above modules may be implemented by software or hardware. For the latter, for example, the modules may all be located in the same processor; alternatively, the modules may be located in different processors in any combination.

It should be noted here that the above-mentioned acquisition unit 40 and training unit 42 correspond to steps S102 to S104 in embodiment 1. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should be noted that the modules described above may be implemented in a computer terminal as part of an apparatus.

It should be noted that, reference may be made to the relevant description in embodiment 1 for alternative or preferred embodiments of this embodiment, and details are not described here again.

The speech recognition apparatus may further include a processor and a memory, the obtaining unit 40, the training unit 42, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.

According to an embodiment of the present disclosure, there is further provided an apparatus embodiment for implementing the foregoing speech recognition method. Fig. 5 is a schematic structural diagram of a speech recognition apparatus according to a second embodiment of the present disclosure. As shown in fig. 5, the speech recognition apparatus includes: an acquisition module 50, a transmission module 52, and a receiving module 54, wherein:

an acquisition module 50, configured to acquire audio data to be recognized, where the audio data to be recognized includes conversation contents of a plurality of target objects;

a transmission module 52, configured to input the audio data to be recognized into a speech recognition model, where the speech recognition model is obtained by training a neural network model through machine learning using multiple sets of tag data, and each set of data in the multiple sets of tag data includes: audio sample data of the sample object, and a sample object set obtained by extracting and processing the characteristic vector of the audio sample data;

a receiving module 54, configured to receive a speech recognition processing result returned by the speech recognition model, where the speech recognition processing result distinguishes audio content of each target object and text information corresponding to the audio content.

It should be noted that the acquisition module 50, the transmission module 52, and the receiving module 54 correspond to steps S802 to S806 in embodiment 1. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should be noted that the modules described above may be implemented in a computer terminal as part of an apparatus.

The voice recognition apparatus may further include a processor and a memory, and the acquiring module 50, the transmitting module 52, the receiving module 54, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

Fig. 6 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the methods and processes described above, for example the method of acquiring audio data to be recognized. For example, in some embodiments, the method of acquiring audio data to be recognized may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the method of acquiring audio data to be recognized described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method of acquiring audio data to be recognized by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method for acquiring a speech recognition model comprises the following steps:

acquiring a plurality of sets of label data, wherein each set of data in the plurality of sets of label data comprises: audio sample data of a sample object, and a sample object set obtained by performing feature vector extraction on the audio sample data, wherein the audio sample data comprises conversation contents of a plurality of sample objects;

and training the neural network model by machine learning by using multiple groups of label data to obtain a voice recognition model.

2. The method of claim 1, wherein the method further comprises:

extracting the feature vector of the audio sample data by adopting a multilayer time delay neural network, and extracting to obtain a first feature vector of the sample object through multiple training iterations;

selecting a preset number of the first feature vectors;

and generating the sample object set according to the selected first feature vector.

3. The method of claim 1, wherein the speech recognition model is trained by:

extracting a plurality of second feature vectors in the audio sample data by adopting an object feature encoder in the neural network model;

extracting a plurality of third feature vectors in the audio sample data by adopting a content feature encoder;

and training the neural network model to obtain the voice recognition model based on the second feature vector and the third feature vector.

4. The method of claim 3, wherein said extracting, with an object feature encoder in the neural network model, a plurality of second feature vectors in the audio sample data comprises:

performing framing processing on the audio sample data in the neural network model to obtain a plurality of audio frames;

extracting a normal distribution feature of each of the plurality of audio frames, wherein the normal distribution feature comprises: static features, first order difference features, second order difference features;

and inputting the normal distribution characteristics of the audio frames into the object characteristic encoder to obtain a plurality of second characteristic vectors.

5. The method of claim 3, wherein the training the neural network model to obtain the speech recognition model based on the second feature vector and the third feature vector comprises:

calculating a first importance coefficient corresponding to each second feature vector and a second importance coefficient corresponding to each third feature vector by adopting an attention module in the neural network model;

calculating to obtain a fourth feature vector based on the second feature vector and the first importance coefficient, and calculating to obtain a fifth feature vector based on the third feature vector and the second importance coefficient;

and training the neural network model to obtain the voice recognition model based on the fourth feature vector and the fifth feature vector.

6. The method of claim 5, wherein the method further comprises:

processing the first decoded text and the fourth feature vector by adopting a target query model in the neural network model to obtain a sixth feature vector of the sample object;

calculating, with the attention module, a degree of correlation value between the sixth feature vector and a plurality of sample objects in the set of sample objects;

determining a seventh feature vector of the set of sample objects based on the correlation metric values.

7. The method of claim 6, wherein the training the neural network model to obtain the speech recognition model based on the fourth feature vector and the fifth feature vector comprises:

acquiring an eighth eigenvector output by a semantic decoder in the neural network model for processing the first decoded text;

decoding the fifth feature vector, the seventh feature vector and the eighth feature vector by using a content decoder in the neural network model to obtain a second decoded text, wherein the second decoded text is a decoded text at the next moment of the first decoded text;

and calculating the cross entropy loss between the first decoded text and the second decoded text by adopting a minimum classification error algorithm so as to update the network parameters of the neural network model and obtain the voice recognition model.

8. A speech recognition method, comprising:

acquiring audio data to be recognized, wherein the audio data to be recognized comprises conversation contents of a plurality of target objects;

inputting the audio data to be recognized into a voice recognition model, wherein the voice recognition model is obtained by training a neural network model through machine learning by using a plurality of groups of label data, and each group of data in the plurality of groups of label data comprises: audio sample data of a sample object, and a sample object set obtained by performing feature vector extraction on the audio sample data;

and receiving a voice recognition processing result returned by the voice recognition model, wherein the voice recognition processing result distinguishes the audio content of each target object and the text information corresponding to the audio content.

9. The method of claim 8, wherein the obtaining audio data to be identified comprises:

acquiring initial audio data;

preprocessing the initial audio data to obtain the audio data to be identified, wherein the preprocessing comprises at least one of the following steps: removing silence, data enhancement, changing audio rate, time warping, frequency masking, text corpus processing.

10. An apparatus for acquiring a speech recognition model, comprising:

an acquisition unit, configured to acquire multiple sets of label data, where each set of data in the multiple sets of label data includes: audio sample data of a sample object, and a sample object set obtained by performing feature vector extraction on the audio sample data, wherein the audio sample data comprises conversation contents of a plurality of sample objects;

and the training unit is used for training the neural network model through machine learning by using a plurality of groups of label data to obtain a voice recognition model.

11. A speech recognition apparatus comprising:

an acquisition module, configured to acquire audio data to be recognized, wherein the audio data to be recognized comprises conversation contents of a plurality of target objects;

a transmission module, configured to input the audio data to be recognized into a speech recognition model, where the speech recognition model is obtained by training a neural network model through machine learning using multiple sets of label data, and each set of data in the multiple sets of label data includes: audio sample data of a sample object, and a sample object set obtained by performing feature vector extraction on the audio sample data;

and the receiving module is used for receiving a voice recognition processing result returned by the voice recognition model, wherein the voice recognition processing result distinguishes the audio content of each target object and the character information corresponding to the audio content.

12. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of obtaining a speech recognition model according to any one of claims 1-7, or a method of speech recognition according to claim 8 or 9.

13. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method of acquiring a speech recognition model according to any one of claims 1-7 or the method of speech recognition according to claim 8 or 9.

14. A computer program product comprising a computer program, wherein the computer program realizes the method of obtaining a speech recognition model according to any of claims 1-7, or the method of speech recognition according to claim 8 or 9, when executed by a processor.