CN115985302A - Voice recognition method and device, electronic equipment and storage medium


Info

Publication number
CN115985302A
Authority
CN
China
Prior art keywords
audio
extraction unit
voice recognition
feature extraction
sample
Legal status
Pending
Application number
CN202211726658.1A
Other languages
Chinese (zh)
Inventor
吴航
潘嘉
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202211726658.1A
Publication of CN115985302A

Abstract

The invention relates to the technical field of voice recognition and provides a voice recognition method and apparatus, an electronic device, and a storage medium. The audio entity feature extraction unit in the voice recognition model is obtained by training on audio and video training samples that carry no text labels, so no manual labeling is needed and labeling cost is reduced. The internal relation between the audio data and the video data can be fully mined, which makes the pre-training process pay more attention to the entities in the audio, allows it to be applied to downstream voice recognition tasks, and gives the voice recognition model the ability to improve the recognition of hot words. Moreover, the voice recognition model can complete different types of voice recognition tasks, which improves its generalization and expands its application scenarios.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the popularization of voice recognition, voice recognition technology has been applied in many fields. By means of voice recognition, a user can achieve intelligent input: text entry, instruction control, and the like can be completed simply by speaking, which greatly facilitates people's production and daily life.
However, single-modality systems still have some inherent problems. For example, Automatic Speech Recognition (ASR) has limited tolerance to noise, its performance degrades greatly under severe noise pollution, and it cannot complement missing information when part of the speech is missing. Audio-Visual Speech Recognition (AVSR) has been proposed to address these single-modality defects. Meanwhile, Visual Speech Recognition (VSR), i.e. lip-reading, suffers from homophone-like ambiguity: the same lip shape may correspond to different word pronunciations, and the same pronunciation may correspond to different lip-shape sequences.
In the prior art, audio-visual speech recognition requires a large amount of manually labeled data, and labeling data is time-consuming and expensive. Meanwhile, most prior-art models that adopt unsupervised audio-video representation learning are pre-trained on video data for action recognition and event detection and are applied to downstream tasks such as video action recognition and sound event detection; they focus on the events occurring in the videos and are rarely applied to speech recognition downstream tasks. In addition, most prior-art models that adopt unsupervised audio-video representation learning learn global features, such as the short-term instance-level representations used in sound event classification scenes, which may not be suitable for speech recognition, because speech recognition requires continuously changing sequence representations with long-term context dependency.
Disclosure of Invention
The invention provides a voice recognition method, a voice recognition device, electronic equipment and a storage medium, which are used for overcoming the defects in the prior art.
The invention provides a voice recognition method, which comprises the following steps:
acquiring audio data to be processed;
respectively inputting the audio data into an acoustic feature extraction unit and an audio entity feature extraction unit of a voice recognition model to obtain target acoustic features output by the acoustic feature extraction unit and target entity features output by the audio entity feature extraction unit;
inputting the target acoustic characteristics and the target entity characteristics into a splicing unit of the voice recognition model to obtain a splicing result output by the splicing unit;
inputting the splicing result to a voice recognition unit of the voice recognition model to obtain a voice recognition result output by the voice recognition unit;
the voice recognition model is obtained by training an audio training sample carrying character labels on the basis of the audio entity feature extraction unit.
According to the voice recognition method provided by the invention, the audio and video training samples comprise paired audio data samples and video data samples;
the audio entity feature extraction unit is obtained by training based on the following steps:
based on an initial audio characteristic extraction unit, performing characteristic extraction on the audio data sample to obtain audio sample characteristics;
based on an initial video feature extraction unit, performing feature extraction on the video data sample to obtain video sample features;
calculating a first loss function based on the audio sample characteristics and the video sample characteristics, and synchronously performing structural parameter iteration on the initial audio characteristic extraction unit and the initial video characteristic extraction unit based on the first loss function;
and taking a target audio characteristic extraction unit obtained by iterating the structural parameters as the audio entity characteristic extraction unit.
According to the voice recognition method provided by the invention, the positive sample pair in the audio and video training samples is determined based on the following steps:
acquiring a video clip of a preset time period in audio and video data, wherein the duration of the preset time period is a preset duration;
acquiring a preset number of audio clips with preset duration in the audio and video data, wherein the preset number of audio clips includes the preset time period;
and determining the video clips and the audio clips as a positive example sample pair.
According to the voice recognition method provided by the invention, the collecting, from the audio and video data, of the preset number of audio clips of the preset duration that include the preset time period comprises the following steps:
determining an intermediate time of the video segment;
selecting the preset number of audio clips by taking the middle moment as a center and a specified duration as an interval;
wherein the specified duration is less than or equal to the preset duration.
According to the speech recognition method provided by the invention, the first loss function comprises a multi-instance learning noise contrast estimation loss function.
According to the voice recognition method provided by the invention, the feature extraction is performed on the audio data sample based on the initial audio feature extraction unit to obtain the audio sample features, and this comprises the following steps:
extracting Fbank features from the audio data samples;
and inputting the Fbank features to the initial audio feature extraction unit to obtain the audio sample features output by the initial audio feature extraction unit.
According to the speech recognition method provided by the invention, the speech recognition model is obtained by training based on the following steps:
respectively inputting the audio training samples into an initial acoustic feature extraction unit and the audio entity feature extraction unit to obtain sample acoustic features output by the initial acoustic feature extraction unit and sample entity features output by the audio entity feature extraction unit;
inputting the acoustic characteristics of the sample and the entity characteristics of the sample into an initial splicing unit to obtain a sample splicing result output by the initial splicing unit;
inputting the sample splicing result to an initial voice recognition unit to obtain a sample recognition result output by the initial voice recognition unit;
and calculating a second loss function based on the sample recognition result and the character marking, and synchronously performing structural parameter iteration on the initial acoustic feature extraction unit, the initial splicing unit and the initial voice recognition unit based on the second loss function to obtain the voice recognition model.
The present invention also provides a voice recognition apparatus comprising:
the data acquisition module is used for acquiring audio data to be processed;
the feature extraction module is used for respectively inputting the audio data to an acoustic feature extraction unit and an audio entity feature extraction unit of a voice recognition model to obtain a target acoustic feature output by the acoustic feature extraction unit and a target entity feature output by the audio entity feature extraction unit;
the feature splicing module is used for inputting the target acoustic features and the target entity features into a splicing unit of the voice recognition model to obtain a splicing result output by the splicing unit;
the voice recognition module is used for inputting the splicing result to a voice recognition unit of the voice recognition model to obtain a voice recognition result output by the voice recognition unit;
the voice recognition model is obtained by training an audio training sample carrying character labels on the basis of the audio entity feature extraction unit.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the speech recognition method as described in any of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a speech recognition method as described in any one of the above.
The invention provides a voice recognition method and apparatus, an electronic device, and a storage medium. Audio data to be processed is first acquired; then the audio data is respectively input into the acoustic feature extraction unit and the audio entity feature extraction unit of the voice recognition model to obtain the target acoustic features output by the acoustic feature extraction unit and the target entity features output by the audio entity feature extraction unit; the target acoustic features and the target entity features are input into the splicing unit of the voice recognition model to obtain the splicing result output by the splicing unit; and finally, the splicing result is input into the voice recognition unit of the voice recognition model to obtain the voice recognition result output by the voice recognition unit. By extracting the target entity features in the audio data, the method can greatly improve the accuracy of the voice recognition result, improve the efficiency of voice recognition, and reduce the cost of voice recognition. The audio entity feature extraction unit in the voice recognition model is obtained by training on audio and video training samples that carry no text labels, so no manual labeling is needed and labeling cost is reduced; the internal relation between the audio data and the video data can be fully mined, which makes the pre-training process pay more attention to the entities in the audio, allows it to be applied to downstream voice recognition tasks, and gives the voice recognition model the ability to improve the recognition of hot words. Moreover, the voice recognition model can complete different types of voice recognition tasks, which improves its generalization and expands its application scenarios.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings needed in the description of the embodiments or of the prior art will be briefly introduced below. Obviously, the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a speech recognition method according to the present invention;
FIG. 2 is a schematic diagram of a positive example sample pair in the speech recognition method provided by the present invention;
FIG. 3 is a second schematic flow chart of a speech recognition method according to the present invention;
FIG. 4 is a schematic diagram of a training process of an audio entity feature extraction unit in the speech recognition method according to the present invention;
FIG. 5 is a schematic diagram of a voice recognition apparatus according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The existing audio-video pre-training schemes generally include Audio-Visual Correspondence (AVC), Audio-Visual Temporal Synchronization (AVTS), Cross-Modal Deep Clustering (XDC), and other schemes designed around the idea of audio-video synchronization. These approaches are based primarily on the observation that visual and acoustic events usually occur simultaneously, and construct a pre-training task accordingly: unsupervised video is cut into short video clips of 1-3 s, and the video features of each clip should be correlated with its corresponding audio features while being uncorrelated with the audio features of other clips. A network structure can therefore be constructed to extract audio features and video features separately, and the model learns by judging whether the input audio and video are related. However, such pre-training schemes mostly use action-recognition and event-detection video data for pre-training and are applied to downstream tasks such as video action recognition and sound event detection; they focus on the events occurring in videos, are rarely applied to speech recognition downstream tasks, and do not provide the capability of improving the hot word recognition effect.
In addition, most prior-art models that adopt unsupervised audio-video representation learning learn global features, such as the short-term instance-level representations used in sound event classification scenes, which may not be suitable for speech recognition, because speech recognition requires continuously changing sequence representations with long-term context dependency.
Based on this, the embodiment of the invention provides a voice recognition method.
Fig. 1 is a schematic flow chart of a speech recognition method provided in an embodiment of the present invention, as shown in fig. 1, the method includes:
s1, audio data to be processed are obtained;
s2, the audio data are respectively input into an acoustic feature extraction unit and an audio entity feature extraction unit of a voice recognition model, and target acoustic features output by the acoustic feature extraction unit and target entity features output by the audio entity feature extraction unit are obtained;
s3, inputting the target acoustic characteristics and the target entity characteristics into a splicing unit of the voice recognition model to obtain a splicing result output by the splicing unit;
s4, inputting the splicing result to a voice recognition unit of the voice recognition model to obtain a voice recognition result output by the voice recognition unit;
the voice recognition model is obtained by training an audio training sample carrying character labels on the basis of the audio entity feature extraction unit.
Specifically, the execution subject of the speech recognition method provided in the embodiment of the present invention is a speech recognition apparatus. The apparatus may be configured in a computer, which may be a local computer or a cloud computer; the local computer may be a desktop computer, a tablet, or the like, which is not specifically limited herein.
Step S1 is executed first, and audio data to be processed, which refers to speech that needs to be converted into text, is obtained. The field to which the audio data relates may be set according to actual conditions, and is not particularly limited herein.
Then, step S2 is performed, and a speech recognition model is introduced, where the speech recognition model may include an acoustic feature extraction unit, an audio entity feature extraction unit, a concatenation (concat) unit, and a speech recognition unit. Here, the audio data may be input to the acoustic feature extraction unit and the audio entity feature extraction unit of the speech recognition model, respectively, to obtain the target acoustic feature output by the acoustic feature extraction unit and the target entity feature output by the audio entity feature extraction unit. The acoustic feature extraction unit may be an Encoder (Encoder) structure, and the audio entity feature extraction unit may be an LSTM model. The target acoustic features may be used to characterize acoustic information in the audio data and the target entity features may be used to characterize entity information in the audio data.
Then, step S3 is executed: the target acoustic features and the target entity features are input into the splicing unit of the voice recognition model to obtain the splicing result output by the splicing unit. The splicing unit can splice the target acoustic features and the target entity features in the channel dimension. It can be understood that, if the frame-length dimensions of the target acoustic features and the target entity features are different, they need to be adjusted to the same frame length by means of copying.
And finally, executing the step S4, inputting the splicing result into a voice recognition unit of the voice recognition model, and obtaining a voice recognition result output by the voice recognition unit. The voice recognition result is a text obtained by converting the audio data to be processed.
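To make the above flow concrete, the following is a minimal sketch of the four units in PyTorch; the framework, layer types, and dimensions are assumptions for illustration and are not specified by this description (in particular, a linear classifier stands in for the decoder-style voice recognition unit).

```python
import torch
import torch.nn as nn

class VoiceRecognitionModel(nn.Module):
    """Sketch of the four-unit model; sizes and layer choices are illustrative."""
    def __init__(self, feat_dim=80, model_dim=256, vocab_size=5000):
        super().__init__()
        # Acoustic feature extraction unit (an encoder over Fbank frames).
        self.input_proj = nn.Linear(feat_dim, model_dim)
        self.acoustic_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True),
            num_layers=6)
        # Audio entity feature extraction unit (pre-trained LSTM, frozen for fine-tuning).
        self.entity_extractor = nn.LSTM(feat_dim, model_dim, batch_first=True)
        # Voice recognition unit (a classifier standing in for the decoder).
        self.recognizer = nn.Linear(2 * model_dim, vocab_size)

    def forward(self, fbank):                          # fbank: (batch, frames, feat_dim)
        acoustic = self.acoustic_encoder(self.input_proj(fbank))          # (B, T, D)
        _, (h, _) = self.entity_extractor(fbank)       # utterance-level entity feature
        entity = h[-1].unsqueeze(1).expand(-1, acoustic.size(1), -1)      # copy along frame dim
        spliced = torch.cat([acoustic, entity], dim=-1)                   # concat on channel dim
        return self.recognizer(spliced)                # per-frame token scores
```

Under these assumed sizes, a call such as `VoiceRecognitionModel()(torch.randn(2, 200, 80))` would return token scores of shape (2, 200, 5000).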
The audio entity feature extraction unit in the speech recognition model can be obtained by performing unsupervised training based on an audio and video training sample without text labels, wherein the audio and video training sample can be audio and video data, and the audio and video data refers to synchronous audio data and video data. Here, the audio and video training samples can be used to perform self-supervision training on the initial audio feature extraction unit and the initial video feature extraction unit synchronously, so as to obtain a target audio feature extraction unit and a target video feature extraction unit respectively. After that, the trained target audio feature extraction unit can be used as an audio entity feature extraction unit.
Because the audio and the video in an audio-video training sample are synchronized and accompany each other, after training on the audio-video training samples, the target audio feature extraction unit acquires the capability of extracting entity features from audio, and the target video feature extraction unit acquires the capability of extracting entity features from video. On this basis, using the target audio feature extraction unit as the audio entity feature extraction unit introduces the extraction of entity features into the voice recognition process, thereby improving the accuracy of entity recognition in voice recognition and hence the accuracy of the voice recognition result.
The voice recognition model is obtained, on the basis of the audio entity feature extraction unit, by training with audio training samples carrying text labels; that is, the structural parameters of the audio entity feature extraction unit are fixed, and the other units in the voice recognition model are obtained through training. The text labels can be processed into sub-word units according to BPE (byte pair encoding) word-piece modeling, which is not specifically limited herein.
Here, it can be understood that the speech recognition model is obtained by training an initial speech recognition model by using an audio training sample carrying a text label, the initial speech recognition model includes a part to be trained and a part not to be trained, the part to be trained includes an initial acoustic feature extraction unit, an initial splicing unit and an initial speech recognition unit, and the part not to be trained is an audio entity feature extraction unit because the unit has completed self-supervision training before.
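Under the same PyTorch assumption, treating the audio entity feature extraction unit as the part not to be trained amounts to freezing its parameters before the supervised training starts (a sketch; `model` is the hypothetical VoiceRecognitionModel above, already loaded with the self-supervised pre-trained weights):

```python
# Freeze the pre-trained audio entity feature extraction unit.
for param in model.entity_extractor.parameters():
    param.requires_grad = False      # excluded from supervised training

# Only the remaining units receive gradient updates.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```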
The voice recognition method provided by the embodiment of the invention first acquires the audio data to be processed; then the audio data is respectively input into the acoustic feature extraction unit and the audio entity feature extraction unit of the voice recognition model to obtain the target acoustic features output by the acoustic feature extraction unit and the target entity features output by the audio entity feature extraction unit; the target acoustic features and the target entity features are input into the splicing unit of the voice recognition model to obtain the splicing result output by the splicing unit; and finally, the splicing result is input into the voice recognition unit of the voice recognition model to obtain the voice recognition result output by the voice recognition unit. By extracting the target entity features in the audio data, the method can greatly improve the accuracy of the voice recognition result, improve the efficiency of voice recognition, and reduce the cost of voice recognition. The audio entity feature extraction unit in the voice recognition model is obtained by training on audio and video training samples that carry no text labels, so no manual labeling is needed and labeling cost is reduced; the internal relation between the audio data and the video data can be fully mined, which makes the pre-training process pay more attention to the entities in the audio, allows it to be applied to downstream voice recognition tasks, and gives the voice recognition model the ability to improve the recognition of hot words. Moreover, the voice recognition model can complete different types of voice recognition tasks, which improves its generalization and expands its application scenarios.
On the basis of the above embodiment, in the speech recognition method provided in the embodiment of the present invention, the audio/video training samples include paired audio data samples and video data samples;
the audio entity feature extraction unit is obtained by training based on the following steps:
based on an initial audio characteristic extraction unit, performing characteristic extraction on the audio data sample to obtain audio sample characteristics;
based on an initial video feature extraction unit, carrying out feature extraction on the video data sample to obtain video sample features;
calculating a first loss function based on the audio sample characteristics and the video sample characteristics, and synchronously performing structural parameter iteration on the initial audio characteristic extraction unit and the initial video characteristic extraction unit based on the first loss function;
and taking a target audio characteristic extraction unit obtained by the iteration of the structural parameters as the audio entity characteristic extraction unit.
Specifically, in the embodiment of the present invention, the audio-video training samples may include paired audio data samples and video data samples, and the number of pairs of audio data samples and video data samples included in the audio-video training samples may be set as needed, which is not specifically limited herein. The paired audio data samples and video data samples may form positive example sample pairs or negative example sample pairs. A positive example sample pair consists of an audio data sample and a video data sample that correspond in time, and a negative example sample pair consists of an audio data sample and a video data sample that do not correspond in time. Here, temporal correspondence may mean the same time period, or a containment relation between the two periods. For example, the audio data sample and the video data sample in a positive example pair may cover the same time period with the same duration, for example both 3.2 s; alternatively, the audio data sample in a positive example pair may be a combination of several 3.2 s segments taken from a longer time period that contains the video data sample, where the longer time period may be 9.6 s, 16 s, and so on.
The frame rate of the video data samples may be set according to the computational complexity and the model effect, and may be, for example, 10 fps or another value, which is not limited herein. The video data samples may have three RGB channels, and the image size per frame may be 112 × 112 or another size. The audio data samples may be single-channel, and the sampling rate may be 16000 Hz or another value.
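A sketch of preparing one sample to the shapes just mentioned (10 fps RGB frames resized to 112 × 112, single-channel 16 kHz audio); the use of torchvision/torchaudio and the helper name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
import torchaudio
from torchvision.io import read_video

def load_sample(path, clip_seconds=3.2, target_fps=10, target_sr=16000):
    """Load one audio-video clip to the illustrative shapes described above."""
    frames, audio, info = read_video(path, pts_unit="sec")    # frames: (T, H, W, C) uint8
    # Keep roughly `clip_seconds` of video at `target_fps` via uniform frame subsampling.
    step = max(int(round(info["video_fps"] / target_fps)), 1)
    frames = frames[::step][: int(clip_seconds * target_fps)]
    frames = frames.permute(0, 3, 1, 2).float() / 255.0       # (T, C, H, W), RGB in [0, 1]
    frames = F.interpolate(frames, size=(112, 112))            # resize each frame to 112 x 112
    # Mono 16 kHz waveform for the audio branch.
    wav = audio.mean(dim=0, keepdim=True)                      # collapse channels to one
    wav = torchaudio.functional.resample(wav, int(info["audio_fps"]), target_sr)
    return frames, wav
```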
On the basis, when the audio entity feature extraction unit is obtained through training, the audio data sample can be firstly input into the initial audio feature extraction unit, and feature extraction is performed on the audio data sample by using the initial audio feature extraction unit to obtain the feature of the audio sample. The network structure of the initial audio feature extraction unit may be an LSTM model.
The video data sample is input into the initial video feature extraction unit, and the initial video feature extraction unit performs feature extraction on the video data sample to obtain the video sample features. The network structure of the initial video feature extraction unit may be a 3D ResNet-18 (3Dres18) network, and its input size may be 32 × 3 × 112 × 112.
Thereafter, the audio sample features and the video sample features may be used to calculate a first loss function, that is, the first loss function may be calculated by taking paired audio sample features and video sample features as positive examples and unpaired audio sample features and video sample features as negative examples. Further, the first loss function may be utilized to synchronously perform structural parameter iteration on the initial audio feature extraction unit and the initial video feature extraction unit until a preset condition is reached, and a target audio feature extraction unit corresponding to the initial audio feature extraction unit and a target video feature extraction unit corresponding to the initial video feature extraction unit are respectively obtained. Wherein the predetermined condition may include reaching a predetermined number of iterations or convergence of the first loss function.
And then, directly taking a target audio characteristic extraction unit obtained by the iteration of the structural parameters as an audio entity characteristic extraction unit for extracting entity characteristics in subsequent audio data.
In the embodiment of the invention, the initial audio characteristic extraction unit and the initial video characteristic extraction unit are jointly trained by using the audio and video training samples, so that the entity information related to the audio and the video can be fully mined, and the entity recognition accuracy in the voice recognition can be further improved.
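A minimal sketch of this joint pre-training step, again under the PyTorch assumption: a 3D ResNet-18 stands in for the initial video feature extraction unit and an LSTM for the initial audio feature extraction unit, `loader` is a hypothetical data loader of positive example pairs, and `mil_nce_loss` is the first loss function sketched later in this description.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class AudioBranch(nn.Module):                       # initial audio feature extraction unit
    def __init__(self, feat_dim=80, embed_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, embed_dim, batch_first=True)
    def forward(self, fbank):                       # (B, T, feat_dim) Fbank frames
        _, (h, _) = self.lstm(fbank)
        return h[-1]                                # (B, embed_dim) audio sample feature

video_branch = r3d_18(num_classes=512)              # initial video feature extraction unit
audio_branch = AudioBranch()
optimizer = torch.optim.Adam(
    list(video_branch.parameters()) + list(audio_branch.parameters()), lr=1e-4)

for video_clip, audio_clips in loader:               # hypothetical loader of positive pairs
    # video_clip: (B, 3, 32, 112, 112); audio_clips: (B, K, T, 80), K clips per video clip
    g_y = video_branch(video_clip)                   # video sample features
    B, K, T, D = audio_clips.shape
    f_x = audio_branch(audio_clips.reshape(B * K, T, D)).reshape(B, K, -1)
    loss = mil_nce_loss(f_x, g_y)                    # first loss function (MIL-NCE)
    optimizer.zero_grad()
    loss.backward()                                  # iterate both units' structural parameters
    optimizer.step()
```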
Because the narration of videos in commentary scenes (i.e. commentary videos) such as sports commentary, game commentary, and live-streaming sales has a certain randomness, there may be a certain temporal misalignment between the commentator's narration and the actual video content; that is, what the commentator is currently describing may be the video picture from a few seconds earlier or later rather than the picture at the current moment. This is because the commentator's narration has a certain divergence and lag and cannot be completely and strictly aligned with the video at the current moment. At the same time, however, the commentary does not come out of nowhere, so the video picture corresponding to the narration can be found within a short time period before or after it. For this reason, a plurality of positive example sample pairs may be constructed around the corresponding video picture, and among these positive example sample pairs there is considered to be one in which the narration completely corresponds to the video picture.
Based on this, on the basis of the above embodiment, in the speech recognition method provided in the embodiment of the present invention, the positive example sample pair in the audio/video training samples is determined based on the following steps:
acquiring a video clip of a preset time period in audio and video data, wherein the duration of the preset time period is a preset duration;
acquiring a preset number of audio clips with preset duration in the audio and video data, wherein the preset number of audio clips includes the preset time period;
determining the video segment and the preset number of audio segments as one positive example sample pair.
Specifically, in the embodiment of the present invention, when positive example sample pairs in the audio-video training samples are selected, audio-video data may be obtained first, and the audio-video data may be commentary-type audio-video data. The audio in the audio-video data can be processed with a voice activity detection (VAD) tool, and silent segments longer than 1 s are removed from the audio-video data.
Afterwards, the video data in the audio-video data can be equally divided into a plurality of video segments of the preset duration, and the time period corresponding to each video segment can be used as a preset time period. Furthermore, the preset number of audio clips of the preset duration that include the preset time period can be collected from the audio-video data. The preset number can be set as required and may be an odd or even number, for example 2, 3, 4, 5, etc., which is not specifically limited herein.
Finally, the video clip of the preset time period and the preset number of audio clips of the preset duration can be directly used as one positive example sample pair; that is, the video clip of the preset time period is used as the input of the initial video feature extraction unit, and the preset number of audio clips of the preset duration are used as the input of the initial audio feature extraction unit. In this way, the audio-video training sample is guaranteed to contain audio clips from a period of time before and after the preset time period of the video clip, which alleviates the audio-video misalignment problem, further ensures the accuracy of entity feature recognition by the audio entity feature extraction unit, and improves its performance.
On the basis of the foregoing embodiment, the voice recognition method provided in the embodiment of the present invention, wherein the acquiring of a preset number of audio segments with a preset duration, which include the preset time period, from the audio/video data includes:
determining an intermediate time of the video clip;
selecting the preset number of audio clips by taking the middle moment as a center and a specified duration as an interval;
wherein the specified duration is less than or equal to the preset duration.
Specifically, in the embodiment of the present invention, when the preset number of audio clips of the preset duration that include the preset time period are collected from the audio-video data, as shown in fig. 2, the middle moment t0 of the video segment v may be determined, and then the preset number of audio clips may be selected with the middle moment t0 as the center and the specified duration as the interval. The specified duration is less than or equal to the preset duration, i.e. the audio clips collected in adjacent intervals have overlapping segments, which helps ensure the training effect of the audio entity feature extraction unit.
For example, if the preset duration is 3.2 s, the specified duration may be less than or equal to 3.2 s, for example 1.6 s. Fig. 2 shows the case where the preset number is 5. In this case, there are 5 audio clips a1, a2, a3, a4, and a5, and the video clip v together with the audio clips a1, a2, a3, a4, and a5 serves as a pair of audio data samples and video data samples.
In the embodiment of the invention, the middle time of the video clip is taken as the center, the specified duration is taken as the interval, and the preset number of audio clips are selected, so that the audio clips can be uniformly distributed in front of and behind the video clip, and the probability of the corresponding content of each audio clip and the video clip is higher. Moreover, the specified duration is less than or equal to the preset duration, so that overlapped segments exist among the audio segments, and the performance of the trained audio entity feature extraction unit can be better.
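The centered clip selection can be written as a small helper (a sketch; the function name and the 16 kHz sampling rate are assumptions, and the default values match the 3.2 s / 1.6 s / 5-clip example above).

```python
def select_audio_clips(wav, video_start, clip_len=3.2, interval=1.6, num_clips=5, sr=16000):
    """Pick `num_clips` audio clips of `clip_len` seconds centered on a video clip.

    wav:         mono waveform of the whole audio track, shape (samples,)
    video_start: start time of the video clip in seconds
    """
    t0 = video_start + clip_len / 2.0                 # middle moment of the video clip
    offsets = [(i - num_clips // 2) * interval for i in range(num_clips)]
    clips = []
    for off in offsets:                                # clip centers at t0 - 2*interval ... t0 + 2*interval
        start = max(t0 + off - clip_len / 2.0, 0.0)
        s = int(start * sr)
        clips.append(wav[s: s + int(clip_len * sr)])
    return clips                                       # overlapping clips, since interval <= clip_len
```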
On the basis of the foregoing embodiments, in the speech recognition method provided in the embodiments of the present invention, the first loss function includes a multiple-instance learning noise contrastive estimation loss function.
Specifically, in the embodiment of the present invention, in the process of training the audio entity feature extraction unit, it is desirable that the learned video features are closer to the audio features in the positive example sample pairs and farther from the audio features in the negative example sample pairs. In practice, however, it cannot be determined which of the candidate positive example pairs is the true positive example. Therefore, the first loss function is set as the Multiple-Instance Learning Noise Contrastive Estimation loss (MIL-NCE loss), which turns the above problem into the following optimization problem:

$$\max_{f,g}\;\sum_{i=1}^{n}\log\frac{\sum_{(x,y)\in\mathcal{P}_i}e^{f(x)^{\top}g(y)}}{\sum_{(x,y)\in\mathcal{P}_i}e^{f(x)^{\top}g(y)}+\sum_{(x',y')\in\mathcal{N}_i}e^{f(x')^{\top}g(y')}}$$

wherein n is the number of training batches, $\mathcal{P}$ represents the set of positive example sample pairs, (x, y) represents one positive example sample pair, x represents the audio data sample in a positive example pair, y represents the video data sample in a positive example pair, f(x) represents the audio sample features of x, g(y) represents the video sample features of y, $\mathcal{N}$ represents the set of negative example sample pairs, (x', y') represents one negative example sample pair, x' represents the audio data sample in a negative example pair, y' represents the video data sample in a negative example pair, f(x') represents the audio sample features of x', and g(y') represents the video sample features of y'.
The idea is to avoid the problem that the true positive pair is unknown by pulling in the distances of all candidate positive example sample pairs. As training proceeds, the true positive pair gradually plays the leading role, so that the distance to the true positive pair is shortened and the distance to the negative example sample pairs is enlarged.
In the commentary scene, the correspondence between a video picture and the commentary content is considered to be determined mainly by entity content. For example, the image of player MX in a soccer commentary video is strongly correlated with the word "MX" mentioned in the commentator's speech, and the training process tends to make the target audio feature extraction unit and the target video feature extraction unit learn the key entity information "MX". Therefore, through the above training steps, the obtained target audio feature extraction unit and target video feature extraction unit can learn well the capability of extracting entity features from speech and video, and the target audio feature extraction unit trained in this way can be used to assist voice recognition, thereby improving the recognition of entity features such as proper nouns in speech and video.
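A sketch of this loss under the PyTorch assumption; it takes, for each video clip, the features of its candidate positive audio clips and treats the clips paired with the other video clips in the batch as negatives (the batching scheme is an assumption).

```python
import torch

def mil_nce_loss(f_x, g_y):
    """MIL-NCE style loss (sketch).

    f_x: (B, K, D) audio sample features, K candidate positive clips per video clip
    g_y: (B, D)    video sample features
    """
    B, K, D = f_x.shape
    # Similarity of every video feature with every audio clip in the batch: (B, B, K).
    sim = torch.einsum("bd,ckd->bck", g_y, f_x)
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)   # own candidate clips
    all_ = torch.logsumexp(sim.reshape(B, -1), dim=-1)                     # candidates + negatives
    return -(pos - all_).mean()        # maximize the share taken by the positive pairs
```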
On the basis of the foregoing embodiment, the speech recognition method provided in the embodiment of the present invention, where the feature extraction is performed on the audio data sample based on the initial audio feature extraction unit to obtain the feature of the audio sample, includes:
extracting Fbank features in the audio data samples;
and inputting the Fbank characteristics to the initial audio characteristic extraction unit to obtain the audio sample characteristics output by the initial audio characteristic extraction unit.
Specifically, in the embodiment of the present invention, the input of the initial audio feature extraction unit may be an audio data sample, or an Fbank feature in the audio data sample. Therefore, the Fbank feature in the audio data sample, for example, the Fbank feature with 80 dimensions, may be extracted first. In extracting the Fbank feature, the window length may be set to 25ms and the frame shift may be set to 10ms.
After that, the extracted Fbank features can be input to the initial audio feature extraction unit, so as to obtain the audio sample features output by the initial audio feature extraction unit.
In the embodiment of the present invention, there are two reasons for using the Fbank features: 1) the Fbank features convert the audio from the time domain to the frequency domain, so stable audio characteristics can be extracted; 2) the Fbank feature spectrum is two-dimensional and can be processed with deep neural networks such as convolutional networks and Transformers.
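A sketch of extracting these 80-dimensional Fbank features with a 25 ms window and 10 ms frame shift; the use of torchaudio's Kaldi-compatible fbank routine is an assumption.

```python
import torchaudio

def extract_fbank(wav, sample_rate=16000):
    """wav: (1, samples) mono waveform; returns (frames, 80) Fbank features."""
    return torchaudio.compliance.kaldi.fbank(
        wav,
        sample_frequency=sample_rate,
        num_mel_bins=80,        # 80-dimensional Fbank
        frame_length=25.0,      # 25 ms window
        frame_shift=10.0,       # 10 ms frame shift
    )
```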
On the basis of the above embodiment, in the speech recognition method provided in the embodiment of the present invention, the speech recognition model is determined based on the following steps:
respectively inputting the audio training samples into an initial acoustic feature extraction unit and the audio entity feature extraction unit to obtain sample acoustic features output by the initial acoustic feature extraction unit and sample entity features output by the audio entity feature extraction unit;
inputting the acoustic characteristics of the sample and the entity characteristics of the sample into an initial splicing unit to obtain a sample splicing result output by the initial splicing unit;
inputting the sample splicing result to an initial voice recognition unit to obtain a sample recognition result output by the initial voice recognition unit;
calculating a second loss function based on the sample recognition result and the character label, and synchronously performing structural parameter iteration on the initial acoustic feature extraction unit, the initial splicing unit and the initial voice recognition unit based on the second loss function to obtain a target acoustic feature extraction unit, a target splicing unit and a target voice recognition unit;
and determining the voice recognition model based on the target acoustic feature extraction unit, the target splicing unit, the target voice recognition unit and the audio entity feature extraction unit.
Specifically, when the initial recognition model is trained to obtain the speech recognition model, the audio training samples may be input to the initial acoustic feature extraction unit and the audio entity feature extraction unit, respectively, to obtain the sample acoustic features output by the initial acoustic feature extraction unit and the sample entity features output by the audio entity feature extraction unit.
Then, the sample entity features are copied in the frame-length dimension, the copied result and the sample acoustic features are input into the initial splicing unit, and the initial splicing unit splices them in the channel dimension to obtain the sample splicing result. Next, the sample splicing result is input into the initial voice recognition unit to obtain the sample recognition result output by the initial voice recognition unit. Finally, the second loss function is calculated using the sample recognition result and the text labels. The initial acoustic feature extraction unit can be an encoder and the initial speech recognition unit can be a decoder, the two forming an encoder-decoder structure; in this case, the second loss function may be a cross-entropy loss function, for example CE loss. The initial speech recognition unit may also be a fully connected network, in which case the second loss function may be CTC loss.
According to the second loss function, structural parameter iteration can be synchronously performed on the initial acoustic feature extraction unit, the initial splicing unit and the initial voice recognition unit, and finally the target acoustic feature extraction unit, the target splicing unit and the target voice recognition unit are obtained.
And finally, combining the target acoustic feature extraction unit, the target splicing unit, the target voice recognition unit and the audio entity feature extraction unit to jointly form a voice recognition model.
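A minimal sketch of this supervised training stage under the same PyTorch assumption; `model` and `labeled_loader` are hypothetical, the cross-entropy variant of the second loss function is shown, and frame-aligned token targets are assumed for simplicity.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)            # second loss function (CE over tokens; 0 assumed as padding)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)   # entity extractor stays frozen

for fbank, token_ids in labeled_loader:                     # hypothetical audio training samples with text labels
    logits = model(fbank)                                    # (B, T, vocab): acoustic + entity features, spliced, recognized
    # CE loss against the text labels; a CTC loss could be used instead when the
    # recognition unit is a fully connected network, as described above.
    loss = criterion(logits.transpose(1, 2), token_ids)      # CrossEntropyLoss expects (B, C, T)
    optimizer.zero_grad()
    loss.backward()                                           # iterate encoder, splicing, and recognition units
    optimizer.step()
```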
To sum up, as shown in fig. 3, a flow diagram of a speech recognition method provided in the embodiment of the present invention is shown, in a speech recognition model adopted in the method, an acoustic feature extraction unit is an encoder, an audio entity feature extraction unit is an LSTM model, and a speech recognition unit is a decoder. The method comprises the following steps:
1) Acquiring Audio data to be processed;
2) Respectively inputting Audio data Audio to an encoder and an LSTM model to obtain target acoustic characteristics output by the encoder and target entity characteristics output by the LSTM model;
3) Inputting the target acoustic characteristics and the target entity characteristics into a splicing unit to obtain a splicing result output by the splicing unit;
4) And inputting the splicing result into a decoder, and performing autoregressive decoding by the decoder to obtain and output a voice recognition result.
As shown in fig. 4, in the training process of the audio entity feature extraction unit, the network structure adopted by the initial video feature extraction unit is a 3D ResNet-18 (3Dres18) network, and the network structure adopted by the initial audio feature extraction unit is an LSTM model. The input of the initial video feature extraction unit is a 3.2 s video clip, processed into the form 32 × 3 × 112 × 112, from which the initial video feature extraction unit extracts the video sample features g(y); the input of the initial audio feature extraction unit is the Fbank features extracted from 5 audio clips of 3.2 s each, from which the initial audio feature extraction unit extracts 5 audio sample features f(x). In order to alleviate the audio-video misalignment problem, the initial audio feature extraction unit is fed 5 audio clips of 3.2 s taken before and after the time period corresponding to the input of the initial video feature extraction unit, and the 5 extracted audio features serve as subsequent positive samples.
In the training process, MIL-NCE loss is adopted as a first loss function for supervision.
The scheme provided by the embodiment of the invention improves the voice recognition effect in commentary scenes based on audio-video pre-training, and in particular improves the hot word recognition effect. First, a large amount of unsupervised commentary video data is collected and then cleaned; next, a network is constructed to extract video features and audio features separately, and the MIL-NCE loss is used to handle the video-audio misalignment that may exist in commentary videos, so that audio-video pre-training is realized. Since the most salient related information in commentary video and audio is generally an entity such as a person or a thing, this pre-training is considered to significantly enhance the model's ability to extract entity information. After the pre-training is finished, the target audio feature extraction unit is taken out; on the downstream supervised voice recognition task, the information extracted by the target audio feature extraction unit is spliced with the features extracted by the encoder of the original voice recognition network and sent to the decoder together for voice recognition training. In this way, the scheme improves the voice recognition model's ability to capture entity information through audio-video pre-training, thereby further improving the recognition of proper nouns and other entity parts of speech in voice recognition.
As shown in fig. 5, on the basis of the above embodiment, an embodiment of the present invention provides a speech recognition apparatus, including:
a data obtaining module 51, configured to obtain audio data to be processed;
the feature extraction module 52 is configured to input the audio data to an acoustic feature extraction unit and an audio entity feature extraction unit of a speech recognition model, respectively, to obtain a target acoustic feature output by the acoustic feature extraction unit and a target entity feature output by the audio entity feature extraction unit;
the feature splicing module 53 is configured to input the target acoustic feature and the target entity feature to a splicing unit of the speech recognition model, so as to obtain a splicing result output by the splicing unit;
a speech recognition module 54, configured to input the concatenation result to a speech recognition unit of the speech recognition model, so as to obtain a speech recognition result output by the speech recognition unit;
the voice recognition model is obtained by training an audio training sample carrying character labels on the basis of the audio entity feature extraction unit.
On the basis of the above embodiment, in the speech recognition apparatus provided in the embodiment of the present invention, the audio/video training samples include paired audio data samples and video data samples;
the voice recognition device comprises an audio and video pre-training module and is used for:
based on an initial audio characteristic extraction unit, performing characteristic extraction on the audio data sample to obtain audio sample characteristics;
based on an initial video feature extraction unit, performing feature extraction on the video data sample to obtain video sample features;
calculating a first loss function based on the audio sample characteristics and the video sample characteristics, and synchronously performing structural parameter iteration on the initial audio characteristic extraction unit and the initial video characteristic extraction unit based on the first loss function;
and taking a target audio characteristic extraction unit obtained by the iteration of the structural parameters as the audio entity characteristic extraction unit.
On the basis of the foregoing embodiment, the speech recognition apparatus provided in the embodiment of the present invention further includes a positive example sample pair determining module, configured to:
acquiring a video clip of a preset time period in audio and video data, wherein the duration of the preset time period is a preset duration;
acquiring a preset number of audio clips with the preset duration, including the preset time period, in the audio and video data;
determining the video segment and the preset number of audio segments as one positive example sample pair.
On the basis of the foregoing embodiment, in the speech recognition apparatus provided in the embodiment of the present invention, the positive example sample pair determining module is specifically configured to:
determining an intermediate time of the video segment;
selecting the preset number of audio clips by taking the middle moment as a center and a specified duration as an interval;
wherein the specified duration is less than or equal to the preset duration.
On the basis of the foregoing embodiments, in the speech recognition apparatus provided in the embodiments of the present invention, the first loss function includes a multiple-instance learning noise contrastive estimation loss function.
On the basis of the foregoing embodiment, in the speech recognition device provided in the embodiment of the present invention, the audio/video pre-training module is specifically configured to:
extracting Fbank features in the audio data samples;
and inputting the Fbank characteristics to the initial audio characteristic extraction unit to obtain the audio sample characteristics output by the initial audio characteristic extraction unit.
On the basis of the foregoing embodiment, the speech recognition apparatus provided in the embodiment of the present invention further includes a model training module, configured to:
respectively inputting the audio training samples into an initial acoustic feature extraction unit and the audio entity feature extraction unit to obtain sample acoustic features output by the initial acoustic feature extraction unit and sample entity features output by the audio entity feature extraction unit;
inputting the acoustic characteristics of the sample and the entity characteristics of the sample into an initial splicing unit to obtain a sample splicing result output by the initial splicing unit;
inputting the sample splicing result to an initial voice recognition unit to obtain a sample recognition result output by the initial voice recognition unit;
calculating a second loss function based on the sample recognition result and the character marking, and synchronously performing structural parameter iteration on the initial acoustic feature extraction unit, the initial splicing unit and the initial voice recognition unit based on the second loss function to obtain a target acoustic feature extraction unit, a target splicing unit and a target voice recognition unit;
and determining the voice recognition model based on the target acoustic feature extraction unit, the target splicing unit, the target voice recognition unit and the audio entity feature extraction unit.
Specifically, the functions of the modules in the speech recognition apparatus provided in the embodiment of the present invention correspond to the operation flows of the steps in the embodiments of the methods one to one, and the implementation effects are also consistent.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a Processor (Processor) 610, a communication Interface (Communications Interface) 620, a Memory (Memory) 630 and a communication bus 640, wherein the Processor 610, the communication Interface 620 and the Memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the speech recognition methods provided in the various embodiments described above, including: acquiring audio data to be processed; respectively inputting the audio data into an acoustic feature extraction unit and an audio entity feature extraction unit of a voice recognition model to obtain target acoustic features output by the acoustic feature extraction unit and target entity features output by the audio entity feature extraction unit; inputting the target acoustic characteristics and the target entity characteristics into a splicing unit of the voice recognition model to obtain a splicing result output by the splicing unit; inputting the splicing result to a voice recognition unit of the voice recognition model to obtain a voice recognition result output by the voice recognition unit; the voice recognition model is obtained by training an audio training sample carrying character labels on the basis of the audio entity feature extraction unit.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of executing the speech recognition method provided in the above embodiments, the method including: acquiring audio data to be processed; respectively inputting the audio data into an acoustic feature extraction unit and an audio entity feature extraction unit of a voice recognition model to obtain target acoustic features output by the acoustic feature extraction unit and target entity features output by the audio entity feature extraction unit; inputting the target acoustic characteristics and the target entity characteristics into a splicing unit of the voice recognition model to obtain a splicing result output by the splicing unit; inputting the splicing result to a voice recognition unit of the voice recognition model to obtain a voice recognition result output by the voice recognition unit; the voice recognition model is obtained by training an audio training sample carrying character labels on the basis of the audio entity feature extraction unit.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the speech recognition method provided in the above embodiments, the method including: acquiring audio data to be processed; inputting the audio data separately into an acoustic feature extraction unit and an audio entity feature extraction unit of a speech recognition model to obtain target acoustic features output by the acoustic feature extraction unit and target entity features output by the audio entity feature extraction unit; inputting the target acoustic features and the target entity features into a splicing unit of the speech recognition model to obtain a splicing result output by the splicing unit; inputting the splicing result into a speech recognition unit of the speech recognition model to obtain a speech recognition result output by the speech recognition unit; where the speech recognition model is obtained by training, on the basis of the audio entity feature extraction unit, with audio training samples carrying text labels.
The above-described apparatus embodiments are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, and certainly may also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A speech recognition method, comprising:
acquiring audio data to be processed;
inputting the audio data separately into an acoustic feature extraction unit and an audio entity feature extraction unit of a speech recognition model to obtain target acoustic features output by the acoustic feature extraction unit and target entity features output by the audio entity feature extraction unit;
inputting the target acoustic features and the target entity features into a splicing unit of the speech recognition model to obtain a splicing result output by the splicing unit;
inputting the splicing result into a speech recognition unit of the speech recognition model to obtain a speech recognition result output by the speech recognition unit;
wherein the speech recognition model is obtained by training, on the basis of the audio entity feature extraction unit, with audio training samples carrying text labels.
2. The speech recognition method of claim 1, wherein audio-video training samples used to train the audio entity feature extraction unit comprise paired audio data samples and video data samples;
the audio entity feature extraction unit is obtained by training based on the following steps:
performing feature extraction on the audio data sample based on an initial audio feature extraction unit to obtain audio sample features;
performing feature extraction on the video data sample based on an initial video feature extraction unit to obtain video sample features;
calculating a first loss function based on the audio sample features and the video sample features, and synchronously iterating the structural parameters of the initial audio feature extraction unit and the initial video feature extraction unit based on the first loss function;
and taking a target audio feature extraction unit obtained through the structural parameter iteration as the audio entity feature extraction unit.
3. The speech recognition method of claim 2, wherein a positive example sample pair in the audio-video training samples is determined based on the following steps:
acquiring a video segment of a preset time period in the audio-video data, wherein the duration of the preset time period is a preset duration;
acquiring, from the audio-video data, a preset number of audio segments of the preset duration including the preset time period;
determining the video segment and the preset number of audio segments as one positive example sample pair.
4. The speech recognition method according to claim 3, wherein the acquiring, from the audio-video data, of a preset number of audio segments of the preset duration including the preset time period comprises:
determining the middle moment of the video segment;
selecting the preset number of audio segments with the middle moment as the center and a specified duration as the interval;
wherein the specified duration is less than or equal to the preset duration.
5. The speech recognition method of claim 2, wherein the first loss function comprises a multiple-instance-learning noise contrastive estimation (MIL-NCE) loss function.
6. The speech recognition method of claim 2, wherein the performing feature extraction on the audio data sample based on the initial audio feature extraction unit to obtain the audio sample features comprises:
extracting Fbank features from the audio data sample;
and inputting the Fbank features into the initial audio feature extraction unit to obtain the audio sample features output by the initial audio feature extraction unit.
7. The speech recognition method according to any one of claims 1-6, wherein the speech recognition model is determined based on the steps of:
inputting the audio training samples separately into an initial acoustic feature extraction unit and the audio entity feature extraction unit to obtain sample acoustic features output by the initial acoustic feature extraction unit and sample entity features output by the audio entity feature extraction unit;
inputting the sample acoustic features and the sample entity features into an initial splicing unit to obtain a sample splicing result output by the initial splicing unit;
inputting the sample splicing result into an initial speech recognition unit to obtain a sample recognition result output by the initial speech recognition unit;
calculating a second loss function based on the sample recognition result and the text labels, and synchronously iterating the structural parameters of the initial acoustic feature extraction unit, the initial splicing unit and the initial speech recognition unit based on the second loss function to obtain a target acoustic feature extraction unit, a target splicing unit and a target speech recognition unit;
and determining the speech recognition model based on the target acoustic feature extraction unit, the target splicing unit, the target speech recognition unit and the audio entity feature extraction unit.
8. A speech recognition apparatus, comprising:
a data acquisition module, configured to acquire audio data to be processed;
a feature extraction module, configured to input the audio data separately into an acoustic feature extraction unit and an audio entity feature extraction unit of a speech recognition model to obtain target acoustic features output by the acoustic feature extraction unit and target entity features output by the audio entity feature extraction unit;
a feature splicing module, configured to input the target acoustic features and the target entity features into a splicing unit of the speech recognition model to obtain a splicing result output by the splicing unit;
a speech recognition module, configured to input the splicing result into a speech recognition unit of the speech recognition model to obtain a speech recognition result output by the speech recognition unit;
wherein the speech recognition model is obtained by training, on the basis of the audio entity feature extraction unit, with audio training samples carrying text labels.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any of claims 1-7 when executing the program.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech recognition method according to any one of claims 1-7.
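For claims 2 to 6, the following is a hedged sketch, in PyTorch-style Python, of how the pretraining of the audio entity feature extraction unit could look: one video segment is paired with several surrounding audio segments as a positive example pair (claims 3 and 4), Fbank features are assumed to have been extracted for the audio branch (claim 6), and a multiple-instance-learning noise contrastive estimation (MIL-NCE style) loss serves as the first loss function (claim 5) that updates both extraction units together (claim 2). All function names, shapes and hyperparameters are illustrative assumptions, and edge handling (clip boundaries, padding) is omitted.

import torch
import torch.nn.functional as F

def sample_positive_pair(video, audio, t0, clip_len, step, num_audio):
    """Illustrative positive-pair sampling (claims 3-4): one video segment of
    length clip_len starting at t0, plus num_audio audio segments of the same
    length whose centres are spaced step apart around the video segment's
    middle moment (step <= clip_len)."""
    video_segment = video[t0:t0 + clip_len]
    mid = t0 + clip_len // 2
    offsets = [(i - num_audio // 2) * step for i in range(num_audio)]
    audio_segments = [audio[max(0, mid + off - clip_len // 2):mid + off + clip_len // 2]
                      for off in offsets]
    return video_segment, audio_segments

def mil_nce_loss(video_emb, audio_emb, num_audio):
    """MIL-NCE style first loss (claim 5): the num_audio audio segments sampled
    around each video segment form a bag of positives; all other audio segments
    in the batch act as negatives. Embeddings are assumed L2-normalised."""
    sim = video_emb @ audio_emb.t()                    # (B, B * num_audio) similarities
    B = video_emb.size(0)
    pos_mask = torch.zeros_like(sim, dtype=torch.bool)
    for i in range(B):
        pos_mask[i, i * num_audio:(i + 1) * num_audio] = True
    pos = torch.logsumexp(sim.masked_fill(~pos_mask, float("-inf")), dim=1)
    total = torch.logsumexp(sim, dim=1)
    return (total - pos).mean()

def pretrain_step(audio_encoder, video_encoder, optimizer, fbank_batch, video_batch, num_audio):
    """One joint update of both extraction units (claim 2). Both encoders are
    assumed to pool over time and return one embedding per segment.
    fbank_batch: (B * num_audio, frames, n_mels) Fbank features of the audio
    segments (claim 6); video_batch: (B, ...) the paired video segments."""
    audio_emb = F.normalize(audio_encoder(fbank_batch), dim=-1)   # audio sample features
    video_emb = F.normalize(video_encoder(video_batch), dim=-1)   # video sample features
    loss = mil_nce_loss(video_emb, audio_emb, num_audio)          # first loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

After this pretraining converges, the audio encoder (the target audio feature extraction unit) would be kept and used as the audio entity feature extraction unit, while the video encoder is no longer needed at recognition time.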
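For claim 7, a similarly hedged sketch of the second training stage follows: the pretrained audio entity feature extraction unit keeps its parameters (one reading of iterating only the other three units), while the acoustic feature extraction unit, the splicing unit and the speech recognition unit are updated against the text labels. The loss is written as a generic criterion because the claims only specify a "second loss function"; model refers to an instance like the SpeechRecognitionModel sketch above, and all names remain illustrative assumptions.

def finetune(model, train_loader, criterion, optimizer, epochs=1):
    """Illustrative second-stage loop (claim 7). The optimizer is assumed to be
    built over the trainable units only (acoustic feature extraction unit,
    splicing unit, speech recognition unit)."""
    # Reuse the pretrained audio entity feature extraction unit as-is.
    for p in model.entity_extractor.parameters():
        p.requires_grad = False

    for _ in range(epochs):
        for fbank_batch, label_batch in train_loader:
            logits = model(fbank_batch)               # sample recognition result
            loss = criterion(logits, label_batch)     # second loss function vs. text labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The resulting acoustic feature extraction unit, splicing unit and speech recognition unit, together with the unchanged audio entity feature extraction unit, then form the speech recognition model used in the method of claim 1.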
CN202211726658.1A 2022-12-29 2022-12-29 Voice recognition method and device, electronic equipment and storage medium Pending CN115985302A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211726658.1A CN115985302A (en) 2022-12-29 2022-12-29 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211726658.1A CN115985302A (en) 2022-12-29 2022-12-29 Voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115985302A true CN115985302A (en) 2023-04-18

Family

ID=85957681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211726658.1A Pending CN115985302A (en) 2022-12-29 2022-12-29 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115985302A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination