CN117649846B

CN117649846B - Speech recognition model generation method, speech recognition method, device and medium

Info

Publication number: CN117649846B
Application number: CN202410119020.4A
Authority: CN
Inventors: 徐银海; 刘益帆; 丁丹; 赵明洲
Original assignee: Beijing Ancsonic Technology Co ltd
Current assignee: Beijing Ancsonic Technology Co ltd
Priority date: 2024-01-29
Filing date: 2024-01-29
Publication date: 2024-04-30
Anticipated expiration: 2044-01-29
Also published as: CN117649846A

Abstract

The embodiment of the disclosure discloses a voice recognition model generation method, a voice recognition method, equipment and a medium. One embodiment of the method comprises the following steps: the sample audio information is encoded through the initial audio encoding sub-model, so that audio encoding information is obtained; performing feature extraction processing on each piece of key text information through the initial key text sub-model to obtain at least one piece of key text feature information; carrying out fusion decoding processing on the audio coding information and at least one key text characteristic information through an initial fusion decoding sub-model to obtain text information; determining whether the initial model is trained according to the sample text information and the text information; in response to determining that the initial model training is complete, the initial model is determined to be a speech recognition model. The voice recognition model obtained by the voice recognition model generation method of some embodiments of the present disclosure can improve the recognition accuracy of specific vocabularies, thereby improving the accuracy and recall rate of voice recognition and improving the recognition effect.

Description

Speech recognition model generation method, speech recognition method, device and medium

Technical Field

The embodiment of the disclosure relates to the technical field of voice recognition, in particular to a voice recognition model generation method, a voice recognition method, equipment and a medium.

Background

Speech recognition technology transcribes speech signals into corresponding text or commands by a computer. With the development of the fields of pattern recognition and natural language understanding, the research and application fields of the speech recognition and speech evaluation technology are becoming wider and wider. At present, the speech recognition method is generally adopted as follows: a Conformer speech recognition model of an encoding-decoding (encoder-decoder) network architecture is adopted, and the waveform of a speech signal is taken as an input, and the speech signal is converted into corresponding text or command through a series of deep learning layers.

However, when the above-described voice recognition method is adopted, there are often the following technical problems:

conformer the speech recognition model relies on the distribution of the training data set, and for some long-tail distributed hot words (such as name, place name, professional term, etc.), the recognition accuracy or recall rate is low, and the recognition effect is poor.

The above information disclosed in this background section is only for enhancement of understanding of the background of the inventive concept and, therefore, may contain information that does not form the prior art that is already known to those of ordinary skill in the art in this country.

Disclosure of Invention

The disclosure is in part intended to introduce concepts in a simplified form that are further described below in the detailed description. The disclosure is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Some embodiments of the present disclosure propose a speech recognition model generation method, a speech recognition apparatus, and a computer-readable medium to solve the technical problems mentioned in the background section above.

In a first aspect, some embodiments of the present disclosure provide a method for generating a speech recognition model, the method comprising: the sample audio information is encoded through an initial audio encoding sub-model included in the initial model, so that audio encoding information is obtained; performing feature extraction processing on each key text information in at least one key text information through an initial key text sub-model included in the initial model to obtain at least one key text feature information, wherein the key text information is generated according to sample text information corresponding to sample audio information; performing fusion decoding processing on the audio coding information and at least one key text characteristic information through an initial fusion decoding sub-model included in the initial model to obtain text information; determining whether the initial model is trained according to sample text information corresponding to the sample audio information and the obtained text information; in response to determining that the initial model training is complete, the initial model is determined to be a speech recognition model.

In a second aspect, some embodiments of the present disclosure provide a method of speech recognition, the method comprising: acquiring audio information to be identified; generating recognition text information corresponding to the audio information to be recognized according to the audio information to be recognized, the keyword information and the voice recognition model, wherein the voice recognition model is generated according to any implementation manner of the first aspect.

In a third aspect, some embodiments of the present disclosure provide a speech recognition apparatus comprising: one or more processors; the storage device is stored with one or more programs, and the audio acquisition equipment is used for acquiring the audio information to be identified; the display device is used for displaying the identification text information corresponding to the audio information to be identified; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect described above.

In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect above.

The above embodiments of the present disclosure have the following advantageous effects: the voice recognition model obtained by the voice recognition model generation method of some embodiments of the present disclosure can improve recognition accuracy and recall rate, thereby improving recognition effect. Specifically, the reason for the poor recognition effect is that: conformer the speech recognition model relies on the distribution of the training dataset, and for some hotwords (e.g., name, place name, term of art, etc.) that are long-tailed, recognition accuracy or recall is low. Based on this, in the speech recognition model generation method according to some embodiments of the present disclosure, first, the sample audio information is encoded by an initial audio encoding sub-model included in the initial model, so as to obtain audio encoded information. Thus, the original speech signal can be converted into a digital representation so that it can be processed and analyzed by a computer. And secondly, performing feature extraction processing on each key text information in at least one key text information through an initial key text sub-model included in the initial model to obtain at least one key text feature information, wherein the key text information is generated according to sample text information corresponding to sample audio information. Thus, feature vectors of keywords (e.g., name of person, place name, term of art, etc.) in the sample text information can be extracted for subsequent fusion. And then, carrying out fusion decoding processing on the audio coding information and at least one key text characteristic information through an initial fusion decoding sub-model included in the initial model to obtain text information, thereby adding key word characteristics besides the audio characteristics when determining the text information, and further improving the recognition accuracy of specific words. And then, determining whether the initial model is trained according to the sample text information corresponding to the sample audio information and the obtained text information. Thus, the obtained text information and the sample text information can be compared to determine whether the initial model is trained. Finally, in response to determining that the initial model training is complete, the initial model is determined to be a speech recognition model. Therefore, the voice recognition model obtained by the voice recognition model generation method of some embodiments of the present disclosure fuses the audio feature and the keyword feature corresponding to the sample text information during training, so that the recognition accuracy of the specific vocabulary can be improved, the accuracy and recall rate of voice recognition can be further improved, and the recognition effect can be improved.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.

FIG. 1 is a schematic illustration of one application scenario of some embodiments of the speech recognition model generation method of the present disclosure;

FIG. 2 is a flow chart of some embodiments of a speech recognition model generation method according to the present disclosure;

FIG. 3 is a flow chart of some embodiments of a speech recognition method according to the present disclosure;

Fig. 4 is a schematic diagram of a voice recognition device suitable for use in implementing some embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.

It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

Operations such as collection, storage, use, etc. of personal information (e.g., audio information to be identified) of a user involved in the present disclosure, and before performing the corresponding operations, the relevant organization or individual is up to the end to include developing personal information security impact assessment, fulfilling informed obligations to the personal information body, soliciting authorized consent of the personal information body in advance, etc.

The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 is a schematic diagram of one application scenario of a speech recognition model generation method of some embodiments of the present disclosure.

In the application scenario of fig. 1, first, the computing device 101 may input the sample audio information 103 into an initial audio coding sub-model 1021 included in the initial model 102, resulting in audio coding information. The computing device 101 may then input each of the at least one key text message 104 into an initial key text sub-model 1022 included in the initial model 102, resulting in at least one key text feature information. Wherein the key text information is generated from the sample audio information. For example, the at least one key text message 104 may be "Zhang san", luancheng ". Thereafter, the computing device 101 may input the audio encoding information and the at least one key text feature information into an initial fusion decoding sub-model 1023 included in the initial model 102, resulting in text information 105. For example, the text message 105 may be "Zhang San to sit on a train to Luancheng" today. The computing device 101 may then determine whether the initial model 102 is trained based on the sample text information corresponding to the sample audio information and the resulting text information 105. Finally, the computing device 101 may determine the initial model 102 as a speech recognition model in response to determining that the initial model 102 training is complete.

The computing device 101 may be hardware or software. When the computing device is hardware, the computing device may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or may be implemented as a single server or a single terminal device. When the computing device is embodied as software, it may be installed in the hardware devices listed above. It may be implemented as a plurality of software or software modules, for example, for providing distributed services, or as a single software or software module. The present invention is not particularly limited herein.

It should be understood that the number of computing devices in fig. 1 is merely illustrative. There may be any number of computing devices, as desired for an implementation.

With continued reference to fig. 2, a flow 200 of some embodiments of a speech recognition model generation method according to the present disclosure is shown. The voice recognition model generation method comprises the following steps:

And step 201, performing coding processing on the sample audio information through an initial audio coding sub-model included in the initial model to obtain audio coding information.

In some embodiments, an execution subject of the speech recognition model generation method (e.g., the computing device shown in fig. 1) may encode the sample audio information by an initial audio encoding sub-model included in the initial model to obtain audio encoded information. The sample audio information may be chinese audio for training a speech recognition model. As an example, the sample audio information can be audio with the pronunciation "zh ā ng s ā n j ī n ti ā n zu co hu ǒ ch q lu n ch ng le". Furthermore, in order to improve the speech recognition model obtained by training, the sample audio information may be audio recorded by people with different sexes and different regions. The initial audio coding sub-model may be a coding model that takes audio information as input and audio coding information as output. For example, the initial audio coding sub-model may be a perceptual coding model or a waveform coding model. In practice, the executing body may input the sample audio information into an initial audio coding sub-model included in the initial model, to obtain audio coding information.

And 202, performing feature extraction processing on each piece of key text information in at least one piece of key text information through an initial key text sub-model included in the initial model to obtain at least one piece of key text feature information.

In some embodiments, the executing body may perform feature extraction processing on each of the at least one piece of key text information through an initial key text sub-model included in the initial model, so as to obtain at least one piece of key text feature information. The key text information may be generated according to sample text information corresponding to the sample audio information. The key text information may be an irregular field that is truncated from the sample text information corresponding to the sample audio information. For example, the key text information may be text representing a name of a person, a name of a place, or a proper academic noun. As an example, the sample text information may be "Zhang San sitting on a train to Luancheng" today. The key text information may be "Zhang san", luancheng ". The key text information may also be selected randomly from consecutive fields in the sample text information according to a preset selection window. The size of the preset section window can be an integer between 2 and 6, and the preset section window is used for simulating keyword scenes with different lengths. The initial key text sub-model may be a feature extraction model that takes key text information as input and key text feature information as output. For example, the initial key text sub-model may be a convolutional neural network model. In practice, the execution body may input each of the at least one piece of key text information into the initial key text sub-model to obtain at least one piece of key text feature information. Therefore, word vectors of each keyword can be obtained through the keyword text sub-model, the characterization capability of the voice recognition model on the keywords can be enhanced, and the recognition effect of the voice recognition model on the keywords can be improved. Meanwhile, different keywords are enhanced by independently configuring different word vectors, so that the calculation cost of the whole audio repeated training with the keywords can be reduced.

In some alternative implementations of some embodiments, the key text information is generated by:

And firstly, performing part-of-speech word segmentation on the sample text information to obtain word sets. The words in the word set can be words after the stop words are removed from the text in the sample text information. Each word in the word set may be a noun, a preposition, a verb, or an adjective. In practice, the execution subject can perform part-of-speech word segmentation on the sample text information through a word segmentation algorithm based on rules to obtain word sets. The rule-based word segmentation algorithm described above may include, but is not limited to, at least one of: maximum matching method, minimum matching method, and bidirectional matching method.

And secondly, dividing the word set according to the part of speech corresponding to each word in the word set to obtain at least one word set. Wherein each word group in the at least one word group corresponds to the same part of speech. In practice, the execution subject may divide each word corresponding to the same part of speech into a group.

And thirdly, determining the word group meeting the preset part-of-speech condition in the at least one word group as a target word group to obtain at least one target word group. The preset part-of-speech condition may be that the word corresponding to the word group is a noun, a verb or an adjective.

And step four, for each target word group in the at least one target word group, inputting each target word in the target word group into a pre-trained key text generation model to obtain key text information corresponding to the target word group. The pre-trained keyword generation model may be a classification model with each target word as input and keyword information as output.

In some alternative implementations of some embodiments, the key text generation model is generated by the training steps of:

First, a sample is obtained. The sample comprises sample target word groups and sample key text information corresponding to the sample target words. The sample target word group can be preset words used for training a key text generation model. As an example, the sample target word group may include at least one of: place name, person name, and academic nouns. The sample key text information may be the place name, the person name and the academic noun in the sample target word group. The sample key text information may include a sample place name key text set, a sample academic noun key text set and a sample person name key text set.

And secondly, inputting the sample target word group into a place name key text generation submodel included in the model to be trained to obtain a place name key text set and a non-place name key text set. The place name key text generation sub-model can be a first classification model used for screening place names in sample target word groups.

And thirdly, inputting the non-place name key text set into an academic noun key text generation sub-model included in the model to be trained, and obtaining an academic noun key text set and a non-academic noun key text set. The academic noun key text generating sub-model can be a second classification model for screening academic nouns in the sample target word group.

And fourthly, inputting the non-academic noun key text set into a name key text generation sub-model included in the model to be trained, and obtaining a name key text set and a non-name key text set. The name key text generation sub-model may be a third classification model for screening names in the sample target word group.

In this embodiment, the first classification model, the second classification model, and the third classification model may use the same classification model. The first classification model, the second classification model, and the third classification model may be different classification models. The classification model may be, but is not limited to, at least one of: support Vector Machines (SVMs), decision trees, random forests, gradient boosting tree classification models.

And fifthly, inputting the place name key text set, the academic noun key text set and the name key text set into an initial output sub-model included in the model to be trained, and obtaining key text information. The initial output sub-model may be a convolutional network for stitching a place name key text set, an academic noun key text set, and a person name key text set.

And sixthly, determining a sample place name key text set and place name loss values of the obtained place name key text set. In practice, the execution body may input a specified loss function (loss function) with the sample place name keyword text set and the obtained place name keyword text set as parameters, to obtain a place name loss value.

In this embodiment, the loss function is typically used to measure the degree of inconsistency between a predicted value (e.g., a set of place name key texts) and a true value (e.g., a set of sample place name key texts) of the model. It is a non-negative real-valued function. In general, the smaller the loss function, the better the robustness of the model. The loss function can be set according to actual requirements.

As an example, the above-described loss function may be, but is not limited to, at least one of: cross entropy loss function, least squares function, or log loss.

Seventh, determining a sample academic noun key text set and an academic noun loss value of the obtained academic noun key text set. In practice, the executing body may input a loss function with the sample academic noun key text set and the obtained academic noun key text set as parameters, to obtain an academic noun loss value.

And eighth step, determining a sample name key text set and name loss values of the obtained name key text set. In practice, the executing body may input a loss function with the sample name keyword text set and the obtained name keyword text set as parameters, to obtain a name loss value.

And ninth, weighting the place name loss value, the academic noun loss value and the person name loss value according to the preset place name weight, the preset academic noun weight and the preset person name weight to obtain a total loss value corresponding to the key text information. The preset place name weight may be a weight of place name loss value. The predetermined academic noun weight may be a weight of an academic noun loss value. The preset name weight may be a weight of a name loss value. In practice, the execution subject may determine, as the total loss value corresponding to the key text information, a sum of a product of the place name loss value and a preset place name weight, a sum of a product of the academic noun loss value and a preset academic noun weight, and a product of the person name loss value and a preset person name weight.

In this embodiment, the preset place name weight, the preset academic noun weight, and the preset person name weight may be set according to actual situations. While target values may generally be used to represent the ideal case of a degree of disagreement between predicted values (i.e., place name key text set, academic noun key text set, and person name key text set) and true values (i.e., sample place name key text set, sample skill noun key text set, and sample person name key text set). That is, when the total loss value reaches the target value, the predicted value can be considered to be close to or approximate to the true value. The target value may be set according to actual requirements.

In some alternative implementations of the present embodiment, the preset place name weight, the preset academic noun weight, and the preset person name weight may be a fixed weight value, respectively. Moreover, because the recognition accuracy of the name is low, the preset name weight can be preset to be relatively large, such as 0.5. Meanwhile, the preset place name weight and the preset academic noun weight may be preset to be relatively small, such as 0.2 and 0.3.

In other alternative implementations of this embodiment, in order to improve the accuracy of the detection result, the execution subject may dynamically adjust the preset place name weight, the preset academic noun weight, and the preset person name weight according to different sample target word groups. That is, the preset place name weight, the preset academic noun weight, and the preset person name weight may be non-fixed weight values. As an example, the preset place name weight, the preset academic noun weight, and the preset person name weight may be obtained by:

First, the execution body may determine a number ratio of the number of place name key texts in the place name key text set, the number of academic noun key texts in the academic noun key text set, and the number of person name key texts in the person name key text set.

Then, the execution subject may determine a preset place name weight, a preset academic noun weight, and a preset person name weight according to the number ratio. For example, the ratio of the number of place name key texts, the number of academic noun key texts and the number of person name key texts is 2:1:1. the preset place name weight, the preset academic noun weight and the preset person name weight may be 0.5 respectively: 0.25:0.25.

And tenth, determining whether the model to be trained is trained according to the total loss value. In practice, the execution subject may determine that the initial model training is completed when the total loss value reaches the target value. In the event that the total loss value does not reach the target value, it is determined that the initial model is not trained.

And eleventh, adjusting relevant parameters in the model to be trained in response to determining that the model to be trained is not trained. In practice, in response to determining that the model to be trained is not trained to completion, relevant parameters in the initial model may be adjusted. For example, using back propagation techniques to modify the weights in the various convolutional layers in the initial model. And may return to the first step to re-select samples from the sample set. So that the training steps described above can be continued.

And twelfth, determining the model to be trained as a key text generation model in response to determining that the training of the model to be trained is completed.

And 203, performing fusion decoding processing on the audio coding information and the at least one key text characteristic information through an initial fusion decoding sub-model included in the initial model to obtain text information.

In some embodiments, the executing body may perform fusion decoding processing on the audio coding information and the at least one key text feature information through an initial fusion decoding sub-model included in the initial model to obtain the text information. As an example, the execution body may decode the audio encoding information and the feature fusion information corresponding to the at least one key text feature information by using a decoder to obtain the text information. For example, the decoder may be a dynamic decoder.

In some alternative implementations of some embodiments, the initial fusion decoding sub-model described above may include a feature fusion layer, a combination layer, and a decoding layer.

In some optional implementations of some embodiments, the executing body may perform fusion decoding processing on the audio coding information and the at least one key text feature information through an initial fusion decoding sub-model included in the initial model by executing the following steps to obtain the text information:

The first step, the audio coding information and at least one key text feature information are subjected to feature fusion through a feature fusion layer, so that fusion feature information is obtained. In practice, the execution subject may perform feature fusion on each key text feature information in the audio coding information and the at least one key text feature information through the feature fusion network, so as to obtain fused feature information. As an example, the feature fusion network may be a DenseNet convolutional network or a Resnet network.

And secondly, combining the fusion characteristic information and the audio coding information through a combination layer to obtain the combined characteristic information. In practice, the execution subject may splice the fusion feature information and the audio coding information to obtain the combined feature information.

And thirdly, decoding the combined characteristic information through a decoding layer to obtain text information corresponding to the sample audio information. In practice, the execution main body can decode the combined characteristic information through a decoder to obtain text information corresponding to the sample audio information.

Therefore, during decoding, the fusion characteristic information is processed to improve the recognition effect of the keywords, and the decoding of the original audio coding information is reserved, so that the general recognition performance of the model can be improved.

In some optional implementations of some embodiments, the executing body may encode the sample audio information through an initial audio encoding sub-model included in the initial model by executing the following steps to obtain audio encoded information:

The first step, the sample audio information is preprocessed through an input layer included in the initial audio coding sub-model, and preprocessed audio information is obtained. In practice, the execution subject may preprocess the sample audio information by a preprocessing algorithm to obtain preprocessed audio information. Wherein the preprocessing algorithm can include, but is not limited to, at least one of the following: noise reduction algorithm and filtering algorithm.

And secondly, framing the preprocessed audio information through a framing layer included in the initial audio coding sub-model to obtain an audio frame set. In practice, the executing body may divide the preprocessed audio information into a series of short periods according to a certain time interval, so as to obtain an audio frame set.

And thirdly, carrying out normalization processing on each audio frame in the audio frame set through a normalization layer included in the initial audio coding sub-model to obtain a normalized audio frame set. In practice, the executing body may normalize each audio frame in the audio frame set by using a short-time fourier transform and mel-frequency cepstrum coefficient, so as to obtain a normalized audio frame set.

And fourthly, carrying out endpoint segmentation on the normalized audio frame set through an endpoint segmentation layer included in the initial audio coding sub-model to obtain an audio fragment set. In practice, first, the execution body may perform spectral entropy calculation on each normalized audio frame to obtain a spectral entropy value of each normalized audio frame. Then, the executing body may determine a start normalized audio frame and an end normalized audio frame according to a variation trend of the spectral entropy value. As an example, in response to determining that the spectral entropy value is greater than or equal to a preset spectral entropy threshold, a normalized audio frame corresponding to the spectral entropy value is determined as the starting normalized audio frame. And in response to determining that the spectral entropy value is smaller than the preset spectral entropy threshold, determining the normalized audio frame corresponding to the spectral entropy value as the normalized audio frame. And finally, dividing the normalized audio frame set according to the determined initial normalized audio frame and the normalized audio frame to obtain an audio fragment set.

And fifthly, coding the audio fragment set through a coding layer included in the initial audio coding sub-model to obtain audio coding information, wherein the audio coding information comprises at least one audio fragment coding information. In practice, the executing body may perform encoding processing on each audio clip in the audio clip set by using an encoder, so as to obtain at least one audio clip encoding information as the audio encoding information. Wherein the audio coding information comprises at least one audio clip coding information.

Continuing, when the speech recognition model generated by the speech recognition model generating method disclosed by the disclosure is adopted, the following technical problem II further exists:

different audio frequency is adapted to different coding modes, and different audio frequency is coded by adopting the same coding mode, so that the quality and naturalness of part of audio frequency coding information are low, and the subsequent speech recognition effect is poor.

In some optional implementations of some embodiments, the executing body may perform encoding processing on the audio clip set through an encoding layer included in the initial audio encoding sub-model by executing the following steps to obtain audio encoding information:

first, for each audio clip in the audio clip set, performing the steps of:

A first substep, determining whether the sampling rate corresponding to the audio segment meets a preset sampling rate condition. The preset sampling rate condition may be that a sampling rate corresponding to the audio segment is greater than or equal to a preset sampling rate threshold. The preset sampling rate threshold may be a preset threshold above which waveform encoding is suitable.

And a second sub-step, in response to determining that the sampling rate corresponding to the audio segment meets the preset sampling rate condition, performing waveform encoding processing on the audio segment to obtain waveform encoding information. In practice, the executing body may perform waveform encoding processing on the audio segment by using various waveform encoding algorithms to obtain waveform encoding information. For example, the waveform encoding algorithm may be a pulse code modulation algorithm or an adaptive delta modulation algorithm.

And a third sub-step, performing vocoder processing on the waveform coding information to obtain audio segment coding information. In practice, the executing body may perform vocoder processing on the waveform encoding information by using various vocoder algorithms to obtain audio segment encoding information. For example, the vocoder algorithm may be a linear predictive coding algorithm or a regular pulse excitation coding algorithm. Therefore, when the sampling rate is higher, the audio frequency fragment can be encoded through the waveform encoding, and then the waveform encoding information is further processed and compressed through the vocoder, so that the high fidelity of the waveform encoding and the compression capability of the vocoder can be utilized to realize higher-quality voice recognition.

And a fourth sub-step, in response to determining that the sampling rate corresponding to the audio segment does not meet the preset sampling rate condition, performing parameter extraction processing on the audio segment to obtain parameter characteristic information. In practice, the execution main body may perform parameter extraction processing on the audio segment through a feature parameter extraction algorithm to obtain parameter feature information. For example, the characteristic parameter extraction algorithm may be a mel-frequency spectrum algorithm.

And fifth, carrying out parameter coding processing on the parameter characteristic information to obtain characteristic parameter coding information. In practice, the execution subject may perform encoding processing on the parameter feature information through a parameter encoding algorithm to obtain feature parameter encoding information. For example, the above-described parametric coding may be a differential coding algorithm.

And a sixth sub-step, performing waveform coding processing on the characteristic parameter coding information to obtain audio fragment coding information. In practice, the executing body may perform waveform encoding processing on the characteristic parameter encoding information through various waveform encoding algorithms to obtain audio segment encoding information. Therefore, when the sampling rate is low, firstly, the parameter coding technology is used for carrying out feature extraction and coding on the input voice signal so as to remove redundant information and irrelevant information in the signal, and the feature parameters such as short-time amplitude, frequency and phase of the voice signal are extracted. And then further encoding and compressing the residual signals through waveform encoding, so that the data volume is reduced, the compression efficiency is improved, and meanwhile, the quality and the naturalness of the signals are maintained, and the naturalness and the intelligibility of the signals can be maintained. Providing a higher compression ratio and better speech quality.

And secondly, determining the obtained audio fragment coding information as audio coding information.

The first step and the second step are taken as an invention point of the embodiment of the disclosure, so that the technical problem that different audios are adapted to different coding modes, and the same coding mode is adopted to code different audios, so that the quality and naturalness of part of audio coding information are lower, and the subsequent speech recognition effect is poor is solved. The problem that results in poor effect of subsequent speech recognition is that: different audios are adapted to different coding modes, and different audios are coded by adopting the same coding mode, so that the treatment and naturalness of part of audio coding information are low. If the above factors are solved, the effect of improving the subsequent voice recognition can be achieved. In order to achieve the effect, when the sampling rate is higher, the audio can be encoded through the waveform encoding, and then the waveform encoding information is further processed and compressed through the vocoder, so that the high fidelity of the waveform encoding and the compression capability of the vocoder can be utilized to realize higher-quality voice recognition. When the sampling rate is low, firstly, the parameter coding technology is used for carrying out feature extraction and coding on the input voice signal so as to remove redundant information and irrelevant information in the signal, and the feature parameters such as short-time amplitude, frequency and phase of the voice signal are extracted. And then further encoding and compressing the residual signals through waveform encoding, so that the data volume is reduced, the compression efficiency is improved, and meanwhile, the quality and the naturalness of the signals are maintained, and the naturalness and the intelligibility of the signals can be maintained. Providing a higher compression ratio and better speech quality.

In some optional implementations of some embodiments, the executing body may perform feature fusion on the audio coding information and the at least one key text feature information through a feature fusion layer by executing the following steps to obtain fused feature information:

First, for each piece of audio piece encoding information of at least one piece of audio piece encoding information included in the audio encoding information, the following steps are performed:

And a first sub-step of determining the similarity of the audio segment coding information and each key text feature information in the at least one key text feature information to obtain a similarity set. In practice, the executing body may determine, through a similarity algorithm, a similarity between the audio segment encoding information and each of the at least one piece of key text feature information, so as to obtain a similarity set. Wherein, the similarity algorithm can be at least one of the following: cosine similarity algorithm, euclidean distance algorithm, weighted cosine similarity algorithm and Jacquard similarity coefficient.

And a second sub-step of determining the similarity satisfying the preset similarity condition in the similarity set as the target similarity. The preset similarity condition may be a similarity with the highest similarity in the similarity set.

And a third sub-step of determining a weighting coefficient of the audio fragment coding information and a weighting coefficient of the key text feature information corresponding to the target similarity according to the target similarity. As an example, in response to the target similarity being 0.9, the weighting coefficient of the key text feature information corresponding to the target similarity may be 0.9. The weighting factor of the audio clip encoding information may be 0.1. Therefore, for the audio fragment coding information and the key text feature information with higher similarity, the weighting coefficient corresponding to the key text feature information is larger, and when the similarity of the audio fragment coding information and the key text feature information is closest, the key text feature information can replace the audio fragment coding information.

And a fourth sub-step, according to the weighting coefficient of the audio fragment coding information and the weighting coefficient of the key text characteristic information corresponding to the target similarity, carrying out weighting processing on the audio fragment coding information and the key text characteristic information corresponding to the target similarity to obtain fragment fusion characteristic information. In practice, the execution subject may determine the sum of the product of the audio clip encoding information and the corresponding weighting coefficient and the product of the key text feature information and the corresponding weighting coefficient as the clip fusion feature information.

And secondly, splicing the obtained fusion characteristic information of each segment to obtain the fusion characteristic information.

Therefore, the audio fragment coding information corresponding to the key text characteristic information can be screened out from the audio coding information, and the key text characteristic information can be replaced in a targeted manner, so that the recognition accuracy of the key words is improved, and the influence on other audio fragments is reduced.

Step 204, determining whether the initial model is trained according to the sample text information corresponding to the sample audio information and the obtained text information.

In some embodiments, the executing entity may determine whether the initial model is trained based on the sample text information corresponding to the sample audio information and the obtained text information. In practice, first, the execution body may input the sample text information corresponding to the sample audio information and the obtained text information into a loss function, to obtain a text loss value. Then, in response to determining that the text loss value is less than the preset text loss value, determining that the initial model training is complete. And in response to determining that the text loss value is greater than or equal to the preset text loss value, determining that the initial model is not trained.

In response to determining that the initial model training is complete, the initial model is determined to be a speech recognition model, step 205.

In some embodiments, the executing entity may determine the initial model as a speech recognition model in response to determining that the initial model training is complete.

Further, in response to determining that the initial model is not trained, the executing body may adjust parameters in the initial model, and perform training again by using the initial model after the parameters are adjusted corresponding to the unused samples.

With continued reference to fig. 3, a flow 300 of some embodiments of a speech recognition method according to the present disclosure is shown. The voice recognition method comprises the following steps:

step 301, obtaining audio information to be identified.

In some embodiments, an executing body of the speech recognition method (e.g., a computing device) may obtain the audio information to be recognized. The audio information to be identified may be a piece of audio that needs to be converted into text. As an example, the audio information to be recognized may be a speech of a speaker at a conference. In practice, the executing body can collect the audio information to be identified through the audio collecting device. The audio collection device may be a microphone assembly.

Step 302, generating recognition text information corresponding to the audio information to be recognized according to the audio information to be recognized, the keyword information and the voice recognition model.

In some embodiments, the executing body may generate the recognition text information corresponding to the audio information to be recognized according to the audio information to be recognized, the keyword information, and the speech recognition model. The speech recognition model may be generated according to the speech recognition model generation method. The keyword information may be a preset keyword.

Therefore, for the task quantity of transferring conference recordings, a speaker repeatedly refers to hot words distributed at equal tail parts such as certain names, place names or proprietary academic nouns, and key texts and audios are simultaneously input when a voice recognition model is used, and the voice recognition model can automatically and forcedly replace candidate words with maximum probability into the key texts in the process of generating the recognition texts, so that the recognition accuracy of specific words is improved, and the accuracy of the recognition texts is further improved.

In some optional implementations of some embodiments, according to the audio information to be recognized, the keyword information, and the speech recognition model, the executing entity may generate recognition text information corresponding to the audio information to be recognized by performing the following steps:

First, in response to detecting a keyword input operation, keyword information input by a user is acquired. The keyword input operation may be an operation of inputting a keyword in a specified keyword input box by a user.

And secondly, inputting the audio information to be identified and the keyword information input by the user into a voice identification model to obtain identification text information corresponding to the audio information to be identified.

As an example, the above-mentioned audio information to be recognized may be audio with the pronunciation "zh ā ng s ā n j ī n ti ā n zu co hu ǒ ch q lu n ch ng le". The keyword information may be "Zhang san", luancheng ". In converting text, "Zhang San" is the first order in the candidate word corresponding to the audio that sounds "zh ā ng s ā n" and is not generally identified as "Zhang Sanj". "Luancheng" is the first order in the candidate word corresponding to the audio that is pronounced as "lu n chang".

It should be noted that when using a speech recognition model, most of the text (including the names of famous characters and common names) is automatically recognized by the model, and only very individual keywords (e.g., homophones, remote names, and professional academic nouns) depend on user input.

Referring now to FIG. 4, a schematic diagram of a voice recognition device 400 suitable for use in implementing some embodiments of the present disclosure is shown. The voice recognition device in some embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, as well as stationary terminals such as digital TVs, desktop computers, and the like. The voice recognition device 400 may include a recording transcription device, for example, the voice recognition device may include an on-screen type recording pen. The terminal device shown in fig. 4 is only one example and should not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.

As shown in fig. 4, the voice recognition apparatus 400 may include a processing device (e.g., a central processor, a graphics processor, etc.) 401 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the speech recognition apparatus 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, accelerometer, gyroscope, etc.; output devices 407 including, for example, speakers, vibrators, etc.; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the speech recognition device 400 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows a speech recognition apparatus 400 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 4 may represent one device or a plurality of devices as needed. In some implementations, the input device 406 may include an audio acquisition apparatus. The audio collection device may include a microphone array and a housing for collecting audio information to be identified. The output means 407 may comprise a display device. The display device may be a Liquid Crystal Display (LCD) for displaying the identification text information corresponding to the audio information to be identified.

In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing device 401.

It should be noted that, the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some embodiments, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the voice recognition device; or may exist alone without being assembled into the speech recognition device. The computer-readable medium carries one or more programs that, when executed by the speech recognition device, cause the speech recognition device to: the sample audio information is encoded through an initial audio encoding sub-model included in the initial model, so that audio encoding information is obtained; performing feature extraction processing on each key text information in at least one key text information through an initial key text sub-model included in the initial model to obtain at least one key text feature information, wherein the key text information is generated according to sample text information corresponding to sample audio information; performing fusion decoding processing on the audio coding information and at least one key text characteristic information through an initial fusion decoding sub-model included in the initial model to obtain text information; determining whether the initial model is trained according to sample text information corresponding to the sample audio information and the obtained text information; in response to determining that the initial model training is complete, the initial model is determined to be a speech recognition model.

Computer program code for carrying out operations for some embodiments of the present disclosure may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in some embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, for example, described as: a processor includes an encoding unit, a feature extraction unit, a fusion decoding unit, a first determination unit, and a second determination unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, an encoding unit may also be described as "a unit that encodes sample audio information by an initial audio encoding sub-model included in an initial model".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

The foregoing description is only of the preferred embodiments of the present disclosure and description of the principles of the technology being employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, but encompasses other technical features formed by any combination of the above technical features or their equivalents without departing from the spirit of the invention. Such as the above-described features, are mutually substituted with (but not limited to) the features having similar functions disclosed in the embodiments of the present disclosure.

Claims

1. A speech recognition model generation method, comprising:

the sample audio information is encoded through an initial audio encoding sub-model included in the initial model, so that audio encoding information is obtained;

Performing feature extraction processing on each key text message in at least one key text message through an initial key text sub-model included in the initial model to obtain at least one key text feature message, wherein the key text message is generated according to sample text messages corresponding to sample audio messages, and the key text message is intercepted from the sample text messages corresponding to the sample audio messages;

performing fusion decoding processing on the audio coding information and at least one key text characteristic information through an initial fusion decoding sub-model included in the initial model to obtain text information;

determining whether the initial model is trained according to sample text information corresponding to the sample audio information and the obtained text information;

in response to determining that the initial model training is complete, the initial model is determined to be a speech recognition model.

2. The method of claim 1, wherein the key text information is generated by:

Performing part-of-speech word segmentation on the sample text information to obtain word sets;

Dividing the word set according to the part of speech corresponding to each word in the word set to obtain at least one word set;

Determining word groups meeting the preset part-of-speech conditions in the at least one word group as target word groups to obtain at least one target word group;

And inputting each target word in the target word groups into a pre-trained key text generation model for each target word group in the at least one target word group to obtain key text information corresponding to the target word groups, wherein the key text information comprises a key text set.

3. The method of claim 2, wherein the key text generation model is generated by the training steps of:

obtaining a sample, wherein the sample comprises a sample target word group and sample key text information corresponding to the sample target word, and the sample key text information comprises a sample place name key text set, a sample academic noun key text set and a sample name key text set;

Inputting a sample target word group into a place name key text generation submodel included in a model to be trained to obtain a place name key text set and a non-place name key text set;

Inputting the non-place name key text set into an academic noun key text generation sub-model included in the model to be trained, and obtaining an academic noun key text set and a non-academic noun key text set;

inputting the non-academic noun key text set into a name key text generation sub-model included in the model to be trained, and obtaining a name key text set and a non-name key text set;

Inputting the place name key text set, the academic noun key text set and the name key text set into an initial output submodel included in the model to be trained to obtain key text information;

Determining a sample place name key text set and place name loss values of the obtained place name key text set;

Determining a sample academic noun key text set and an academic noun loss value of the obtained academic noun key text set;

determining a sample name key text set and name loss values of the obtained name key text set;

Weighting the place name loss value, the academic noun loss value and the person name loss value according to the preset place name weight, the preset academic noun weight and the preset person name weight to obtain a total loss value corresponding to the key text information;

determining whether the model to be trained is trained according to the total loss value;

In response to determining that the model to be trained is not trained, adjusting relevant parameters in the model to be trained;

and determining the model to be trained as a key text generation model in response to determining that the training of the model to be trained is completed.

4. The method of claim 1, wherein the initial fusion decoding sub-model comprises a feature fusion layer, a combination layer, and a decoding layer; and

The method for performing fusion decoding processing on the audio coding information and the at least one key text feature information through the initial fusion decoding sub-model included in the initial model to obtain text information comprises the following steps:

Carrying out feature fusion on the audio coding information and at least one key text feature information through a feature fusion layer to obtain fusion feature information;

combining the fusion characteristic information and the audio coding information through a combination layer to obtain combined characteristic information;

And decoding the combined characteristic information through a decoding layer to obtain text information corresponding to the sample audio information.

5. The method according to claim 4, wherein the encoding the sample audio information by the initial audio encoding sub-model included in the initial model to obtain audio encoded information includes:

preprocessing sample audio information through an input layer included in the initial audio coding sub-model to obtain preprocessed audio information;

carrying out frame division processing on the preprocessed audio information through a frame division layer included in the initial audio coding sub-model to obtain an audio frame set;

carrying out normalization processing on each audio frame in the audio frame set through a normalization layer included in the initial audio coding sub-model to obtain a normalized audio frame set;

Performing endpoint segmentation on the normalized audio frame set through an endpoint segmentation layer included in the initial audio coding sub-model to obtain an audio fragment set;

and carrying out coding processing on the audio fragment set through a coding layer included in the initial audio coding sub-model to obtain audio coding information, wherein the audio coding information comprises at least one audio fragment coding information.

6. The method according to claim 5, wherein the feature fusion of the audio coding information and the at least one key text feature information by the feature fusion layer to obtain fused feature information comprises:

for each piece of audio coding information of at least one piece of audio coding information included in the audio coding information, performing the steps of:

Determining the similarity of the audio fragment coding information and each piece of key text feature information in the at least one piece of key text feature information to obtain a similarity set;

Determining the similarity meeting the preset similarity condition in the similarity set as target similarity;

According to the target similarity, determining a weighting coefficient of the audio fragment coding information and a weighting coefficient of key text characteristic information corresponding to the target similarity;

Weighting the audio fragment coding information and the key text feature information corresponding to the target similarity according to the weighting coefficient of the audio fragment coding information and the weighting coefficient of the key text feature information corresponding to the target similarity to obtain fragment fusion feature information;

And splicing the obtained fusion characteristic information of each segment to obtain the fusion characteristic information.

7. A method of speech recognition, comprising:

Acquiring audio information to be identified;

Generating recognition text information corresponding to the audio information to be recognized according to the audio information to be recognized, the keyword information and the voice recognition model, wherein the voice recognition model is generated according to the voice recognition model generation method of any one of claims 1-6.

8. The method of claim 7, wherein the generating the recognition text information corresponding to the audio information to be recognized according to the audio information to be recognized, the keyword information, and the speech recognition model includes:

In response to detecting a keyword input operation, acquiring keyword information input by a user;

And inputting the audio information to be identified and the keyword information input by the user into a voice identification model to obtain identification text information corresponding to the audio information to be identified.

9. A speech recognition device comprising:

one or more processors;

a storage device having one or more programs stored thereon;

the audio acquisition equipment is used for acquiring audio information to be identified;

the display device is used for displaying the identification text information corresponding to the audio information to be identified;

When executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-8.

10. The voice recognition device of claim 9, wherein the voice recognition device further comprises input means for inputting keyword information.

11. A computer readable medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-8.