CN114765025A - Method for generating a speech recognition model, recognition method, apparatus, medium, and device

Method for generating a speech recognition model, recognition method, apparatus, medium, and device

Info

Publication number
CN114765025A
Authority
CN
China
Prior art keywords
target
named entity
recognition model
data
speech recognition
Prior art date
Legal status
Pending
Application number
CN202210441630.7A
Other languages
Chinese (zh)
Inventor
马娆
吴璟成
马泽君
Current Assignee
Lemon Inc Cayman Island
Original Assignee
Lemon Inc Cayman Island
Priority date
Filing date
Publication date
Application filed by Lemon Inc Cayman Island
Priority to CN202210441630.7A
Publication of CN114765025A
Priority to PCT/SG2023/050236 (published as WO2023211369A2)
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 13/00 - Speech synthesis; text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/26 - Speech to text systems

Abstract

The present disclosure relates to a method for generating a speech recognition model, a recognition method, an apparatus, a medium, and a device. The generation method includes: acquiring a target named entity word list; screening preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words; performing speech synthesis processing on the target text data to determine target audio data; determining target training data based on the target audio data; and retraining a pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, where the initial training data is the audio data used to train the pre-trained speech recognition model. The target speech recognition model obtained by this generation method improves the recognition accuracy of named entity words.

Description

Method for generating a speech recognition model, recognition method, apparatus, medium, and device
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular to a method for generating a speech recognition model, a recognition method, an apparatus, a medium, and a device.
Background
With the development of deep learning technology, speech recognition models are applied ever more widely. In the related art, a speech recognition model is usually obtained through end-to-end training, and its recognition performance on audio is shaped by the training data. For words that rarely appear in the training data, the recognition performance of the speech recognition model is poor. How to improve the recognition accuracy of a speech recognition model for such words is therefore a technical problem that urgently needs to be solved.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method for generating a speech recognition model, including:
acquiring a target named entity word list, wherein the target named entity word list comprises a plurality of named entity words;
screening preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words;
performing speech synthesis processing on the target text data to determine target audio data;
determining target training data based on the target audio data;
retraining the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
In a second aspect, the present disclosure provides a speech recognition method, including:
acquiring audio to be recognized;
and processing the audio to be recognized according to a target speech recognition model to obtain a target text recognition result, wherein the target speech recognition model is generated according to the method of the first aspect.
In a third aspect, the present disclosure provides a speech recognition model generation apparatus, including:
a first obtaining module configured to obtain a target named entity vocabulary, the target named entity vocabulary including a plurality of named entity words;
the screening module is configured to screen preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words;
a first determining module, configured to perform speech synthesis processing on the target text data, and determine target audio data;
a second determination module configured to determine target training data based on the target audio data;
a training module configured to retrain the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
In a fourth aspect, the present disclosure provides a speech recognition apparatus comprising:
a second acquisition module configured to acquire audio to be recognized;
a processing module configured to process the audio to be recognized according to a target speech recognition model to obtain a target text recognition result, where the target speech recognition model is generated according to the method of the first aspect.
In a fifth aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a sixth aspect, the present disclosure provides an electronic device comprising:
a storage device having at least one computer program stored thereon;
at least one processing device for executing the at least one computer program in the storage device to perform the steps of the method of the first aspect.
Through the above technical solution, the target speech recognition model is trained on target training data constructed from target audio data containing named entity words; expanding the training data that contains named entity words improves the model's recognition of those words. Moreover, because the model parameters of the pre-trained speech recognition model were obtained by training on the initial training data, the pre-trained model already recognizes most audio data well, and retraining it does not damage the general recognition performance of the resulting target speech recognition model. The recognition of named entity words is thus improved without affecting general recognition performance, which improves the stability of the target speech recognition model.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic diagram illustrating one implementation environment in accordance with an exemplary embodiment of the present disclosure.
FIG. 2 is a flow chart illustrating a method of generating a speech recognition model according to an exemplary embodiment of the present disclosure.
FIG. 3 is a flowchart illustrating training of a target speech recognition model according to an exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating obtaining a target named entity word list according to an exemplary embodiment of the present disclosure.
FIG. 5 is a flow chart illustrating a method of speech recognition according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating a speech recognition model generation apparatus according to an exemplary embodiment of the present disclosure.
Fig. 7 is a block diagram illustrating a speech recognition apparatus according to an exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of an electronic device shown in accordance with an exemplary embodiment of the present disclosure.
Detailed Description
It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed of the type, scope of use, and usage scenarios of the personal information involved, and the user's authorization should be obtained in an appropriate manner in accordance with relevant laws and regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the requested operation will require acquiring and using the user's personal information. The user can then autonomously decide, based on the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that performs the operations of the disclosed technical solution.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent, for example, as a pop-up window, in which the prompt information may be presented as text. The pop-up window may also carry a selection control by which the user chooses to "agree" or "disagree" to provide personal information to the electronic device.
It is understood that the above notification and user authorization process is only illustrative and not limiting, and other ways of satisfying relevant laws and regulations may be applied to the implementation of the present disclosure.
At the same time, it is understood that the data involved in the present disclosure (including but not limited to the data itself, the acquisition or use of the data) should comply with the requirements of the relevant laws and regulations and related regulations.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a" and "an" in this disclosure are illustrative rather than limiting; those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
All actions of acquiring signals, information or data in the present disclosure are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
In the related art, a speech recognition model is usually obtained through end-to-end training; for example, it may be an encoder-decoder model trained end to end, and its recognition performance on audio is shaped by the training data. For words that rarely occur in the training data, the recognition performance of the speech recognition model is poor. Named entity words are entity words with specific meanings in text, such as place names, food names, movie names, action names, organization names, and proper nouns. Because of their specificity and novelty, named entity words usually appear infrequently in the training data, so the model recognizes them poorly.
In some embodiments, a language model externally attached to the speech recognition model may be used during decoding to boost the scores of named entity words and thereby obtain their recognition results. However, this approach has the following drawbacks: (1) because the external language model changes the decoding logic of the speech recognition model, it adds computation compared with recognizing named entity words end to end directly through the speech recognition model, and this added computation grows with the size of the language model; (2) an external language model usually damages the general recognition performance of the speech recognition model to some extent, affecting the overall recognition accuracy of the text to which the named entity words belong, and when the number of named entity words is large, the recognition accuracy degrades severely.
FIG. 1 is a schematic diagram illustrating one implementation environment according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the implementation environment may include: a model training device 110 and a model using device 120. In some embodiments, the model training device 110 may be a computer device, such as a computer, server, or the like, for training a target speech recognition model. The model training device 110 may train to obtain the target speech recognition model by using a machine learning method, and the training process of the target speech recognition model may refer to fig. 3 and the related description thereof, which are not described herein again.
The trained target speech recognition model can be deployed in the model using device 120 for use. The model-using device 120 may be a terminal device such as a mobile phone, a tablet computer, a personal computer, a multimedia playing device, or a server. The model using device 120 may process the audio to be recognized through the target speech recognition model to obtain a target text recognition result. For specific details of obtaining the target text recognition result, reference may be made to fig. 5 and the related description thereof, which are not described herein again.
FIG. 2 is a flow chart illustrating a method of generating a speech recognition model according to an exemplary embodiment of the present disclosure. As shown in fig. 2, the method may include the following steps.
Step 210, a target named entity vocabulary is obtained, wherein the target named entity vocabulary comprises a plurality of named entity words.
In some embodiments, the named entity word may be an entity word having a particular meaning, and the named entity word may include at least one of: place name, food name, movie name, action name, organization name, and proper noun.
In some embodiments, the target named entity word list may be a pre-obtained named entity word list, for example, the target named entity word list may be pre-obtained by counting the named entity words in the public text data, and the number of the named entity words included in the target named entity word list may be specifically determined according to actual situations, for example, 100 or 200. The public text data can be specifically determined according to actual situations, for example, the public text data can be published documents, books, articles and the like, and the disclosure does not set any limit to the public text data.
In some embodiments, the target named entity vocabulary may be obtained by filtering the initial named entity vocabulary according to a second preset condition. The initial named entity vocabulary may be the aforementioned pre-obtained named entity vocabulary, and for details of the pre-obtained named entity vocabulary, reference may be made to the related description above, which is not repeated herein.
Step 220, screening preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words.
In some embodiments, the pre-set textual data may include public textual data and/or platform textual data. The platform text data may include comment text authorized to be used by the user, and the comment text authorized to be used by the user may be a comment sent in text form by the user in an application program of the platform. For specific details of the public text data, reference may be made to step 210 and its related description, which are not repeated herein.
In some embodiments, the target text data may be text in the predetermined text data that includes one or more named entity words in the target named entity word list. The character length of the target text data may be set in advance, and for example, the target text data may be a long sentence or a short sentence, or the like.
Step 230, performing speech synthesis processing on the target text data to determine target audio data.
The target audio data may be audio data containing one or more named entity words in the target named entity vocabulary. In some embodiments, the target audio data may be obtained by performing a speech synthesis process according to the target text data. In some embodiments, a speech synthesis process may be used to convert text data into audio data.
In some embodiments, performing speech synthesis processing on the target text data to determine the target audio data includes: performing speech synthesis processing on the target text data according to a pre-trained speech synthesis model to determine the target audio data. In some embodiments, the speech synthesis model may be obtained through machine learning by training on a plurality of sample texts carrying sample labels.
In some embodiments, the speech synthesis model may be trained based on: obtaining a plurality of sample texts carrying sample labels; and performing model training on the initial speech synthesis model through a plurality of sample texts to obtain the speech synthesis model. The sample tags may include audio features, such as mel-frequency spectral features.
In some embodiments, parameters of the initial speech synthesis model may be iteratively updated based on the plurality of sample texts to reduce the loss function value corresponding to each sample text, resulting in a trained speech synthesis model. In some embodiments, the loss function value corresponding to each sample text is determined by: inputting the sample text into the speech synthesis model to obtain predicted audio features; and determining the loss function value based on the difference between the predicted audio features and the sample label. In some embodiments, the predicted audio features may be output by a decoder included in the speech synthesis model. The speech synthesis model may adopt a Tacotron model; for specific details of the Tacotron model, reference may be made to the related literature, which is not repeated here.
During the training of the speech synthesis model, parameters of the speech synthesis model may be continuously updated based on the plurality of sample texts. For example, the parameters of the speech synthesis model may be continuously adjusted to reduce the loss function value corresponding to each sample text so that the loss function value satisfies the preset condition. For example, the loss function value converges, or the loss function value is less than a preset value. And when the loss function value meets the preset condition, finishing model training to obtain a trained voice synthesis model.
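For illustration only, the training procedure described above can be sketched with a toy PyTorch model; the disclosure does not prescribe a framework, the toy network below is merely a stand-in for a Tacotron-style synthesizer, and all shapes, names, and hyperparameters are assumptions.

```python
import torch
from torch import nn

# Toy stand-in for a text-to-Mel-spectrogram model; a real system would
# use a Tacotron-style encoder/decoder. Shapes and names are illustrative.
class ToySynthesizer(nn.Module):
    def __init__(self, vocab_size=100, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.proj = nn.Linear(128, n_mels)

    def forward(self, token_ids):                 # (batch, time)
        return self.proj(self.embed(token_ids))   # (batch, time, n_mels)

model = ToySynthesizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # difference between predicted and labelled Mel features

tokens = torch.randint(0, 100, (4, 50))   # fake tokenized sample texts
mel_labels = torch.randn(4, 50, 80)       # fake Mel-spectrogram sample labels

for step in range(100):                    # iterate until the loss converges
    pred = model(tokens)
    loss = loss_fn(pred, mel_labels)       # loss per batch of sample texts
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The loop mirrors the description: parameters are adjusted to reduce the loss until it converges or falls below a preset value.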
As described above, the target audio data may be obtained by speech synthesis from the target text data. Alternatively, in some embodiments, the target audio data may be obtained by screening platform audio data based on the named entity words in the target named entity word list. The platform audio data may be voice comments that users have authorized for use, that is, comments that users send in voice form in an application program of the platform.
At step 240, target training data is determined based on the target audio data.
In some embodiments, the target audio data may be determined directly as the target training data. In some embodiments, the target audio data may be subjected to a noise adding process and/or a variable speed process to obtain processed target audio data, and the processed target audio data may be determined as the target training data.
In some embodiments, the target training data may be determined based on the target audio data and the processed target audio data. In some embodiments, determining target training data based on the target audio data comprises: carrying out noise adding processing and/or variable speed processing on the target audio data to obtain processed target audio data; target training data is determined based on the target audio data and the processed target audio data. In some embodiments, the target audio data and the processed target audio data may be mixed according to a preset ratio to determine the target training data. The preset ratio may be specifically determined according to actual conditions, for example, the preset ratio may be 1:2, and the like.
In some embodiments, the noise adding process may refer to adding noise to the target audio data, and the variable speed process may refer to speeding up or slowing down the target audio data. In some embodiments, the noise adding process and/or the variable speed process may be implemented with an audio processing tool. Noise adding and/or variable speed processing increase the diversity of the training data, so that the target speech recognition model obtained by subsequent training recognizes audio more accurately.
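For illustration only, the noise adding, variable speed processing, and 1:2 mixing can be sketched with numpy as follows; real systems typically rely on dedicated audio processing tools, and the SNR and speed values here are assumptions.

```python
import numpy as np

def add_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white noise at a given signal-to-noise ratio (illustrative)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def change_speed(audio: np.ndarray, factor: float = 1.1) -> np.ndarray:
    """Speed up (>1) or slow down (<1) by resampling the waveform."""
    new_len = int(len(audio) / factor)
    old_idx = np.linspace(0, len(audio) - 1, num=new_len)
    return np.interp(old_idx, np.arange(len(audio)), audio)

# Mix original and processed target audio at a preset ratio (e.g. 1:2).
original = np.random.randn(16000)             # 1 s of fake 16 kHz audio
processed = [add_noise(original), change_speed(original, 0.9)]
training_pool = [original] + processed        # 1 original : 2 processed
```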
Step 250, retraining the pre-trained speech recognition model based on the initial training data and the target training data to obtain the target speech recognition model, where the initial training data is the audio data used to train the pre-trained speech recognition model.
In some embodiments, the initial training data may be audio data that contains no named entity words, or only a small number of them. In some embodiments, the language of the audio data corresponding to the initial training data and the target training data may be any language, for example Chinese, English, or German, and the trained target speech recognition model may be applied to audio to be recognized in various languages; this disclosure places no limitation on the language.
In some embodiments, retraining the pre-trained speech recognition model based on the initial training data and the target training data may include: adjusting model parameters of the pre-trained speech recognition model based on the initial training data and the target training data to obtain the target speech recognition model, where the model parameters of the pre-trained speech recognition model are obtained by training the speech recognition model on the initial training data. For specific details of training the target speech recognition model, reference may be made to fig. 3 and the related description thereof, which are not repeated here.
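For illustration only, the retraining of step 250 can be sketched as fine-tuning a toy PyTorch model on the concatenation of the initial and target training data; the model, datasets, learning rate, and commented-out checkpoint path are all hypothetical stand-ins, not the disclosure's actual architecture.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Tiny stand-in ASR model; a real system would be an end-to-end
# encoder-decoder network. This one just maps features to token logits.
class ToyASR(nn.Module):
    def __init__(self, n_feats=80, n_tokens=32):
        super().__init__()
        self.net = nn.Linear(n_feats, n_tokens)

    def forward(self, feats):          # (batch, time, n_feats)
        return self.net(feats)         # (batch, time, n_tokens)

model = ToyASR()
# model.load_state_dict(torch.load("pretrained_asr.pt"))  # hypothetical checkpoint

# Mix the initial training data with the synthesized target training data.
initial_data = TensorDataset(torch.randn(64, 100, 80), torch.randint(0, 32, (64, 100)))
target_data = TensorDataset(torch.randn(32, 100, 80), torch.randint(0, 32, (32, 100)))
loader = DataLoader(ConcatDataset([initial_data, target_data]), batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR for retraining
loss_fn = nn.CrossEntropyLoss()
for feats, labels in loader:
    logits = model(feats)
    loss = loss_fn(logits.transpose(1, 2), labels)  # CE over the token dimension
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```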
In the embodiments of the present disclosure, the target speech recognition model is trained on target training data constructed from target audio data containing named entity words; expanding the training data containing named entity words improves the model's recognition of those words. Moreover, because the model parameters of the pre-trained speech recognition model were obtained by training on the initial training data, the pre-trained model already recognizes most audio data well, and retraining does not reduce the recognition performance of the resulting target speech recognition model on a general data set (for example, audio data containing no named entity words). The general recognition performance of the retrained target speech recognition model is therefore not damaged: recognition of named entity words improves while general recognition performance is unaffected, which improves the stability of the target speech recognition model.
FIG. 3 is a flow chart illustrating training of a target speech recognition model according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the method may include the following steps.
Step 310, processing the initial training data and the target training data respectively according to the pre-trained speech recognition model to obtain respective text recognition results.
Step 320, determining a loss function value of the pre-trained speech recognition model based on the difference between each text recognition result and the label carried by the training data corresponding to that result, where the label is obtained based on the text data corresponding to the training data.
Step 330, iteratively adjusting parameters of the pre-trained speech recognition model according to the loss function value to obtain the target speech recognition model.
In some embodiments, the label may be the real text of the audio data to which the training data corresponds. Labels can be obtained in advance through manual annotation.
In some embodiments, the loss function values may include a first loss function value derived from the difference in probability distribution between the text recognition result and the label, and/or a second loss function value derived from the error rate between the text recognition result and the label. Iteratively adjusting the parameters of the pre-trained speech recognition model according to the loss function value to obtain the target speech recognition model then includes: iteratively adjusting the parameters of the pre-trained speech recognition model according to the first loss function value and/or the second loss function value until a speech recognition model whose parameters satisfy a first preset condition is obtained, and determining that speech recognition model as the target speech recognition model.
In some embodiments, the first loss function value may be a cross entropy, a relative entropy, or the like, and the second loss function value may be based on a word error rate, a sentence error rate, or the like. For specific details of exemplary first and second loss function values, reference may be made to the related art, which is not repeated here.
In some embodiments, iteratively adjusting the parameters of the pre-trained speech recognition model according to the first loss function value and the second loss function value may refer to: iteratively adjusting the parameters based on a fusion of the first loss function value and the second loss function value, where the fusion may be a weighted average of the two loss function values.
In some embodiments, the parameters satisfying the first preset condition may mean that the corresponding loss function value converges or is smaller than a preset value. In some embodiments, iteratively adjusting the parameters of the pre-trained speech recognition model according to the loss function values may refer to: iteratively adjusting the parameters based on the loss function values of the training data in the initial training data and the target training data. For example, the parameters of the speech recognition model may be continuously adjusted to reduce the loss function value corresponding to each training datum until the loss function value satisfies the preset condition, for example, the loss function value converges or falls below a preset value. When the loss function value satisfies the preset condition, model training ends and the trained target speech recognition model is obtained.
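For illustration only, the weighted fusion of the two loss values and the first preset condition can be sketched as follows; the weights, tolerance, and preset value are assumptions.

```python
def fused_loss(first_loss: float, second_loss: float,
               w1: float = 0.7, w2: float = 0.3) -> float:
    """Weighted average of the probability-distribution loss (e.g. cross
    entropy) and the error-rate-based loss; weights are illustrative."""
    return w1 * first_loss + w2 * second_loss

# First preset condition: stop once the fused loss converges
# or drops below a preset value (both thresholds are assumptions).
PRESET_VALUE = 0.05
def satisfies_first_preset_condition(prev_loss: float, curr_loss: float,
                                     tol: float = 1e-4) -> bool:
    return curr_loss < PRESET_VALUE or abs(prev_loss - curr_loss) < tol
```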
In the embodiment of the present disclosure, the parameters of the pre-trained speech recognition model are iteratively adjusted through a plurality of loss function values (for example, a first loss function value and a second loss function value), so that the performance of the finally obtained target speech recognition model, such as the recognition accuracy and stability of the model, can be improved.
Fig. 4 is a flowchart illustrating obtaining a target named entity word list according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the method may include the following steps.
Step 410, obtaining an initial named entity word list, where the initial named entity word list includes a greater number of named entity words than the target named entity word list.
Step 420, screening out named entity words that satisfy a second preset condition from the initial named entity word list based on the number of occurrences of each named entity word in the preset text data and/or in the initial training data.
Step 430, determining a target named entity word list based on the named entity words meeting the second preset condition.
It can be appreciated that the target named entity vocabulary is obtained by filtering the initial named entity vocabulary, and thus the initial named entity vocabulary can include a greater number of named entity words than the target named entity vocabulary. For details of the initial named entity vocabulary, refer to step 210 and related description thereof, which are not repeated herein.
In some embodiments, screening out named entity words that satisfy the second preset condition from the initial named entity word list based on the number of occurrences of each named entity word in the preset text data and/or in the initial training data includes: screening out from the initial named entity word list the named entity words whose number of occurrences in the preset text data is greater than a first preset threshold and/or whose number of occurrences in the initial training data is less than a second preset threshold, and determining them as the named entity words satisfying the second preset condition.
In some embodiments, the number of occurrences in the preset text data being greater than the first preset threshold may mean: the number of occurrences in the public text data and/or the platform text data is greater than the first preset threshold. A count above the first preset threshold may indicate that the corresponding named entity word is important. In some embodiments, the first and second preset thresholds may be determined according to the actual situation; this disclosure places no limitation on them.
By screening out the important named entity words and/or the named entity words that occur fewer than the second preset threshold times in the initial training data to construct the target named entity word list (see the sketch below), the named entity words in the target named entity word list are more targeted for training the speech recognition model, and the recognition accuracy of the target speech recognition model for named entity words is optimized in a targeted manner.
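For illustration only, the second preset condition can be sketched as the following frequency filter; the thresholds are assumptions, and the disclosure also allows using either condition alone rather than the conjunction shown here.

```python
from collections import Counter

def build_target_vocab(initial_vocab: list[str],
                       preset_texts: list[str],
                       initial_transcripts: list[str],
                       first_threshold: int = 50,
                       second_threshold: int = 5) -> list[str]:
    """Keep words frequent in the preset text data (important) but rare
    in the initial training data (under-represented)."""
    text_counts = Counter()
    train_counts = Counter()
    for word in initial_vocab:
        text_counts[word] = sum(t.count(word) for t in preset_texts)
        train_counts[word] = sum(t.count(word) for t in initial_transcripts)
    return [w for w in initial_vocab
            if text_counts[w] > first_threshold
            and train_counts[w] < second_threshold]
```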
In some embodiments, the method for generating a speech recognition model may further include: determining target test data based on the target audio data and the processed target audio data; and testing the target speech recognition model using the target test data. Target test data is determined similarly to target training data; for specific details, reference may be made to step 240 and the related description, which is not repeated here.
FIG. 5 is a flow chart illustrating a method of speech recognition according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the method may include the following steps.
Step 510, acquiring audio to be recognized.
Step 520, processing the audio to be recognized according to the target speech recognition model to obtain a target text recognition result.
In some embodiments, the audio to be recognized may be audio data that needs to be recognized as text, e.g., speech, music, etc. In some embodiments, the target speech recognition model may be obtained by retraining the pre-trained speech recognition model based on the initial training data and the target training data, and for specific details of the target speech recognition model, reference may be made to step 250 and related description thereof, which are not described herein again.
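For illustration only, inference with the trained target speech recognition model can be sketched as follows; the stand-in linear model, feature shapes, and commented-out checkpoint path are hypothetical.

```python
import torch
from torch import nn

# Stand-in for the retrained target speech recognition model.
model = nn.Linear(80, 32)                     # maps features to token logits
# model.load_state_dict(torch.load("target_asr.pt"))  # hypothetical checkpoint
model.eval()

audio_feats = torch.randn(1, 100, 80)         # features of the audio to be recognized
with torch.no_grad():
    logits = model(audio_feats)               # (1, time, n_tokens)
    token_ids = logits.argmax(dim=-1)         # greedy decoding sketch
# A real system maps token_ids back to text with its own tokenizer/decoder.
```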
Fig. 6 is a block diagram illustrating a speech recognition model generation apparatus according to an exemplary embodiment of the present disclosure, the speech recognition model generation apparatus 600 including:
a first obtaining module 610 configured to obtain a target named entity vocabulary, the target named entity vocabulary including a plurality of named entity words;
the screening module 620 is configured to screen preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words;
a first determining module 630, configured to perform speech synthesis processing on the target text data, and determine target audio data;
a second determination module 640 configured to determine target training data based on the target audio data;
a training module 650 configured to retrain the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
In some embodiments, the second determination module 640 is further configured to:
carrying out noise adding processing and/or variable speed processing on the target audio data to obtain processed target audio data;
determining the target training data based on the target audio data and the processed target audio data.
In some embodiments, the training module 650 is further configured to:
respectively processing the initial training data and the target training data according to the pre-trained voice recognition model to obtain respective text recognition results;
determining a loss function value of the pre-trained speech recognition model based on the difference between the respective text recognition result and a label carried by training data corresponding to the text recognition result, wherein the label is obtained based on the text data corresponding to the training data;
and iteratively adjusting the parameters of the pre-trained voice recognition model according to the loss function value to obtain the target voice recognition model.
In some embodiments, the loss function values include a first loss function value based on a difference in a probability distribution between the text recognition result and the tag, and/or a second loss function value based on an error rate between the text recognition result and the tag; the training module 650 is further configured to:
and iteratively adjusting parameters of the pre-trained voice recognition model according to the first loss function value and/or the second loss function value until a voice recognition model with the parameters meeting a first preset condition is obtained, and determining the voice recognition model as the target voice recognition model.
In some embodiments, the first determination module 630 is further configured to:
and performing the voice synthesis processing on the target text data according to a pre-trained voice synthesis model to determine the target audio data.
In some embodiments, the first obtaining module 610 is further configured to:
acquiring an initial named entity word list, wherein the number of named entity words included in the initial named entity word list is greater than that of the target named entity word list;
screening out named entity words that satisfy a second preset condition from the initial named entity word list based on the number of occurrences of each named entity word in the preset text data and/or in the initial training data;
and determining the target named entity word list based on the named entity words meeting the second preset condition.
In some embodiments, the first obtaining module 610 is further configured to:
and screening out the named entity words of which the number in the preset text data is greater than a first preset threshold value and/or the number in the initial training data is less than a second preset threshold value from the initial named entity word list, and determining the named entity words as the named entity words meeting a second preset condition.
Fig. 7 is a block diagram illustrating a voice recognition apparatus according to an exemplary embodiment of the present disclosure, the voice recognition apparatus 700 including:
a second obtaining module 710 configured to obtain the audio to be recognized;
and the processing module 720 is configured to process the audio to be recognized according to the target speech recognition model to obtain a target text recognition result.
Referring now to fig. 8, a schematic diagram of an electronic device (e.g., a terminal device or server of fig. 1) 800 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries at least one program that, when executed by the electronic device, causes the electronic device to: acquiring a target named entity word list, wherein the target named entity word list comprises a plurality of named entity words; screening preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words; performing voice synthesis processing on the target text data to determine target audio data; determining target training data based on the target audio data; retraining the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
Alternatively, the computer readable medium carries at least one program which, when executed by the electronic device, causes the electronic device to: acquire audio to be recognized; and process the audio to be recognized according to the target speech recognition model to obtain a target text recognition result.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a method of generating a speech recognition model, according to one or more embodiments of the present disclosure, including:
acquiring a target named entity word list, wherein the target named entity word list comprises a plurality of named entity words;
screening preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words;
performing speech synthesis processing on the target text data to determine target audio data;
determining target training data based on the target audio data;
retraining the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
Example 2 provides the method of example 1, the determining target training data based on the target audio data, including, in accordance with one or more embodiments of the present disclosure:
carrying out noise adding processing and/or variable speed processing on the target audio data to obtain processed target audio data;
determining the target training data based on the target audio data and the processed target audio data.
Example 3 provides the method of example 1 or 2, and the retraining the pre-trained speech recognition model based on the initial training data and the target training data to obtain the target speech recognition model includes:
respectively processing the initial training data and the target training data according to the pre-trained voice recognition model to obtain respective text recognition results;
determining a loss function value of the pre-trained speech recognition model based on the difference between the respective text recognition result and a label carried by training data corresponding to the text recognition result, wherein the label is obtained based on the text data corresponding to the training data;
and iteratively adjusting the parameters of the pre-trained voice recognition model according to the loss function value to obtain the target voice recognition model.
Example 4 provides the method of example 3, the loss function value including a first loss function value based on a difference in a probability distribution between the text recognition result and the tag, and/or a second loss function value based on an error rate between the text recognition result and the tag;
the iteratively adjusting the parameters of the pre-trained speech recognition model according to the loss function value to obtain the target speech recognition model includes:
and iteratively adjusting the parameters of the pre-trained speech recognition model according to the first loss function value and/or the second loss function value until the speech recognition model with the parameters meeting a first preset condition is obtained, and determining the speech recognition model as the target speech recognition model.
Example 5 provides the method of example 1, the performing speech synthesis processing on the target text data to determine target audio data, including:
and performing the voice synthesis processing on the target text data according to a pre-trained voice synthesis model to determine the target audio data.
Example 6 provides the method of example 1, wherein obtaining the target named entity vocabulary, includes:
acquiring an initial named entity word list, wherein the number of named entity words included in the initial named entity word list is greater than that of the target named entity word list;
screening out named entity words that satisfy a second preset condition from the initial named entity word list based on the number of occurrences of each named entity word in the preset text data and/or in the initial training data;
and determining the target named entity word list based on the named entity words meeting the second preset condition.
Example 7 provides the method of example 6, the screening out named entity words from the initial named entity word list that satisfy a second preset condition based on the number of the named entity words in the initial named entity word list in the preset text data and/or the number in the initial training data, including:
and screening out the named entity words of which the number in the preset text data is larger than a first preset threshold value and/or the number in the initial training data is smaller than a second preset threshold value from the initial named entity word list, and determining the named entity words as the named entity words meeting a second preset condition.
Example 8 provides a speech recognition method, according to one or more embodiments of the present disclosure, including:
acquiring audio to be recognized;
processing the audio to be recognized according to a target speech recognition model to obtain a target text recognition result, wherein the target speech recognition model is generated according to the method of any one of examples 1-7.
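Example 8 is ordinary inference with the retrained model. The sketch below assumes a hypothetical transcribe() method on the target model; the disclosure does not prescribe any particular decoding API.

```python
import wave

def recognize(audio_path: str, target_model) -> str:
    """Feed the audio to be recognized through the target speech recognition
    model and return the target text recognition result."""
    with wave.open(audio_path, "rb") as f:
        pcm_bytes = f.readframes(f.getnframes())   # the audio to be recognized
    return target_model.transcribe(pcm_bytes)      # hypothetical interface
```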
Example 9 provides a speech recognition model generation apparatus, according to one or more embodiments of the present disclosure, including:
a first obtaining module configured to obtain a target named entity word list, the target named entity word list including a plurality of named entity words;
a screening module configured to screen preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words;
a first determining module configured to perform speech synthesis processing on the target text data and determine target audio data;
a second determination module configured to determine target training data based on the target audio data;
and a training module configured to retrain the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
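The five modules of example 9 compose into a single pipeline. The sketch below shows one possible wiring, with every stage injected as a callable; all names are illustrative, and the augmentation stage of the second determination module is reduced to a pass-through for brevity.

```python
from typing import Callable, List, Sequence, Tuple

def generate_target_model(
    entity_words: Sequence[str],              # target named entity word list
    preset_texts: Sequence[str],              # preset text data
    initial_training_data: list,              # data the pre-trained model saw
    tts_model: Callable[[str], list],         # pre-trained speech synthesis model
    retrain: Callable[[list, list], object],  # retraining procedure
):
    # Screening module: keep sentences containing at least one entity word.
    target_texts = [s for s in preset_texts if any(w in s for w in entity_words)]
    # First determining module: synthesize target audio, keeping text as the label.
    target_audio: List[Tuple[list, str]] = [(tts_model(t), t) for t in target_texts]
    # Second determining module: augmentation omitted here (pass-through).
    target_training_data = target_audio
    # Training module: retrain on the initial plus the target training data.
    return retrain(initial_training_data, target_training_data)
```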
Example 10 provides the apparatus of example 9, wherein the second determination module is further configured to:
performing noise addition processing and/or variable speed processing on the target audio data to obtain processed target audio data;
determining the target training data based on the target audio data and the processed target audio data.
Example 11 provides the apparatus of example 9 or 10, wherein the training module is further configured to:
respectively processing the initial training data and the target training data according to the pre-trained speech recognition model to obtain respective text recognition results;
determining a loss function value of the pre-trained speech recognition model based on the difference between each text recognition result and a label carried by the training data corresponding to that text recognition result, wherein the label is obtained based on the text data corresponding to the training data;
and iteratively adjusting the parameters of the pre-trained speech recognition model according to the loss function value to obtain the target speech recognition model.
Example 12 provides the apparatus of example 11, wherein the loss function value includes a first loss function value based on a difference in probability distribution between the text recognition result and the label, and/or a second loss function value based on an error rate between the text recognition result and the label; the training module is further configured to:
iteratively adjusting the parameters of the pre-trained speech recognition model according to the first loss function value and/or the second loss function value until a speech recognition model whose parameters meet a first preset condition is obtained, and determining that speech recognition model as the target speech recognition model.
Example 13 provides the apparatus of example 9, wherein the first determining module is further configured to:
performing the speech synthesis processing on the target text data according to a pre-trained speech synthesis model to determine the target audio data.
Example 14 provides the apparatus of example 9, wherein the first obtaining module is further configured to:
acquiring an initial named entity word list, wherein the number of named entity words included in the initial named entity word list is greater than that of the target named entity word list;
screening out named entity words meeting a second preset condition from the initial named entity word list based on the number of occurrences of each named entity word in the initial named entity word list in the preset text data and/or in the initial training data;
and determining the target named entity word list based on the named entity words meeting the second preset condition.
Example 15 provides the apparatus of example 14, wherein the first obtaining module is further configured to:
screening out, from the initial named entity word list, the named entity words whose number of occurrences in the preset text data is greater than a first preset threshold and/or whose number of occurrences in the initial training data is less than a second preset threshold, and determining them as the named entity words meeting the second preset condition.
Example 16 provides, in accordance with one or more embodiments of the present disclosure, a speech recognition apparatus comprising:
a second acquisition module configured to acquire audio to be recognized;
a processing module configured to process the audio to be recognized according to a target speech recognition model to obtain a target text recognition result, wherein the target speech recognition model is generated according to the method of any one of examples 1 to 7.
Example 17 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-8, in accordance with one or more embodiments of the present disclosure.
Example 18 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having at least one computer program stored thereon;
at least one processing device for executing the at least one computer program in the storage device to implement the steps of the method of any of examples 1-8.
The foregoing description is merely an explanation of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, a technical solution formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (12)

1. A method for generating a speech recognition model, comprising:
acquiring a target named entity word list, wherein the target named entity word list comprises a plurality of named entity words;
screening preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words;
performing voice synthesis processing on the target text data to determine target audio data;
determining target training data based on the target audio data;
retraining the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
2. The method of claim 1, wherein determining target training data based on the target audio data comprises:
performing noise addition processing and/or variable speed processing on the target audio data to obtain processed target audio data;
determining the target training data based on the target audio data and the processed target audio data.
3. The method according to claim 1 or 2, wherein the retraining the pre-trained speech recognition model based on the initial training data and the target training data to obtain a target speech recognition model comprises:
respectively processing the initial training data and the target training data according to the pre-trained speech recognition model to obtain respective text recognition results;
determining a loss function value of the pre-trained speech recognition model based on a difference between each text recognition result and a label carried by the training data corresponding to that text recognition result, wherein the label is obtained based on the text data corresponding to the training data;
and iteratively adjusting the parameters of the pre-trained speech recognition model according to the loss function value to obtain the target speech recognition model.
4. The method of claim 3, wherein the loss function value comprises a first loss function value based on a difference in probability distribution between the text recognition result and the label, and/or a second loss function value based on an error rate between the text recognition result and the label;
the iteratively adjusting the parameters of the pre-trained speech recognition model according to the loss function value to obtain the target speech recognition model comprises:
iteratively adjusting the parameters of the pre-trained speech recognition model according to the first loss function value and/or the second loss function value until a speech recognition model whose parameters meet a first preset condition is obtained, and determining that speech recognition model as the target speech recognition model.
5. The method of claim 1, wherein the performing speech synthesis processing on the target text data to determine target audio data comprises:
performing the speech synthesis processing on the target text data according to a pre-trained speech synthesis model to determine the target audio data.
6. The method of claim 1, wherein obtaining the target named entity word list comprises:
acquiring an initial named entity word list, wherein the number of named entity words included in the initial named entity word list is greater than that of the target named entity word list;
screening out named entity words meeting a second preset condition from the initial named entity word list based on the number of occurrences of each named entity word in the initial named entity word list in the preset text data and/or in the initial training data;
and determining the target named entity word list based on the named entity words meeting the second preset condition.
7. The method according to claim 6, wherein the screening out of the named entity words satisfying the second preset condition from the initial named entity word list based on the number of occurrences of each named entity word in the preset text data and/or in the initial training data comprises:
screening out, from the initial named entity word list, the named entity words whose number of occurrences in the preset text data is greater than a first preset threshold and/or whose number of occurrences in the initial training data is less than a second preset threshold, and determining them as the named entity words meeting the second preset condition.
8. A speech recognition method, comprising:
acquiring audio to be recognized;
processing the audio to be recognized according to a target speech recognition model to obtain a target text recognition result, wherein the target speech recognition model is generated according to the method of any one of claims 1 to 7.
9. A speech recognition model generation apparatus, comprising:
a first obtaining module configured to obtain a target named entity word list, the target named entity word list including a plurality of named entity words;
a screening module configured to screen preset text data based on the named entity words in the target named entity word list to obtain target text data containing the named entity words;
a first determining module configured to perform speech synthesis processing on the target text data and determine target audio data;
a second determination module configured to determine target training data based on the target audio data;
and a training module configured to retrain the pre-trained speech recognition model based on initial training data and the target training data to obtain a target speech recognition model, wherein the initial training data is audio data used for training the pre-trained speech recognition model.
10. A speech recognition apparatus, comprising:
a second acquisition module configured to acquire audio to be recognized;
a processing module configured to process the audio to be recognized according to a target speech recognition model to obtain a target text recognition result, wherein the target speech recognition model is generated according to the method of any one of claims 1 to 7.
11. A computer-readable medium having stored thereon a computer program which, when executed by a processing apparatus, carries out the steps of the method according to any one of claims 1 to 8.
12. An electronic device, comprising:
a storage device having at least one computer program stored thereon;
at least one processing device for executing the at least one computer program in the storage device to carry out the steps of the method according to any one of claims 1 to 8.
CN202210441630.7A 2022-04-25 2022-04-25 Method for generating and recognizing speech recognition model, device, medium and equipment Pending CN114765025A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210441630.7A CN114765025A (en) 2022-04-25 2022-04-25 Method for generating and recognizing speech recognition model, device, medium and equipment
PCT/SG2023/050236 WO2023211369A2 (en) 2022-04-25 2023-04-06 Speech recognition model generation method and apparatus, speech recognition method and apparatus, medium, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210441630.7A CN114765025A (en) 2022-04-25 2022-04-25 Method for generating and recognizing speech recognition model, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN114765025A (en) 2022-07-19

Family

ID=82364996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210441630.7A Pending CN114765025A (en) 2022-04-25 2022-04-25 Method for generating and recognizing speech recognition model, device, medium and equipment

Country Status (2)

Country Link
CN (1) CN114765025A (en)
WO (1) WO2023211369A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174084A (en) * 2023-11-02 2023-12-05 摩尔线程智能科技(北京)有限责任公司 Training data construction method and device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346064B (en) * 2018-12-13 2021-07-27 思必驰科技股份有限公司 Training method and system for end-to-end speech recognition model
WO2020231522A1 (en) * 2019-05-10 2020-11-19 Google Llc Using context information with end-to-end models for speech recognition
CN110827791B (en) * 2019-09-09 2022-07-01 西北大学 Edge-device-oriented speech recognition-synthesis combined modeling method
CN110675864A (en) * 2019-09-12 2020-01-10 上海依图信息技术有限公司 Voice recognition method and device
CN113470662A (en) * 2020-03-31 2021-10-01 微软技术许可有限责任公司 Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
US11545133B2 (en) * 2020-10-12 2023-01-03 Google Llc On-device personalization of speech synthesis for training of speech model(s)
CN113470626B (en) * 2021-06-30 2024-01-26 北京有竹居网络技术有限公司 Training method, device and equipment for voice recognition model
CN113782013B (en) * 2021-09-15 2024-01-30 北京百度网讯科技有限公司 Method, apparatus, storage medium and program product for speech recognition and model training

Also Published As

Publication number Publication date
WO2023211369A2 (en) 2023-11-02
WO2023211369A3 (en) 2024-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination