CN115547333A - Method, device, system, equipment and medium for generating language recognition model - Google Patents

Method, device, system, equipment and medium for generating language recognition model

Info

Publication number
CN115547333A
Authority
CN
China
Prior art keywords
model
text
vertical domain
generating
target vertical
Prior art date
Legal status
Pending
Application number
CN202211216345.1A
Other languages
Chinese (zh)
Inventor
杨秦露丹
范利春
曾静
刘畅
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd and Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202211216345.1A
Publication of CN115547333A
Status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a method, an apparatus, a system, a device, and a medium for generating a language recognition model. The method includes: generating a first base model according to a first text corpus; determining the vertical domain category to which each labeled text in the first text corpus belongs, and counting the number of texts corresponding to each vertical domain category; determining a preset number of vertical domain categories with the largest numbers of texts as target vertical domain categories; for each target vertical domain category, generating a target vertical domain category model corresponding to that category according to the labeled texts corresponding to it; and generating the language recognition model according to the first base model and each target vertical domain category model. In this way, the language recognition model can be iterated and updated rapidly, which improves its iteration and update efficiency and further improves the accuracy with which the speech recognition system recognizes user speech.

Description

Method, device, system, equipment and medium for generating language recognition model
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a system, a device, and a medium for generating a language recognition model.
Background
Speech recognition technology converts human speech into input that a computer can process. It is widely applied in fields such as voice dialing, voice navigation, and automatic device control. At present, human speech is mostly converted into text by a speech recognition system. Such a system typically uses a language recognition model together with an acoustic model, where the language recognition model calculates the probability of a sentence, that is, it judges how likely a sentence is to be natural human language. With the development of statistical models, the language recognition model in a speech recognition system is usually an N-gram model.
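For reference, an N-gram language model scores a sentence w_1 … w_T with the standard factorization below, conditioning each word only on the preceding N-1 words (a well-known formulation, not specific to this disclosure):

P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1}, \dots, w_{t-1})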
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, an apparatus, a system, a device, and a medium for generating a language recognition model.
According to a first aspect of the embodiments of the present disclosure, a method for generating a language recognition model is provided, including:
generating a first basic model according to a first text corpus, wherein the first text corpus is a pre-collected labeled text corresponding to user voice;
determining the vertical domain category of each labeled text in the first text corpus, and counting the text quantity corresponding to each vertical domain category;
determining a preset number of vertical domain categories with the maximum text number as target vertical domain categories;
for each target vertical domain category, generating a target vertical domain category model corresponding to the target vertical domain category according to the labeled text corresponding to the target vertical domain category;
and generating a language recognition model according to the first basic model and each target vertical domain category model.
Optionally, generating a first base model according to the first text corpus includes:
generating an online data model according to the first text corpus; and
determining the labeled text corresponding to the misrecognized user speech in the first text corpus, and generating a first error correction model according to the labeled text corresponding to the misrecognized user speech;
and generating a first basic model according to the online data model and the first error correction model.
Optionally, the generating a first basic model according to the first text corpus further includes:
determining a sentence pattern of each labeled text in the first text corpus, and determining the sentence pattern with the occurrence frequency larger than a preset threshold value as a target sentence pattern;
acquiring a second text corpus constructed by the user according to the target sentence pattern, and generating a new data model according to the second text corpus;
generating a first base model according to the online data model and the first error correction model, including:
and carrying out interpolation combination on the online data model, the first error correction model and the newly added data model to generate a first basic model.
Optionally, the method further comprises:
generating a multi-vertical-domain category model according to the labeled texts corresponding to other vertical-domain categories except the target vertical-domain category;
generating the language identification model according to the first base model and each target vertical domain category model, including:
generating a second basic model according to the first basic model and the multi-vertical domain category model;
and generating a language identification model according to the second basic model and each target vertical domain category model.
Optionally, the method further comprises:
acquiring a hot spot resource text in a preset time period, and generating a resource model according to the hot spot resource text; and
acquiring a labeled text corresponding to the user voice with the recognition error in the current time period, and generating a second error correction model according to the labeled text corresponding to the user voice with the recognition error in the current time period;
carrying out interpolation combination on the resource model and the second error correction model to generate a dynamic model;
generating a language identification model according to the first base model and each target vertical domain category model, wherein the generating of the language identification model comprises the following steps:
and generating a language identification model according to the first basic model, each target vertical domain type model and the dynamic model.
Optionally, the update frequency of the dynamic model is greater than the update frequency of the second base model.
Optionally, the method further comprises:
acquiring a requirement text corpus related to a requirement service, which is input by a user, and generating a service requirement model according to the requirement text corpus;
generating a language identification model according to the first base model and each target vertical domain category model, wherein the generating of the language identification model comprises the following steps:
and generating a language identification model according to the first basic model, each target vertical domain type model and the service demand model.
According to a second aspect of the embodiments of the present disclosure, there is provided a generation apparatus of a language recognition model, including:
the first generating module is configured to generate a first basic model according to a first text corpus, wherein the first text corpus is a pre-collected labeled text corresponding to the user voice;
the first determining module is configured to determine a vertical domain category to which each labeled text in the first text corpus belongs, and count the number of texts corresponding to each vertical domain category;
the second determining module is configured to determine a preset number of vertical domain categories with the largest text number as target vertical domain categories;
the second generation module is configured to generate a target vertical domain category model corresponding to each target vertical domain category according to the labeled text corresponding to the target vertical domain category;
and the third generation module is configured to generate a language recognition model according to the first base model and each target vertical domain category model.
According to a third aspect of embodiments of the present disclosure, there is provided a speech recognition system comprising a feature extraction model, an acoustic model, a language recognition model, a speech decoding and searching model, wherein the language recognition model is generated according to the method of the first aspect of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
generating a first basic model according to a first text corpus, wherein the first text corpus is a pre-collected labeled text corresponding to user voice;
determining the vertical domain category to which each labeled text in the first text corpus belongs, and counting the number of texts corresponding to each vertical domain category;
determining a preset number of vertical domain categories with the largest numbers of texts as target vertical domain categories;
for each target vertical domain category, generating a target vertical domain category model corresponding to the target vertical domain category according to the labeled text corresponding to the target vertical domain category;
and generating a language recognition model according to the first basic model and each target vertical domain category model.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of the first aspect of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
firstly, a first text corpus is utilized to generate a first basic model, then, the vertical domain category to which each labeled text in the first text corpus belongs is determined, the text quantity corresponding to each vertical domain category is counted, the vertical domain category with the preset quantity and the largest text quantity is determined as the target vertical domain category, a target vertical domain category model corresponding to the target vertical domain category is generated according to the labeled text corresponding to the target vertical domain category aiming at each target vertical domain category, and finally, a language recognition model is generated according to the first basic model and the target vertical domain category model. Therefore, the language recognition model is generated by superposition of the multiple layers of models, when the language recognition model is updated subsequently, the language recognition model can be updated only by updating the first basic model and the target vertical domain type model, so that the aims of quick iteration and updating can be fulfilled, the iteration and updating efficiency of the language recognition model is improved, and the accuracy of the speech recognition system for user speech recognition is further improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a speech recognition system shown in accordance with an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of generating a language identification model in accordance with an exemplary embodiment.
FIG. 3 is a diagram illustrating a language recognition model in accordance with an exemplary embodiment.
FIG. 4 is a block diagram illustrating an apparatus for generating a language identification model in accordance with an exemplary embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that all acquisition of signals, information, or data in the present application is performed in compliance with the data protection laws and policies of the relevant jurisdiction and with the authorization of the owner of the corresponding device.
In a speech recognition system, the language recognition model can expand the generalization capability of the system in a shorter time than the acoustic model can. Therefore, to improve the accuracy with which the speech recognition system recognizes user speech, the efficiency of iterating or updating the language recognition model needs to be improved.
In view of the above, the present disclosure provides a method, an apparatus, a system, a device and a medium for generating a language recognition model, so as to improve the iteration or updating efficiency of the language recognition model, thereby improving the recognition accuracy of a speech recognition system.
First, a speech recognition system will be described.
Generally, a speech recognition system may include four parts: a feature extraction model, an acoustic model, a language recognition model, and a speech decoding and search module. FIG. 1 is a schematic diagram illustrating a speech recognition system according to an exemplary embodiment. As shown in FIG. 1, user speech is first input into the feature extraction model for feature extraction, which converts the user speech signal from the time domain to the frequency domain and provides suitable feature vectors for the acoustic model. The feature vectors are then input into the acoustic model, which computes, based on acoustic characteristics, a score for each feature vector over the acoustic units. The language recognition model calculates, according to linguistic knowledge, the probabilities of the phrase sequences that may correspond to the sound signal. Finally, the speech decoding and search algorithm obtains the most likely text according to the existing dictionary, the acoustic scores computed by the acoustic model, and the phrase-sequence probabilities computed by the language recognition model.
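A minimal sketch of the decoding flow described above, with hypothetical object and method names standing in for the feature extraction model, the acoustic model, the language recognition model, and the decoder (none of these names come from the disclosure):

```python
def recognize(audio_waveform, feature_extractor, acoustic_model, language_model, decoder):
    """Hypothetical end-to-end flow of the speech recognition system of FIG. 1."""
    # Convert the time-domain signal into frequency-domain feature vectors.
    features = feature_extractor.extract(audio_waveform)
    # Score each feature vector against the acoustic units.
    acoustic_scores = acoustic_model.score(features)
    # The decoder searches for the word sequence that best combines the acoustic scores,
    # the pronunciation dictionary, and the language recognition model's probabilities.
    return decoder.search(acoustic_scores, language_model)
```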
As shown in FIG. 1, the training process of the acoustic model is as follows: first, feature extraction is performed on the speech samples in a speech database, and the acoustic model is trained with the extracted features to obtain a trained acoustic model. Similarly, the training process of the language recognition model is as follows: the language recognition model is trained with the text samples in a text database to obtain a trained language recognition model.
In addition, to enable the feature extraction model to extract effective features, the collected sound signal may first be preprocessed, for example by filtering and framing, so that the audio signal to be analyzed is extracted from the original audio signal; the feature extraction model then performs feature extraction on the audio signal to be analyzed.
FIG. 2 is a flow diagram illustrating a method of generating a language recognition model for use in the speech recognition system shown in FIG. 1, according to an exemplary embodiment. As shown in fig. 2, the method may include the following steps.
In step S21, a first base model is generated according to the first text corpus. The first text corpus is pre-collected labeled texts corresponding to user voices.
It should be understood that, when a language recognition model already exists but a new one needs to be generated, the first text corpus may be the labeled text corresponding to user speech received by the speech recognition system that uses the existing language recognition model. Illustratively, it may be the labeled text corresponding to user speech historically input into the speech recognition system. The present disclosure does not specifically limit this.
In the present disclosure, the first base model may be an N-gram model. The first text corpus may be the labeled texts of the user speech with which users interacted with the speech recognition system during a preset time period, or those labeled texts together with the number of occurrences of each utterance during the preset time period, and so on. Therefore, the first base model generated from the first text corpus can fit users' real requests, so that the language recognition model can recognize most user speech.
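A minimal sketch of building a count-based N-gram model from the first text corpus, assuming each labeled text comes with an occurrence count; word segmentation, proper smoothing, and ARPA export are omitted, and in practice a toolkit such as SRILM or KenLM would typically be used:

```python
from collections import defaultdict

def train_ngram_counts(labeled_texts, n=3):
    """labeled_texts: iterable of (text, occurrence_count) pairs from the first text corpus."""
    counts = defaultdict(int)          # n-gram tuple -> weighted count
    context_counts = defaultdict(int)  # (n-1)-gram tuple -> weighted count
    for text, occurrences in labeled_texts:
        tokens = ["<s>"] * (n - 1) + text.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            ngram = tuple(tokens[i - n + 1:i + 1])
            counts[ngram] += occurrences
            context_counts[ngram[:-1]] += occurrences
    return counts, context_counts

def ngram_prob(counts, context_counts, ngram, vocab_size):
    """Add-one smoothed conditional probability P(w | history) for one n-gram tuple."""
    return (counts[ngram] + 1) / (context_counts[ngram[:-1]] + vocab_size)
```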
In step S22, a vertical domain category to which each labeled text belongs in the first text corpus is determined, and a text quantity corresponding to each vertical domain category is counted.
In practical applications, the speech recognition system is used in different scenarios, and the corresponding online user speech belongs to different categories. For example, if the speech recognition system is applied to a flight ticket booking scenario, the online user speech generally belongs to categories such as flight schedule query, flight price query, and ticket change. If the speech recognition system is applied to a leisure and entertainment scenario, for example a smart speaker equipped with a speech recognition system, the online user speech usually belongs to categories such as music, movie, radio station, poetry, encyclopedia, and chat. Therefore, in the present disclosure, different models can be generated for the texts of different vertical domain categories, so that abnormal recognition problems can be located quickly.
In step S23, a preset number of vertical domain categories with the largest number of texts are determined as target vertical domain categories.
In step S24, for each target vertical domain category, a target vertical domain category model corresponding to the target vertical domain category is generated according to the labeled text corresponding to the target vertical domain category.
The online user speech handled by a typical speech recognition system belongs to a large number of categories; generating a model for every category would increase the workload of generating the language recognition model, while recognition errors are more likely to occur for the texts of the categories with a high frequency of use. Therefore, in the present disclosure, models are generated only for the commonly used vertical domain categories corresponding to the speech recognition system.
For example, data analysis is performed on the first text corpus to determine the vertical domain category to which each labeled text belongs and the number of texts corresponding to each vertical domain category. The vertical domain categories are then sorted by the number of texts to obtain a sorted list, the first N vertical domain categories in the sorted list are determined as the target vertical domain categories, and, for each target vertical domain category, a target vertical domain category model corresponding to that category is generated according to the labeled texts corresponding to it.
For example, data analysis is performed on the first text corpus, and the three vertical domain categories with the largest numbers of texts are determined to be the music category, the movie category, and the radio station category; that is, the music, movie, and radio station categories are all target vertical domain categories. The labeled texts belonging to the music category are used as training samples to train a music category model, the labeled texts belonging to the movie category are used as training samples to train a movie category model, and the labeled texts belonging to the radio station category are used as training samples to train a radio station category model.
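A minimal sketch of steps S22 to S24, assuming each labeled text already carries a vertical domain label and reusing the hypothetical train_ngram_counts helper from the sketch above:

```python
from collections import Counter, defaultdict

def build_target_category_models(labeled_texts, top_n=3):
    """labeled_texts: iterable of (text, occurrence_count, vertical_domain_category) triples."""
    texts_by_category = defaultdict(list)
    text_count = Counter()
    for text, occurrences, category in labeled_texts:
        texts_by_category[category].append((text, occurrences))
        text_count[category] += 1  # number of texts per vertical domain category
    # Step S23: keep the preset number of categories with the largest text counts.
    target_categories = [category for category, _ in text_count.most_common(top_n)]
    # Step S24: train one model per target vertical domain category from its own labeled texts.
    return {category: train_ngram_counts(texts_by_category[category])
            for category in target_categories}
```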
In step S25, a language identification model is generated based on the first base model and each target vertical domain category model.
Illustratively, the first base model and each target vertical domain category model are combined by interpolation to generate the language recognition model.
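The disclosure does not spell out the interpolation combination; a common choice is a weighted sum of the component models' conditional probabilities, sketched below with the hypothetical ngram_prob helper from the earlier sketch and assumed per-model weights:

```python
def interpolated_prob(models, weights, ngram, vocab_size):
    """Linear interpolation P(w | h) = sum_i lambda_i * P_i(w | h), with the lambdas summing to 1.

    models: list of (counts, context_counts) pairs, for example the first base model
    followed by each target vertical domain category model."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(weight * ngram_prob(counts, context_counts, ngram, vocab_size)
               for weight, (counts, context_counts) in zip(weights, models))
```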
It should be appreciated that a speech recognition system may receive over a billion user requests per day, and to ensure its recognition accuracy the speech recognition system needs to be updated frequently, that is, the language recognition model needs to be updated. In the present disclosure, when the language recognition model needs to be updated, the user speech input into the speech recognition system over a period of time and the number of occurrences of each utterance are first counted; then the labeled text corresponding to each utterance is obtained, the labeled texts and the occurrence counts are determined as the first text corpus used to update the first base model and the target vertical domain category models, and the first base model and the target vertical domain category models are updated with this first text corpus to obtain a new first base model and new target vertical domain category models.
According to the technical solution above, a first base model is generated from the first text corpus; then the vertical domain category to which each labeled text in the first text corpus belongs is determined, the number of texts corresponding to each vertical domain category is counted, and a preset number of vertical domain categories with the largest numbers of texts are determined as the target vertical domain categories; for each target vertical domain category, a target vertical domain category model is generated from the labeled texts corresponding to that category; finally, the language recognition model is generated from the first base model and the target vertical domain category models. Because the language recognition model is generated by superposing multiple layers of models, a subsequent update only requires updating the first base model and the target vertical domain category models. This achieves rapid iteration and updating, improves the iteration and update efficiency of the language recognition model, and further improves the accuracy with which the speech recognition system recognizes user speech.
In addition, the vertical domain categories other than the target vertical domain categories are not commonly used and have a low frequency of use, so the probability of recognition errors for them is low. A single model, namely a multi-vertical-domain category model, can therefore be generated from the labeled texts of these other vertical domain categories, which simplifies the structure of the language recognition model. In one embodiment, the method may further include: generating a multi-vertical-domain category model according to the labeled texts corresponding to the vertical domain categories other than the target vertical domain categories. Accordingly, in step S25, generating the language recognition model according to the first base model and the target vertical domain category models includes: generating a second base model according to the first base model and the multi-vertical-domain category model; and generating the language recognition model according to the second base model and the target vertical domain category models.
Exemplarily, the labeled texts corresponding to the vertical domain categories other than the target vertical domain categories are used as training samples to train the multi-vertical-domain category model of the language recognition model. For example, the multi-vertical-domain category model may be trained using the labeled texts of the poetry, encyclopedia, and chat categories as training samples. The multi-vertical-domain category model can be an N-gram model.
After the multi-vertical-domain category model is obtained, the second base model is generated according to the first base model and the multi-vertical-domain category model. Illustratively, the first base model and the multi-vertical-domain category model are combined by interpolation to obtain the second base model of the language recognition model. The multi-vertical-domain category model increases the generalization capability of the model, so the second base model determined from the first base model and the multi-vertical-domain category model ensures that the language recognition model has better generalization capability while still recognizing most user speech.
It should be understood that the number of texts corresponding to the other vertical domain categories is small and the texts are relatively complex; generating the second base model with these texts allows the multi-vertical-domain category model trained on them to supplement the target vertical domain category models, thereby improving the generalization capability of the language recognition model.
With this technical solution, the frequently used target vertical domain categories are modeled individually, while a single multi-vertical-domain category model is generated for the texts of the less frequently used vertical domain categories. On the one hand, the language recognition model can thus cover the full range of texts, which improves its recognition accuracy; on the other hand, abnormal recognition problems can be located quickly.
In one embodiment, the step S21 in fig. 2 of generating the first base model according to the first text corpus may include the following steps.
(1) An online data model is generated according to the first text corpus.
Because the first text corpus consists of the labeled texts of the user speech with which users interacted with the speech recognition system during the preset time period, the online data model generated from it closely follows the actual distribution of user requests and can cover most of them.
(2) The labeled texts corresponding to the misrecognized user speech in the first text corpus are determined, and a first error correction model is generated according to those labeled texts.
In practical applications, misrecognized user sentences inevitably occur in the speech recognition system. To improve its recognition accuracy, in this embodiment the language recognition model may further include a first error correction model for correcting historically misrecognized user sentences.
Illustratively, the labeled texts corresponding to misrecognized user speech are determined in the first text corpus according to misrecognitions reported by users. For example, the user says "rainforest" but the speech recognition system replies with content about "fish scales" (near-homophones in Chinese), so the user can report that the speech recognition system misrecognized the user speech "rainforest". Further illustratively, when the speech recognition system fails to recognize the user speech accurately, the user typically inputs another, related utterance, so the speech recognition system can judge from the next utterance whether the previous one was recognized accurately. For example, the user says "rainforest", the speech recognition system replies about "fish scales", and the user then usually says "rainforest weather"; when the speech recognition system receives the related utterance "rainforest weather", it determines that it failed to recognize the utterance "rainforest" accurately and marks "rainforest" as misrecognized user speech (a simple heuristic of this kind is sketched below).
(3) A first base model is generated based on the online data model and the first error correction model.
Illustratively, the online data model and the first error correction model are combined by interpolation to obtain a first basic model of the language identification model.
Therefore, the first base model of the language recognition model is obtained from the online data model and the first error correction model. On the one hand, the generated language recognition model contains more model layers, and the fact that an upper-layer model in a multi-layer structure can be iterated and updated rapidly improves the iteration and update efficiency of the language recognition model; on the other hand, the first error correction model corrects the misrecognized user data accumulated over a certain time, further improving the recognition precision of the language recognition model.
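One way to flag likely misrecognitions from consecutive user requests, as described in step (2); the overlap measure and the threshold are illustrative assumptions, since the disclosure does not fix a concrete test:

```python
def looks_like_misrecognition(previous_query, next_query, min_overlap=0.5):
    """Flag the previous request as possibly misrecognized when the user immediately
    re-issues a closely related query (heuristic sketch only)."""
    previous_tokens, next_tokens = set(previous_query.split()), set(next_query.split())
    if not previous_tokens:
        return False
    return len(previous_tokens & next_tokens) / len(previous_tokens) >= min_overlap
```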
In addition, when the amount of user speech from users' dialogues with the speech recognition system is small, the scenarios covered by the user sentences are limited, and a language recognition model generated only from the labeled texts of that user speech would have high recognition accuracy in some scenarios but low accuracy in others. Therefore, in another embodiment, step S21 in FIG. 2 may further include: determining the sentence pattern of each labeled text in the first text corpus, and determining the sentence patterns whose occurrence counts are larger than a preset threshold as target sentence patterns; and acquiring a second text corpus constructed by the user according to the target sentence patterns, and generating a newly added data model according to the second text corpus.
To enrich the text corpus and supplement the scenarios covered by the user speech, new sentences can be constructed from the sentence patterns that are used frequently online. Illustratively, a preset threshold is set, the sentence pattern of each labeled text in the first text corpus is determined, and the sentence patterns whose occurrence counts exceed the preset threshold are determined as target sentence patterns, that is, sentence patterns with high online usage. The target sentence patterns are then output so that a user can conveniently construct a second text corpus from them, and the second text corpus is used for training to obtain the newly added data model.
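A minimal sketch of selecting the target sentence patterns; how a sentence pattern is derived from a labeled text is not specified in the disclosure, so an upstream slot tagger that maps each text to a pattern string is assumed:

```python
from collections import Counter

def find_target_patterns(pattern_per_text, threshold):
    """pattern_per_text: one sentence pattern per labeled text, e.g. "play <SONG> by <ARTIST>".
    Returns the sentence patterns whose occurrence counts exceed the preset threshold."""
    frequency = Counter(pattern_per_text)
    return [pattern for pattern, count in frequency.items() if count > threshold]
```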
Accordingly, generating the first base model from the online data model and the first error correction model may include: combining the online data model, the first error correction model, and the newly added data model by interpolation to generate the first base model.
With this technical solution, the first base model of the language recognition model is obtained by interpolating the online data model, the first error correction model, and the newly added data model, so that the first base model covers a wider range of scenarios, which improves its generalization capability and, in turn, that of the language recognition model.
In addition, in practical applications, suddenly trending resource texts often appear, for example movie, music, or news resource texts that have become popular recently. To ensure that the speech recognition system can recognize users' requests for such trending resources, in an embodiment the method may further include: acquiring the hot spot resource texts within a preset time period, and generating a resource model according to the hot spot resource texts.
In the present disclosure, the hot spot resource texts within the preset time period may include texts corresponding to online user speech that users have already requested from the speech recognition system, and/or texts corresponding to offline user speech that users have not yet requested. The present disclosure does not specifically limit this. It should be understood that speech recognition systems serve different services and therefore have different hot spot resources. For example, for the speech recognition system in a smart speaker, the hot spot resource texts are usually movie, music, or news resource texts that have become popular recently.
Furthermore, it should be understood that, as a core model in the speech recognition system, the language recognition model can be iterated quickly to improve its generalization capability in the shortest possible time, but the stability of iteration must also be considered when generating the language recognition model. A stable model structure is therefore required to ensure stable iteration of the language recognition model. The second base model carries a large weight in the language recognition model, so to keep the language recognition model stable its update cycle is generally long and its update frequency low; consequently, the update cycle of the first error correction model is also long.
However, to ensure the recognition accuracy of the speech recognition system, speech recognition errors usually need to be corrected through frequent updates. Therefore, in one embodiment, the method may further include: acquiring the labeled texts corresponding to the user speech misrecognized in the current time period, and generating a second error correction model according to those labeled texts. For example, if the update cycle of the second error correction model is one day, the text corpus corresponding to the speech misrecognized on the current day is obtained; that is, the labeled texts corresponding to the user speech misrecognized on day T+1 are obtained when the second error correction model is generated.
Since the resource model and the second error correction model need to be updated frequently, a dynamic model of the language recognition model can be obtained from them, where a dynamic model is a model that needs frequent updating. Illustratively, the resource model and the second error correction model are combined by interpolation to generate the dynamic model, and the update frequency of the dynamic model is greater than that of the second base model. In this way, the stability of the language recognition model can be ensured.
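A sketch of how the layered structure supports different refresh cycles; the concrete periods below (daily for the dynamic model, much longer for the other layers) are illustrative assumptions, as the disclosure fixes no numbers:

```python
from dataclasses import dataclass

@dataclass
class LayerSchedule:
    name: str
    update_period_days: int  # a smaller period means a higher update frequency

SCHEDULE = [
    LayerSchedule("second base model", 30),
    LayerSchedule("target vertical domain category models", 30),
    LayerSchedule("dynamic model (resource model + second error correction model)", 1),
]

def layers_due_on(day_index):
    """Return the names of the layers that would be rebuilt on the given day."""
    return [layer.name for layer in SCHEDULE if day_index % layer.update_period_days == 0]
```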
Accordingly, in step S25 in FIG. 2, generating the language recognition model according to the first base model and the target vertical domain category models is specifically implemented as: generating the language recognition model according to the first base model, each target vertical domain category model, and the dynamic model.
With this technical solution, the resource model is generated from the hot spot resource texts, the second error correction model is generated from the labeled texts corresponding to the speech misrecognized in the current time period, and the dynamic model is obtained from the resource model and the second error correction model. As a result, the language recognition model can recognize trending resource texts, which further improves its generalization capability, and speech recognition errors can be repaired quickly, which further improves the recognition accuracy of the speech recognition system.
In addition, as technology develops, the services of the speech recognition system keep expanding. To enable the speech recognition system to meet new service requirements, in one embodiment the language recognition model may further include a service requirement model used to cover the requirement texts. Illustratively, the method may further include: acquiring a user-input requirement text corpus related to the required service, and generating the service requirement model according to the requirement text corpus. For example, if the newly added service is controlling an air conditioner through the speech recognition system, the requirement text corpus consists of texts related to controlling the air conditioner.
Accordingly, in step S25 in FIG. 2, generating the language recognition model according to the first base model and the target vertical domain category models is specifically implemented as: generating the language recognition model according to the first base model, each target vertical domain category model, and the service requirement model.
With this technical solution, when a service needs to be added to the speech recognition system, the language recognition model can quickly learn the texts corresponding to the newly added service through the service requirement model, so that the speech recognition system can recognize the corresponding requests and meet user needs.
Illustratively, FIG. 3 is a schematic diagram of a language recognition model according to an exemplary embodiment. As shown in FIG. 3, the online data model, the first error correction model, and the newly added data model are first interpolated and merged to obtain the first base model, and the resource model and the second error correction model are interpolated and merged to obtain the dynamic model; the first base model and the multi-vertical-domain category model are then interpolated and merged to obtain the second base model. Finally, the second base model, each target vertical domain category model, the dynamic model, and the service requirement model are interpolated and merged to obtain the language recognition model. In FIG. 3, the target vertical domain category models are exemplified as a music category model, a movie category model, and a radio station category model. That is, as shown in FIG. 3, the second base model, the music category model, the movie category model, the radio station category model, the dynamic model, and the service requirement model are interpolated and merged to obtain the language recognition model.
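Putting the layers of FIG. 3 together, a minimal sketch of the overall composition; the component models are represented as probability functions, and the helper and weight names are assumptions rather than anything defined in the disclosure:

```python
def interpolate(component_probs, weights):
    """Return a probability function equal to the weighted sum of component probability
    functions, each mapping an n-gram to P(w | history)."""
    assert len(component_probs) == len(weights) and abs(sum(weights) - 1.0) < 1e-6
    return lambda ngram: sum(w * p(ngram) for w, p in zip(weights, component_probs))

def build_language_model(online, err1, new_data, multi_vertical, resource, err2,
                         category_models, business, w):
    """Compose the language recognition model of FIG. 3 by repeated interpolation.
    `w` holds assumed weights per merge step; for example w["final"] needs
    3 + len(category_models) entries."""
    first_base = interpolate([online, err1, new_data], w["first_base"])
    second_base = interpolate([first_base, multi_vertical], w["second_base"])
    dynamic = interpolate([resource, err2], w["dynamic"])
    return interpolate([second_base, *category_models, dynamic, business], w["final"])
```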
It should be considered that the speech recognition system may also produce speech recognition errors when the acoustic model, the feature extraction model, or other components in the speech recognition system are at fault. Therefore, before iterating and updating the language recognition model, it can first be determined whether the cause of a speech recognition error is a recognition error of the language recognition model. If it is, the language recognition model is determined to need iteration and updating; otherwise, it is determined not to.
By way of example, whether the language recognition model needs to be iterated and updated is determined as follows. First, for a misrecognized utterance reported by a user, the perplexity of the wrong text (the text output by the speech recognition system) and the perplexity of the correct text (the labeled text corresponding to the utterance) are determined respectively. Then, if the perplexity of the correct text is less than the perplexity of the wrong text, the recognition error is attributed to inaccuracy of the acoustic model and the language recognition model does not need to be iterated and updated; if the perplexity of the correct text is greater than the perplexity of the wrong text, the language recognition model is determined to need iteration and updating. Next, it is determined whether the wrong text exists in the first text corpus, that is, the training samples used to iterate and update the language recognition model; if it does, the wrong text in the first text corpus is changed to the correct text, and if it does not, the correct text is added to the first text corpus. Meanwhile, the dictionary of the speech recognition system is queried to determine whether the pinyin corresponding to the wrong text in the dictionary is accurate, and if not, the pinyin corresponding to the wrong text in the dictionary can be corrected.
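A minimal sketch of the perplexity check and corpus patching described above; it assumes the language recognition model exposes a perplexity() method (the Python bindings of KenLM-style ARPA models do, for example), and the decision rule follows the preceding paragraph:

```python
def language_model_needs_update(lm, wrong_text, correct_text):
    """Compare the perplexity of the text the system actually produced with that of the
    correct labeled text. If the correct text is more perplexing to the language model,
    the language model is implicated and should be iterated and updated; otherwise the
    error is attributed to the acoustic side."""
    return lm.perplexity(correct_text) > lm.perplexity(wrong_text)

def patch_training_corpus(corpus, wrong_text, correct_text):
    """Change the wrong text in the first text corpus to the correct text if present,
    otherwise add the correct text to the corpus."""
    if wrong_text in corpus:
        return [correct_text if text == wrong_text else text for text in corpus]
    return corpus + [correct_text]
```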
In this way, a problem-repair tool is established based on perplexity comparison, training text indexing, dictionary querying, and the like, which effectively reduces manual work and locates and resolves recognition errors from multiple dimensions.
Based on the same inventive concept, the disclosure also provides a generating device of the language identification model. FIG. 4 is a block diagram illustrating an apparatus for generating a language recognition model in accordance with an exemplary embodiment. As shown in fig. 4, the generating device 400 of the language recognition model may include:
a first generating module 401, configured to generate a first basic model according to a first text corpus, where the first text corpus is a pre-collected tagged text corresponding to a user voice;
a first determining module 402, configured to determine a vertical domain category to which each tagged text in the first text corpus belongs, and count a text quantity corresponding to each vertical domain category;
a second determining module 403 configured to determine a preset number of vertical domain categories with the largest number of texts as target vertical domain categories;
a second generating module 404, configured to generate, for each target vertical domain category, a target vertical domain category model corresponding to the target vertical domain category according to the labeled text corresponding to the target vertical domain category;
a third generating module 405 configured to generate a language identification model according to the first base model and each of the target vertical domain category models.
Optionally, the first generating module 401 includes:
the first generation submodule is configured to generate an online data model according to the first text corpus; and
the second generation submodule is configured to determine a labeled text corresponding to the user voice with the recognition error in the first text corpus, and generate a first error correction model according to the labeled text corresponding to the user voice with the recognition error;
a third generation submodule configured to generate a first base model from the online data model and the first error correction model.
Optionally, the first generating module 401 further includes:
the first determining submodule is configured to determine a sentence pattern of each labeled text in the first text corpus, and determine the sentence pattern with the occurrence frequency larger than a preset threshold value as a target sentence pattern;
the fourth generation submodule is configured to acquire a second text corpus constructed by the user according to the target sentence pattern, and generate a new data model according to the second text corpus;
the third generation submodule is configured to: and carrying out interpolation combination on the online data model, the first error correction model and the newly added data model to generate a first basic model.
Optionally, the apparatus further comprises:
a fourth generation module, configured to generate a multi-vertical-domain category model according to the labeled texts corresponding to other vertical-domain categories except the target vertical-domain category;
the third generating module 405 is configured to: generating a second basic model according to the first basic model and the multi-vertical domain category model; and generating a language recognition model according to the second basic model and each target vertical domain category model.
Optionally, the apparatus further comprises:
the fifth generation module is configured to acquire a hot resource text in a preset time period and generate a resource model according to the hot resource text; and
a sixth generating module, configured to obtain a labeled text corresponding to the user voice identified with an error in the current time period, and generate a second error correction model according to the labeled text corresponding to the user voice identified with an error in the current time period;
a seventh generating module, configured to perform interpolation combination on the resource model and the second error correction model to generate a dynamic model;
the third generating module 405 is configured to: and generating a language identification model according to the first basic model, each target vertical domain category model and the dynamic model.
Optionally, the update frequency of the dynamic model is greater than the update frequency of the second base model.
Optionally, the apparatus further comprises:
the eighth generation module is configured to acquire a requirement text corpus related to a requirement service, which is input by a user, and generate a service requirement model according to the requirement text corpus;
the third generating module 405 is configured to: and generating a language identification model according to the first basic model, each target vertical domain type model and the service demand model.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of generating a language identification model provided by the present disclosure.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment. For example, the electronic device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, electronic device 500 may include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, input/output interface 512, sensor component 514, and communication component 516.
The processing component 502 generally controls overall operation of the electronic device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operations at the electronic device 500. Examples of such data include instructions for any application or method operating on the electronic device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 506 provides power to the various components of the electronic device 500. The power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 500.
The multimedia component 508 includes a screen that provides an output interface between the electronic device 500 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 500 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.
The input/output interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the electronic device 500. For example, the sensor assembly 514 may detect an open/closed state of the electronic device 500, the relative positioning of components, such as a display and keypad of the electronic device 500, the sensor assembly 514 may detect a change in the position of the electronic device 500 or a component of the electronic device 500, the presence or absence of user contact with the electronic device 500, orientation or acceleration/deceleration of the electronic device 500, and a change in the temperature of the electronic device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate communications between the electronic device 500 and other devices in a wired or wireless manner. The electronic device 500 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the electronic device 500 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method for generating a language recognition model, comprising:
generating a first base model according to a first text corpus, wherein the first text corpus comprises pre-collected labeled texts corresponding to user speech;
determining the vertical domain category to which each labeled text in the first text corpus belongs, and counting the number of texts corresponding to each vertical domain category;
determining a preset number of vertical domain categories with the largest numbers of texts as target vertical domain categories;
for each target vertical domain category, generating a target vertical domain category model corresponding to the target vertical domain category according to the labeled texts corresponding to the target vertical domain category; and
generating a language recognition model according to the first base model and each target vertical domain category model.
2. The method of claim 1, wherein generating the first base model according to the first text corpus comprises:
generating an online data model according to the first text corpus;
determining labeled texts in the first text corpus that correspond to misrecognized user speech, and generating a first error correction model according to the labeled texts corresponding to the misrecognized user speech; and
generating the first base model according to the online data model and the first error correction model.
3. The method of claim 2, wherein generating the first base model according to the first text corpus further comprises:
determining a sentence pattern of each labeled text in the first text corpus, and determining a sentence pattern whose occurrence frequency is greater than a preset threshold as a target sentence pattern;
acquiring a second text corpus constructed by the user according to the target sentence pattern, and generating a newly added data model according to the second text corpus;
wherein generating the first base model according to the online data model and the first error correction model comprises:
performing interpolation combination on the online data model, the first error correction model, and the newly added data model to generate the first base model.
4. The method of claim 1, further comprising:
generating a multi-vertical-domain category model according to labeled texts corresponding to vertical domain categories other than the target vertical domain categories;
wherein generating the language recognition model according to the first base model and each target vertical domain category model comprises:
generating a second base model according to the first base model and the multi-vertical-domain category model; and
generating the language recognition model according to the second base model and each target vertical domain category model.
5. The method of claim 4, further comprising:
acquiring hot spot resource texts within a preset time period, and generating a resource model according to the hot spot resource texts;
acquiring labeled texts corresponding to user speech misrecognized within the current time period, and generating a second error correction model according to the labeled texts corresponding to the user speech misrecognized within the current time period;
performing interpolation combination on the resource model and the second error correction model to generate a dynamic model;
wherein generating the language recognition model according to the first base model and each target vertical domain category model comprises:
generating the language recognition model according to the first base model, each target vertical domain category model, and the dynamic model.
6. The method of claim 5, wherein the dynamic model is updated more frequently than the second base model.
7. The method of claim 1, further comprising:
acquiring a requirement text corpus input by the user and related to a required service, and generating a service requirement model according to the requirement text corpus;
wherein generating the language recognition model according to the first base model and each target vertical domain category model comprises:
generating the language recognition model according to the first base model, each target vertical domain category model, and the service requirement model.
8. An apparatus for generating a language recognition model, comprising:
a first generation module configured to generate a first base model according to a first text corpus, wherein the first text corpus comprises pre-collected labeled texts corresponding to user speech;
a first determining module configured to determine the vertical domain category to which each labeled text in the first text corpus belongs, and count the number of texts corresponding to each vertical domain category;
a second determining module configured to determine a preset number of vertical domain categories with the largest numbers of texts as target vertical domain categories;
a second generation module configured to generate, for each target vertical domain category, a target vertical domain category model corresponding to the target vertical domain category according to the labeled texts corresponding to the target vertical domain category; and
a third generation module configured to generate a language recognition model according to the first base model and each target vertical domain category model.
9. A speech recognition system, comprising a feature extraction model, an acoustic model, a language recognition model, and a speech decoding and search model, wherein the language recognition model is generated according to the method of any one of claims 1-7.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
generate a first base model according to a first text corpus, wherein the first text corpus comprises pre-collected labeled texts corresponding to user speech;
determine the vertical domain category to which each labeled text in the first text corpus belongs, and count the number of texts corresponding to each vertical domain category;
determine a preset number of vertical domain categories with the largest numbers of texts as target vertical domain categories;
for each target vertical domain category, generate a target vertical domain category model corresponding to the target vertical domain category according to the labeled texts corresponding to the target vertical domain category; and
generate a language recognition model according to the first base model and each target vertical domain category model.
11. A computer-readable storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 7.
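To make the selection step of claim 1 concrete, the following is a minimal Python sketch of counting labeled texts per vertical domain category and keeping the preset number of categories with the largest counts. The corpus representation and the train_model/combine_models callbacks are illustrative assumptions; the claim does not prescribe any particular data structures or training procedure.

```python
from collections import Counter


def select_target_vertical_domains(labeled_texts, preset_number):
    """labeled_texts: iterable of (text, vertical_domain_category) pairs.

    Counts how many labeled texts fall into each vertical domain category and
    returns the preset_number categories with the largest text counts."""
    counts = Counter(category for _, category in labeled_texts)
    return [category for category, _ in counts.most_common(preset_number)]


def generate_language_recognition_model(labeled_texts, preset_number,
                                        train_model, combine_models):
    """train_model and combine_models are hypothetical callbacks standing in
    for the model-training and model-combination steps the claim leaves open."""
    # Step 1: first base model from the whole first text corpus.
    first_base_model = train_model([text for text, _ in labeled_texts])

    # Steps 2-3: pick the target vertical domain categories.
    targets = select_target_vertical_domains(labeled_texts, preset_number)

    # Step 4: one model per target vertical domain category.
    category_models = {
        c: train_model([t for t, cat in labeled_texts if cat == c])
        for c in targets
    }

    # Step 5: combine the base model with each target vertical domain model.
    return combine_models(first_base_model, list(category_models.values()))
```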
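Claims 2-7 repeatedly rely on "interpolation combination" of component models. The sketch below reads this as linear interpolation of n-gram probabilities with fixed weights and then assembles the second base model, the dynamic model, and the service requirement model into the final language recognition model. The dict-based model representation, the interpolation scheme, and every weight value are assumptions for illustration only; the claims do not specify them.

```python
def interpolate(models, weights):
    """models: list of dicts mapping an n-gram (tuple of words) to a probability.
    weights: non-negative floats summing to 1.

    Each n-gram's combined probability is the weighted sum of its probabilities
    in the component models (treated as 0 where the n-gram is absent)."""
    assert len(models) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-6
    ngrams = set()
    for m in models:
        ngrams.update(m)
    return {
        g: sum(w * m.get(g, 0.0) for m, w in zip(models, weights))
        for g in ngrams
    }


def build_first_base_model(online_data_model, first_error_correction_model,
                           newly_added_data_model):
    # Claim 3: interpolation combination of the three component models.
    # The weights are illustrative placeholders, not values from the patent.
    return interpolate(
        [online_data_model, first_error_correction_model, newly_added_data_model],
        [0.6, 0.2, 0.2],
    )


def build_language_recognition_model(first_base_model, multi_vertical_domain_model,
                                     target_vertical_domain_models,
                                     dynamic_model=None,
                                     service_requirement_model=None):
    # Claim 4: second base model from the first base model and the
    # multi-vertical-domain category model (weights again illustrative).
    second_base_model = interpolate(
        [first_base_model, multi_vertical_domain_model], [0.8, 0.2])

    # Collect every component that contributes to the final model.
    components = [second_base_model] + list(target_vertical_domain_models)
    if dynamic_model is not None:              # claim 5: hot spot resources + recent errors
        components.append(dynamic_model)
    if service_requirement_model is not None:  # claim 7: user-supplied requirement corpus
        components.append(service_requirement_model)

    # Equal weights are a placeholder; in practice they would be tuned.
    weights = [1.0 / len(components)] * len(components)
    return interpolate(components, weights)
```

Because the dynamic model enters the combination as a separate component, it can be regenerated on its own schedule (claim 6) and swapped in without retraining the base or vertical domain models.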
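Claim 9 places the generated language recognition model inside a conventional speech recognition pipeline. The sketch below shows one plausible arrangement of the four named components; all class and method names are hypothetical, since the claim names the components but not their interfaces.

```python
class SpeechRecognitionSystem:
    """One possible wiring of the components named in claim 9."""

    def __init__(self, feature_extractor, acoustic_model, language_model, decoder):
        self.feature_extractor = feature_extractor  # waveform -> acoustic features
        self.acoustic_model = acoustic_model        # features -> phone/state scores
        self.language_model = language_model        # generated per claims 1-7
        self.decoder = decoder                      # speech decoding and search model

    def transcribe(self, waveform):
        features = self.feature_extractor.extract(waveform)
        acoustic_scores = self.acoustic_model.score(features)
        # The decoder searches for the word sequence that best balances the
        # acoustic scores against the language recognition model's probabilities.
        return self.decoder.search(acoustic_scores, self.language_model)
```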
CN202211216345.1A 2022-09-30 2022-09-30 Method, device, system, equipment and medium for generating language recognition model Pending CN115547333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211216345.1A CN115547333A (en) 2022-09-30 2022-09-30 Method, device, system, equipment and medium for generating language recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211216345.1A CN115547333A (en) 2022-09-30 2022-09-30 Method, device, system, equipment and medium for generating language recognition model

Publications (1)

Publication Number Publication Date
CN115547333A true CN115547333A (en) 2022-12-30

Family

ID=84730974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211216345.1A Pending CN115547333A (en) 2022-09-30 2022-09-30 Method, device, system, equipment and medium for generating language recognition model

Country Status (1)

Country Link
CN (1) CN115547333A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115964498A (en) * 2023-03-08 2023-04-14 小米汽车科技有限公司 Vehicle-mounted semantic analysis model generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination