CN113593531B - Voice recognition model training method and system - Google Patents

Voice recognition model training method and system

Info

Publication number
CN113593531B
CN113593531B
Authority
CN
China
Prior art keywords: model, training, trained, user, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110874667.4A
Other languages
Chinese (zh)
Other versions
CN113593531A (en)
Inventor
温亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202110874667.4A
Publication of CN113593531A
Application granted
Publication of CN113593531B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech recognition model training method, comprising: determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained; acquiring a preset domain training data set uploaded by the user; and training the model to be trained based on the preset domain training data set. The user only needs to select the models to be trained as required and upload a training data set for the target domain to complete training of the selected models and thus obtain a trained speech recognition model. The user needs no knowledge of system algorithms or artificial intelligence, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.

Description

Voice recognition model training method and system
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and system for training a speech recognition model.
Background
With the growth of data volume, the increase of computing power, and advances in deep learning theory, speech recognition accuracy keeps improving and its range of applications keeps widening. Speech recognition applications may be interactive, such as a voice assistant on an in-vehicle head unit or mobile phone, which converts a user's speech into text the machine can understand so that the machine executes the corresponding task and gives feedback, enabling natural human-machine communication. There are also non-interactive applications, such as protecting driver and passenger safety through trip recording, and applications in customer service quality inspection, intelligent outbound calling, and other fields.
Taking interactive products as an example, speech recognition can now basically reach a word-level accuracy of 95%. But this still cannot meet ever-changing business needs. In particular, for newly added specialized vocabulary in a subdivided domain, such as English words, place names, and professional terms, no vendor's speech recognition model can meet the business requirement without targeted model tuning.
At present, speech recognition optimization generally performs model training and coefficient tuning during the research and development stage, with deployment only after testing passes. If new specialized data that the model cannot handle is encountered after deployment, model training, coefficient tuning, testing, and redeployment must all be repeated. Two problems arise here. First, retraining requires going back to the research and development stage, so errors introduced by human factors are unknown and the model cannot be iterated in real time, which is inefficient. Second, conventional training must be redone in batch mode, otherwise the model may fail to converge; the model therefore cannot be adjusted immediately each time specialized data is encountered, and retraining and redeployment can only happen after enough data has been collected, which cannot satisfy the business need for rapid recognition iteration. Since an optimization cycle can take several weeks or even months, the schedules of multiple business lines overlap and urgent demands occasionally arise, yet everything is handled by a limited number of speech engineers, so responses are slow and support is insufficient. In addition, communication costs are high and customers depend excessively on the speech vendor with no room to act on their own, which affects business progress and user experience.
Disclosure of Invention
The embodiments of the present invention provide a speech recognition model training method and system, to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, including:
determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;
acquiring a preset domain training data set uploaded by the user;
and training the model to be trained based on the preset domain training data set.
In a second aspect, an embodiment of the present invention provides a speech recognition model training system, including:
The model selection program module is used for determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;
the user data input program module is used for acquiring a preset domain training data set uploaded by the user;
and the model training program module is used for training the model to be trained based on the preset domain training data set.
In a third aspect, embodiments of the present invention provide a storage medium having stored therein one or more programs including execution instructions that are readable and executable by an electronic device (including, but not limited to, a computer, a server, or a network device, etc.) for performing any of the above-described speech recognition model training methods of the present invention.
In a fourth aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the above speech recognition model training methods of the present invention.
In a fifth aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any one of the above-described speech recognition model training methods.
According to the embodiments of the present invention, a model to be trained (an acoustic model and/or a language model and/or a hotword model to be trained) is determined according to a selection operation of the user, a preset domain training data set uploaded by the user is acquired, and the model to be trained is then trained based on that data set. The user only needs to select the models to be trained as required and upload a training data set for the target domain to complete training of the selected models and thus obtain a trained speech recognition model. The user needs no knowledge of system algorithms or artificial intelligence, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 2 is a flow chart of another embodiment of a speech recognition model training method of the present invention;
FIG. 3 is a functional block diagram of one embodiment of a speech recognition model training system of the present invention;
FIG. 4 is a functional block diagram of another embodiment of a speech recognition model training system of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of an electronic device of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the present invention, "module," "device," "system," and the like refer to a related entity applied to a computer: hardware, a combination of hardware and software, or software in execution. In particular, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. Likewise, both an application or script running on a server and the server itself may be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may be run from various computer readable media. Elements may also communicate by way of local and/or remote processes in accordance with a signal having one or more data packets, for example, a signal whereby data from one element interacts with another element in a local system, in a distributed system, and/or across a network such as the Internet with other systems.
Finally, it should also be noted that relational terms such as first and second are used solely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
As shown in FIG. 1, an embodiment of the present invention provides a speech recognition model training method, including:
S11, determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained.
S12, acquiring a preset domain training data set uploaded by the user.
Illustratively, acquiring the preset domain training data set uploaded by the user includes: detecting and acquiring the preset domain training data set uploaded by the user on an interactive interface, or detecting an acquisition request sent by the user through an API call and acquiring the preset domain training data set accordingly.
S13, training the model to be trained based on the preset domain training data set.
According to this method, a model to be trained is determined according to a selection operation of the user, a preset domain training data set uploaded by the user is acquired, and the model to be trained is then trained on that data set. The user only needs to select the models to be trained as required and upload a training data set for the target domain to complete training of the selected models and thus obtain a trained speech recognition model. The user needs no knowledge of system algorithms or artificial intelligence, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.
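A minimal sketch of this select-upload-train flow in Python, with hypothetical names throughout (the patent specifies the behavior, not an implementation):

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class ModelType(Enum):
    ACOUSTIC = "acoustic"
    LANGUAGE = "language"
    HOTWORD = "hotword"

@dataclass
class TrainingTask:
    model_types: List[ModelType]  # S11: models chosen by the user's selection operation
    dataset_path: str             # S12: preset domain training data set uploaded by the user

def run_training(task: TrainingTask) -> None:
    """S13: train each selected model on the user-supplied domain data."""
    for model_type in task.model_types:
        print(f"training {model_type.value} model on {task.dataset_path}")

# Example: the user selects a language model and a hotword model
# and uploads an aviation-domain corpus.
run_training(TrainingTask([ModelType.LANGUAGE, ModelType.HOTWORD],
                          "uploads/aviation_corpus.zip"))
```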
As shown in FIG. 2, which is a flowchart of another embodiment of the speech recognition model training method of the present invention, in this embodiment, training the model to be trained based on the preset domain training data set includes:
S131, for the acoustic model, processing the preset domain training data set to obtain an audio data set with text labels, so as to train the acoustic model;
S132, for the language model, processing the preset domain training data set to obtain a plain-text corpus or user-defined corpus templates, so as to train the language model;
S133, for the hotword model, processing the preset domain training data set to obtain a vocabulary set of the preset domain, so as to train the hotword model.
In this embodiment, the training data set is processed specifically for each model (acoustic model, language model, and hotword model), yielding data suited to the training of each model and facilitating rapid training. Because the preset domain training data set is processed separately according to the different data requirements of the different models, users can select just the models they need and train them accordingly. The speech recognition model training method provided by the embodiments of the present application therefore trains more flexibly and improves the pertinence and effectiveness of training.
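A rough sketch of this per-model data routing follows; the record fields (audio, text, term) are illustrative assumptions, since the patent fixes the target formats but not a storage schema:

```python
from typing import Dict, List

def prepare_training_data(model_type: str, raw_records: List[Dict]) -> list:
    """Route raw uploaded records into the format each model expects."""
    if model_type == "acoustic":
        # S131: keep only records pairing an audio file with its text label.
        return [(r["audio"], r["text"]) for r in raw_records
                if "audio" in r and "text" in r]
    if model_type == "language":
        # S132: keep plain-text sentences or user-defined corpus templates.
        return [r["text"] for r in raw_records if "text" in r]
    if model_type == "hotword":
        # S133: deduplicate the preset-domain vocabulary.
        return sorted({r["term"] for r in raw_records if "term" in r})
    raise ValueError(f"unknown model type: {model_type}")
```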
In some embodiments, the speech recognition model training method of the present invention further comprises: detecting a training mode selected by the user, wherein the training mode comprises an incremental training mode and a full training mode. Training the model to be trained based on the preset domain training data set then comprises: training the model to be trained in the training mode selected by the user based on the preset domain training data set.
In this embodiment, when performing model training the user may select a training mode (for example, incremental or full) according to actual requirements. When the user needs urgent deployment and the required recognition accuracy is relatively low, the incremental training mode enables rapid training and rapid deployment; when deployment is not urgent and high recognition accuracy is required, the full training mode guarantees the accuracy requirement.
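A sketch of this mode dispatch, with hypothetical function and parameter names:

```python
from typing import List, Optional

def train(model_type: str, dataset: List, mode: str = "incremental",
          base_model: Optional[str] = None) -> str:
    """Dispatch on the user-selected training mode."""
    if mode == "incremental":
        # Iterate on a historically trained model with only the new data:
        # faster, suited to urgent deployment with modest accuracy needs.
        if base_model is None:
            raise ValueError("incremental training needs a historical model")
        return f"{base_model} updated with {len(dataset)} new {model_type} samples"
    if mode == "full":
        # Retrain from scratch (optionally merging historical data sets):
        # slower, but guarantees the accuracy requirement.
        return f"new {model_type} model trained on {len(dataset)} samples"
    raise ValueError(f"unknown training mode: {mode}")
```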
In some embodiments, the speech recognition model training method of the present invention further comprises: testing the trained acoustic model, language model, or hotword model with a test audio data set uploaded in batch by the user.
In this embodiment, the user uploads a test audio data set in batch through the UI interface to test the trained acoustic model, language model, or hotword model, so as to determine the performance of each customized model and select the best-performing one. Batch upload makes the test efficient, and because every test audio is uploaded in batch without distinction, the test results are objective and accurate.
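Batch testing reduces to averaging a per-utterance score over the uploaded set. A self-contained sketch; `model.recognize` is an assumed interface, not an API defined by the patent:

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - WER, via Levenshtein distance over tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 1.0 - d[len(ref)][len(hyp)] / max(len(ref), 1)

def batch_test(model, test_set) -> float:
    """Average word accuracy of a trained model over (audio, label) pairs."""
    scores = [word_accuracy(label, model.recognize(audio))
              for audio, label in test_set]
    return sum(scores) / max(len(scores), 1)
```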
In some embodiments, the speech recognition model training method of the present invention further comprises: selecting a single test audio to test the trained acoustic model, language model, or hotword model.
In some embodiments, a plurality of acoustic models, a plurality of language models, and a plurality of hotword models are trained. In some embodiments, the speech recognition model training method of the present invention further comprises:
testing the plurality of acoustic models respectively with a test audio data set uploaded in batch by the user, to determine the best-performing acoustic model;
testing the plurality of language models respectively with a test audio data set uploaded in batch by the user, to determine the best-performing language model;
and testing the plurality of hotword models respectively with a test audio data set uploaded in batch by the user, to determine the best-performing hotword model.
In this embodiment, multiple models of each kind (acoustic, language, and hotword) are pre-trained, so that several models can be tested simultaneously in the test stage and the best-performing model selected, which improves the efficiency of model training and testing.
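Selecting the best candidate is then a small step over the shared batch test set, reusing the `batch_test` scorer sketched above; all names remain hypothetical:

```python
def pick_best(candidates: dict, test_set: list, score_fn) -> str:
    """Score every customized candidate model on the same batch test set
    and return the name of the best-performing one."""
    scores = {name: score_fn(model, test_set) for name, model in candidates.items()}
    return max(scores, key=scores.get)

# Example: best_lm = pick_best({"lm_v1": lm_v1, "lm_v2": lm_v2}, test_set, batch_test)
```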
FIG. 3 is a schematic block diagram of an embodiment of the speech recognition model training system of the present invention, which in this embodiment comprises:
The model selection program module 100 is configured to determine a model to be trained according to a selection operation of a user, where the model to be trained includes at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;
A user data input program module 200, configured to obtain a preset domain training data set uploaded by a user;
The model training program module 300 is configured to train the model to be trained based on the preset domain training data set.
With this speech recognition model training system, a model to be trained is determined according to a selection operation of the user, a preset domain training data set uploaded by the user is acquired, and the model to be trained is then trained on that data set. The user only needs to select the models to be trained as required and upload a training data set for the target domain to complete training of the selected models and thus obtain a trained speech recognition model. The user needs no knowledge of system algorithms or artificial intelligence, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.
In some embodiments, acquiring the preset domain training data set uploaded by the user includes: detecting and acquiring the preset domain training data set uploaded by the user on an interactive interface, or detecting an acquisition request sent by the user through an API call and acquiring the preset domain training data set accordingly.
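For the API path, a hedged sketch of what such an upload call could look like; the endpoint, fields, and response shape are assumptions, as the patent only states that an HTTP request carries the data set:

```python
import requests

# Hypothetical endpoint; the patent does not specify the actual route or fields.
UPLOAD_URL = "https://example.com/api/v1/training-data"

def upload_dataset(archive_path: str, model_type: str, api_key: str) -> str:
    """Upload a preset domain training data set over HTTP."""
    with open(archive_path, "rb") as f:
        resp = requests.post(
            UPLOAD_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            data={"model_type": model_type},
            files={"dataset": f},
        )
    resp.raise_for_status()
    return resp.json()["dataset_id"]  # assumed response field
```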
In some embodiments, training the model to be trained based on the preset domain training data set includes:
for the acoustic model, processing the preset domain training data set to obtain an audio data set with text labels, so as to train the acoustic model;
for the language model, processing the preset domain training data set to obtain a plain-text corpus or user-defined corpus templates, so as to train the language model;
and for the hotword model, processing the preset domain training data set to obtain a vocabulary set of the preset domain, so as to train the hotword model.
In some embodiments, the speech recognition model training system of the present invention further comprises:
The training mode detection module is used for detecting a training mode selected by the user, wherein the training mode comprises an incremental training mode and a full training mode. Training the model to be trained based on the preset domain training data set then comprises: training the model to be trained in the training mode selected by the user based on the preset domain training data set.
In some embodiments, the speech recognition model training system of the present invention further comprises: a first test module, configured to test the trained acoustic model, language model, or hotword model with a test audio data set uploaded in batch by the user.
In some embodiments, the speech recognition model training system of the present invention further comprises: a second test module, configured to select a single test audio to test the trained acoustic model, language model, or hotword model.
In some embodiments, a plurality of acoustic models, a plurality of language models, and a plurality of hotword models are trained. In some embodiments, the speech recognition model training system of the present invention further comprises: a third test module, configured to test the plurality of acoustic models respectively with a test audio data set uploaded in batch by the user, to determine the best-performing acoustic model; test the plurality of language models respectively with a test audio data set uploaded in batch by the user, to determine the best-performing language model; and test the plurality of hotword models respectively with a test audio data set uploaded in batch by the user, to determine the best-performing hotword model.
As shown in FIG. 4, which is a schematic block diagram of another embodiment of the speech recognition model training system of the present invention, in this embodiment the system comprises: a user data input module, a data preprocessing service module, a model training module, an automatic model evaluation test module, a model release online module, an online data acquisition module, and a voice labeling module. Wherein:
(1) User data input module:
The user can upload the data set corpus used for training the language model, the acoustic model, and the hotword model through the UI interface, or transmit the data set corpus by calling the API interface with an HTTP request. The corpus format requirements are as follows:
Corpus used for language model training must be plain-text corpus or user-defined corpus templates and entities. For example, an aviation scenario involves ticket booking services. From an utterance such as "Xiaoming wants to book an air ticket flying from Shanghai to Beijing", we can abstract a template: "{person name} wants to book {number} air tickets flying from {city} to {city}". In this template, "person name", "number", and "city" are slots, each of which may have a very large number of concrete entry entities that the user can customize. For example, person names: Zhang San, Li Si, etc.; numbers: one, two, etc.; cities: Beijing, Shanghai, Shenzhen, Guangzhou, etc.
The example above can be further extended and abstracted, i.e., the complexity of the template can be defined; the more complex the template, the more concrete speech samples it can cover. From an utterance like "Xiaoming plans to book a train ticket from Shanghai to Beijing", a template can be abstracted as: "{person name} {action} {operation} {number} {time} ticket from {city} to {city}". Each pair of braces {} represents a reserved slot, which can be filled with corresponding entity entries.
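To illustrate how such templates expand into concrete training sentences, here is a toy sketch; the slot names and entity entries follow the aviation example above, and everything else is an assumption:

```python
import itertools
import re

entities = {
    "person name": ["Zhang San", "Li Si"],
    "number": ["one", "two"],
    "city": ["Beijing", "Shanghai", "Shenzhen", "Guangzhou"],
}

def expand(template: str, entities: dict) -> list:
    """Fill every {slot} with each combination of its entity entries."""
    slots = re.findall(r"\{(.*?)\}", template)
    sentences = []
    for combo in itertools.product(*(entities[s] for s in slots)):
        text = template
        for slot, value in zip(slots, combo):
            text = text.replace("{" + slot + "}", value, 1)  # fill one slot at a time
        sentences.append(text)
    return sentences

corpus = expand("{person name} wants to book {number} tickets from {city} to {city}",
                entities)
# 2 * 2 * 4 * 4 = 64 concrete sentences for language model training
```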
Corpus used for acoustic model training must be an audio data set with text labels.
Corpus used for hotword model training must be a vocabulary set of the professional domain.
(2) Data preprocessing module:
Data input by users often has format problems, whereas the received data is used for language model, acoustic model, and hotword training, each of which has its own standard format requirements. Input data therefore needs to be processed, e.g., text normalization, word segmentation, audio format normalization, and annotation data processing.
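A minimal sketch of the text side of this preprocessing; real pipelines would add number expansion, word segmentation for Chinese, and audio resampling, none of which the patent pins down:

```python
import re
import unicodedata

def normalize_text(line: str) -> str:
    """Minimal text normalization before language model training."""
    line = unicodedata.normalize("NFKC", line)  # full-width -> half-width forms
    line = line.lower()                         # case folding
    return re.sub(r"\s+", " ", line).strip()    # collapse whitespace

print(normalize_text("Ｂｏｏｋ  TWO tickets\tto Beijing "))
# -> "book two tickets to beijing"
```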
(3) Model training module:
The model training module mainly covers language model self-training, acoustic model self-training, and hotword model self-training. A user can create a language model, an acoustic model, or a hotword model through the UI interface or the API; once created, each model carries a unique task ID. The user may then select language model, acoustic model, or hotword model training. The user can operate through the UI interaction interface or interact through the API to select the model to be trained, needs no knowledge of system algorithms or artificial intelligence, can complete recognition optimization more autonomously, and the threshold and cost of model training are reduced.
When selecting language model, acoustic model, or hotword model training, the user can also choose among training modes, such as incremental training or full training. In the incremental training mode, when the user triggers incremental training, a historically trained model can be selected, and the newly added data is used to iteratively optimize on top of that historical model. In the full training mode, the user can select historical training data sets for combined, superposed training, or train with only the current data set. The user may also customize model training parameters in the different training modes. The user thus not only has multiple angles from which to optimize speech recognition, but also room for deep optimization along a single angle.
For example, for the acoustic model the user may define the training pattern, which gives more freedom to trade off training time against training effect. With the user's input data unchanged, three patterns are supported: the first has a short training time and a modest optimization effect; the second takes slightly longer with a moderate effect; the third takes a long time with a good effect. Separately, the user chooses between incremental and full training: adding data on top of a historically trained model, or training again from scratch. More detailed low-level algorithm parameters are not exposed, which avoids unnecessary interference with the user.
For the language model, the user is free to choose whether to interpolate the model with a large base model. We therefore expose several interpolation parameters, mainly: whether to interpolate, the interpolation coefficient, whether to prune, and the pruning coefficient. These parameters indicate whether to interpolate with the large model and in what proportion, and whether and in what proportion the model obtained after interpolation is pruned.
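A toy unigram stand-in for this interpolation and pruning; real systems interpolate n-gram models, and the coefficients here are the user-exposed parameters just described:

```python
def interpolate(custom_lm: dict, base_lm: dict, coeff: float = 0.5) -> dict:
    """Linear interpolation: p = coeff * p_custom + (1 - coeff) * p_base."""
    vocab = set(custom_lm) | set(base_lm)
    return {w: coeff * custom_lm.get(w, 0.0) + (1 - coeff) * base_lm.get(w, 0.0)
            for w in vocab}

def prune(lm: dict, threshold: float = 1e-6) -> dict:
    """Drop entries whose probability falls below the pruning coefficient."""
    return {w: p for w, p in lm.items() if p >= threshold}

domain_lm = prune(interpolate({"flight": 0.4, "ticket": 0.6},
                              {"flight": 0.1, "the": 0.9}, coeff=0.7))
```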
For the hotword model, hotwords abstract the slots, and the user can upload a custom vocabulary for each slot.
(4) Model evaluation test module:
Model evaluation testing mainly includes objective tests, subjective tests, and comparison tests.
In an objective test, the user can upload a test audio data set in batch and select recognition tests against different customized language models, acoustic models, or hotword models, so as to obtain quantified speech recognition accuracy under different situations.
The speech recognition model training system in the embodiments of the present application is a model self-training system. Its significance is that users can collect data from their own business scenarios, autonomously train language, acoustic, and hotword models, and complete user-level customized optimization of speech recognition. For example, some users' business scenarios involve heavy noise, dialects, mixed Chinese and English speech, many special names, or special speaking styles such as court trials and interrogations. In these cases general-purpose speech recognition cannot perform well and customized, personalized optimization is required. With this system, recognition can be tailored to every user, and users can complete the work autonomously at a low threshold without professional R&D personnel.
When selecting recognition tests of different customized language, acoustic, or hotword models, suppose the user has customized a language model, an acoustic model, and a hotword model. After customization, the user can upload a test data set and select the language model, acoustic model, and hotword model for testing separately, so as to measure which optimization brings the most obvious improvement. The user can also select all three at once for a joint test of the recognition improvement brought by the three dimensions acting together.
In a subjective test, the user can upload a single test audio and select recognition tests against different customized language, acoustic, or hotword models, so as to obtain speech recognition results under different situations and directly, intuitively verify the effectiveness of the self-trained model.
In a comparison test, the user can upload an audio data set in batch for recognition testing. Before testing, the data is divided into two test groups and different self-trained language, acoustic, or hotword models are selected, so as to test the difference in recognition effect between the two self-trained models and conveniently pick the better one for use.
Illustratively, multiple models are trained during the training phase, e.g., two language models, two acoustic models, and two hotword models. When performing model training the user can start from three dimensions: the training type can be language model, acoustic model, or hotword model. Suppose the user selects the language model training type; the user can customize several language models under this type and then, through model testing, compare the customized language models to see which is better. Similarly, multiple acoustic models and hotword models can be customized.
(5) Model release online module:
The model release online module deploys the user's self-trained language, acoustic, and hotword models online. In actual production, the recognition service and the training service are often deployed on different machines in the same cluster; the user produces a model through the training service, and the self-trained model resources are stored on the server hosting the training service. That is the production process. For the recognition service to consume the model, the produced model must be released and deployed onto the machine hosting the recognition service. The user can therefore trigger a release through the UI interface or the API, and a background program synchronizes the model to the recognition machine, completing release and deployment online.
(6) On-line data acquisition module:
The online data acquisition module mainly does two things. First, data is filtered according to a certain discrimination strategy and threshold, and the filtered data is passed directly to the acoustic model training system. Second, online data is imported into the labeling system module according to a certain strategy for accurate data labeling.
Better model optimization usually requires a large amount of labeled data for support; however, labeled data takes a long time and a high cost to acquire. To enable faster model optimization even with little labeled data, a semi-supervised training mode is provided that makes full use of the large amount of unlabeled online data for rapid self-training optimization of the model.
For example, for a new business scenario, a timer first periodically pulls online voice data from a database table and feeds it into a data collection module composed of multiple recall models and an optional discrimination strategy, which collects the higher-quality recalled speech together with its corresponding pseudo labels.
The main point is that although large amounts of voice data can be obtained online, accurate labels for that data are not easily obtained. To optimize speech recognition performance with low data resources, unsupervised or semi-supervised learning is an effective approach, and a semi-supervised mode is adopted here.
First, several groups of data sets are randomly sampled from a small amount of existing labeled data, and several initial acoustic recall models are trained through the standard training process.
Second, the obtained initial acoustic recall models are used to recognize and decode the unlabeled online voice data, yielding recognized pseudo labels. A label can be stored either as the single best result or as multiple candidate results; here we store the best result.
Then, because there are multiple acoustic recall models, the pseudo labels obtained from recognition decoding contain many recognition errors. A discrimination strategy is therefore defined that combines confidence and perplexity to screen the automatically generated pseudo labels, keeping only the relatively reliable recognition results.
Confidence: data is screened in turn according to the posterior probability or likelihood score from decoding. Each sentence is given a confidence score, a score threshold is set, and sentences with high confidence are selected.
Perplexity: data is selected by computing the perplexity of the decoding result under the initial language model. A perplexity threshold is set, and sentences with lower perplexity are selected.
The sentence confidence criterion selects data by the reliability of the decoded text as produced by the multiple acoustic models, while the perplexity criterion selects data by how well the decoded text matches the language model. Because the two criteria rest on different principles, they complement each other; combining the confidence and perplexity selection strategies, then mixing and deduplicating the data selected by the two methods, exploits this complementarity and makes data selection more reliable.
The recall model is the acoustic model corresponding to the confidence criterion in the discrimination strategy: data is screened in turn by the posterior probability in the decoded result against a designed confidence threshold, e.g., 0.75. Utterances below the threshold are sent for manual labeling; those above it are selected for model training.
The data recalled from online is added to the model as new training data, and model optimization and parameter adjustment are carried out automatically against the test set provided by the business side. Finally, while the optimized model delivers capability output, it is also returned to the data acquisition recall module for data screening, and updating the recall model improves the data quality of the next recall. (Here, capability output means that data selected by the data screening system enters the model training system and is used to produce the trained acoustic model; that model is used in the business scenario and also serves as a recall model.) Each optimized acoustic model obtained is updated into the data acquisition recall module and treated as one acoustic recall model. To improve the quality of the data the acquisition module recalls, pseudo-label prediction at recall time does not rely on a single selected model; instead, multiple models similar to the target scenario are selected, and data selection and pseudo-label prediction are performed according to a certain similarity within the specified thresholds. This ensures that the data quality brings an effective improvement to model training performance, and it also increases the diversity of training samples, making the model more robust during training.
Illustratively, a single model refers to a neural network model whose input is speech and whose output is pseudo labels for that speech; the acoustic recall model mentioned above is likewise a neural network model.
Illustratively, the specified thresholds include a confidence threshold and a perplexity threshold. Under the confidence criterion the similarity threshold can be set and adjusted, defaulting to 0.75; under the perplexity criterion the threshold can likewise be adjusted, defaulting to 100.
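A minimal sketch of this combined screening using the default thresholds above; the `Decoded` record and the routing to manual labeling are assumptions, since the patent describes the strategy rather than a data structure:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Decoded:
    audio: str         # path to the unlabeled online utterance
    pseudo_label: str  # best hypothesis from the acoustic recall models
    confidence: float  # posterior-based confidence of the hypothesis
    perplexity: float  # perplexity under the initial language model

def screen(decoded: List[Decoded], conf_threshold: float = 0.75,
           ppl_threshold: float = 100.0) -> Tuple[list, list]:
    """Keep utterances passing both criteria for training;
    route the rest to manual labeling."""
    for_training, for_manual_labeling = [], []
    for d in decoded:
        if d.confidence >= conf_threshold and d.perplexity <= ppl_threshold:
            for_training.append((d.audio, d.pseudo_label))
        else:
            for_manual_labeling.append(d.audio)
    return for_training, for_manual_labeling
```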
(7) Voice labeling system module:
When users autonomously optimize and self-train the acoustic model, a large amount of labeled data is usually needed for support. So that users can directly use labeled data for acoustic model training, this scheme provides a voice data labeling system module. A user can upload large amounts of voice data through the UI interface or API interface to issue a data labeling task, which is distributed to labeling personnel. After labeling is finished, the labeled task data is pushed to data review, and data that passes review can be used directly for language model or acoustic model training.
The invention realizes self-training of language, acoustic, and hotword models. The direct effect is to lower the threshold for users to autonomously optimize speech recognition performance and to make that optimization efficient and fast. The user does not need to understand the deeper speech recognition algorithm logic or the underlying optimization pipeline; they only need to focus on the business scenario and what real data it can provide. By collecting data related to the business scenario and uploading it to the system, scenario optimization can be completed from multiple angles. At a deeper level, the whole self-training system suits a wide range of business scenarios, and every pipeline module runs as an independent microservice, which makes secondary development convenient for users. In addition, the modules along the system pipeline are loosely coupled, which makes problems easier for users to locate and strengthens service stability. In designing the system pipeline we follow the principle of high cohesion and low coupling: language model training, acoustic model training, and hotword model training all optimize speech recognition from different dimensions, are independent of one another, and the user can selectively deploy one or more of them. Each optimization module therefore exists independently as a business API service; and since all three are compute-intensive, isolating them also benefits service stability and maintenance.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the order of actions described, as some steps may be performed in other orders or concurrently according to the present invention. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention. Each embodiment is described with its own emphasis, and for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
In some embodiments, embodiments of the present invention provide a non-transitory computer readable storage medium having stored therein one or more programs comprising execution instructions that are readable and executable by an electronic device (including, but not limited to, a computer, a server, or a network device, etc.) for performing any of the above-described speech recognition model training methods of the present invention.
In some embodiments, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above-described speech recognition model training methods.
In some embodiments, the present invention further provides an electronic device, including: the system comprises at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech recognition model training method.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the program when executed by a processor implements a speech recognition model training method.
The speech recognition model training system of the embodiments of the present invention can be used to execute the speech recognition model training method of the embodiments of the present invention, and correspondingly achieves the technical effects of that method, which are not repeated here. In embodiments of the present invention, the related functional modules may be implemented by a hardware processor.
FIG. 5 is a schematic diagram of the hardware structure of an electronic device for performing a speech recognition model training method according to another embodiment of the present application. As shown in FIG. 5, the device includes:
one or more processors 510 and a memory 520, one processor 510 being illustrated in fig. 5.
The apparatus for performing the speech recognition model training method may further include: an input device 530 and an output device 540.
The processor 510, memory 520, input device 530, and output device 540 may be connected by a bus or other means, for example in fig. 5.
The memory 520 is a non-volatile computer readable storage medium and may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as the program instructions/modules corresponding to the speech recognition model training method in the embodiments of the present application. The processor 510 executes the various functional applications and data processing of the server, i.e., implements the speech recognition model training method of the above method embodiments, by running the non-volatile software programs, instructions, and modules stored in the memory 520.
Memory 520 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created from the use of the speech recognition model training device, etc. In addition, memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 520 optionally includes memory located remotely from processor 510, which may be connected to the speech recognition model training device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to user settings and function control of the speech recognition model training device. The output 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 that, when executed by the one or more processors 510, perform the speech recognition model training method of any of the method embodiments described above.
The above product can execute the method provided by the embodiments of the present application and has the corresponding functional modules and beneficial effects. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.
The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capability and are aimed primarily at providing voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, functional phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also offer mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., iPad.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPod), handheld game consoles, e-books, smart toys, and portable car navigation devices.
(4) Servers, which provide computing services. A server is composed similarly to a general-purpose computer architecture, but because highly reliable services must be provided, it places high demands on processing capacity, stability, reliability, security, scalability, and manageability.
(5) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general-purpose hardware platform, or by hardware. Based on such understanding, the above technical solution, in essence or in the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the method described in each embodiment or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A speech recognition model training method, comprising:
determining a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;
acquiring a preset domain training data set uploaded by the user; and
training the model to be trained based on the preset domain training data set;
wherein acquiring the preset domain training data set uploaded by the user comprises:
randomly sampling, by an online data acquisition module, several groups of data sets from labeled data, and training several initial acoustic recall models through a standard training process;
performing recognition decoding on unlabeled online voice data using the obtained initial acoustic recall models to obtain recognized pseudo labels;
screening the automatically generated pseudo labels by combining confidence and perplexity;
wherein the acoustic recall models correspond to the confidence criterion in a discrimination strategy, and data is screened in turn according to the posterior probability in the decoding result;
adding the data recalled online to the model as new training data, and automatically performing model optimization and parameter adjustment according to a test set provided by a business side; and finally, while the optimized model delivers capability output, returning it to a data acquisition recall module for data screening, and optimizing the data quality of the next recall by updating the recall model;
wherein, during data recall, pseudo-label prediction is not performed with a single selected model; instead, with multiple models similar to a target scenario selected, data selection and pseudo-label prediction are performed according to a certain similarity within specified thresholds.
2. The method according to claim 1, wherein acquiring the preset domain training data set uploaded by the user comprises:
detecting and acquiring the preset domain training data set uploaded by the user on an interactive interface, or detecting an acquisition request sent by the user through an API call and acquiring the preset domain training data set accordingly.
3. The method of claim 1, wherein training the model to be trained based on the preset domain training data set comprises:
for the acoustic model, processing the preset domain training data set to obtain an audio data set with text labels, so as to train the acoustic model;
for the language model, processing the preset domain training data set to obtain a plain-text corpus or user-defined corpus templates, so as to train the language model;
and for the hotword model, processing the preset domain training data set to obtain a vocabulary set of the preset domain, so as to train the hotword model.
4. The method as recited in claim 1, further comprising: detecting a training mode selected by the user, wherein the training mode comprises an incremental training mode and a full training mode;
wherein training the model to be trained based on the preset domain training data set comprises: training the model to be trained in the training mode selected by the user based on the preset domain training data set.
5. The method of any one of claims 1-4, further comprising: testing the trained acoustic model, language model, or hotword model with a test audio data set uploaded in batch by the user.
6. The method of any one of claims 1-4, further comprising: selecting a single test audio to test the trained acoustic model, language model, or hotword model.
7. The method of any one of claims 1-4, wherein a plurality of acoustic models, a plurality of language models, and a plurality of hotword models are trained;
the method further comprising:
testing the plurality of acoustic models using a test audio data set uploaded in batches by the user, to determine the acoustic model with optimal performance;
testing the plurality of language models using a test audio data set uploaded in batches by the user, to determine the language model with optimal performance;
and testing the plurality of hotword models respectively using a test audio data set uploaded in batches by the user, to determine the hotword model with optimal performance.
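Claims 5-7 describe batch testing and best-model selection. Below is a sketch of that selection, scoring each candidate by average word error rate over the batch-uploaded test set; `recognize()` is an assumed decoding method, while the WER routine is the standard word-level Levenshtein edit distance.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance between reference and hypothesis,
    normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)

def pick_best_model(models, test_set):
    """Decode the test set with every candidate and keep the lowest-WER model."""
    def avg_wer(model):
        return sum(word_error_rate(ref, model.recognize(audio))
                   for audio, ref in test_set) / len(test_set)
    return min(models, key=avg_wer)
```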
8. A speech recognition model training system, comprising:
a model selection program module, configured to determine a model to be trained according to a selection operation of a user, wherein the model to be trained comprises at least one of an acoustic model to be trained, a language model to be trained, and a hotword model to be trained;
a user data input program module, configured to acquire a preset domain training data set uploaded by the user;
a model training program module, configured to train the model to be trained based on the preset domain training data set;
wherein the acquiring of the preset domain training data set uploaded by the user comprises:
randomly sampling, by an online data acquisition module, a plurality of groups of data sets from labeled data, and training a plurality of initial acoustic recall models through a standard training process;
performing recognition decoding on unlabeled online voice data using the multiple initial acoustic recall models to obtain recognized pseudo labels;
screening the automatically generated pseudo labels by jointly considering a confidence measure and a confusion measure;
wherein each acoustic recall model corresponds to a confidence criterion in the discrimination strategy, and data are screened in order of the posterior probabilities in the decoding results;
adding the data recalled online to the training data as new training data, and automatically optimizing and tuning the model according to the test set provided by the service side; finally, putting the optimized model back into the data acquisition recall module for data screening while its capability is output, so that updating the recall model improves the data quality of the next recall;
wherein, during data recall, pseudo-label prediction is not performed by a single model; instead, a plurality of models similar to the target scene are selected, and data selection and pseudo-label prediction are performed, according to their similarity, under a specified threshold.
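Structurally, the three program modules of claim 8 compose as below; the module interfaces (`select`, `acquire`, `train`) are invented for this sketch.

```python
class SpeechRecognitionModelTrainingSystem:
    """Wiring of the three program modules of claim 8 (hypothetical interfaces)."""

    def __init__(self, model_selector, data_input, trainer):
        self.model_selector = model_selector  # model selection program module
        self.data_input = data_input          # user data input program module
        self.trainer = trainer                # model training program module

    def run(self, user_selection, upload):
        model = self.model_selector.select(user_selection)  # acoustic / language / hotword
        dataset = self.data_input.acquire(upload)           # preset-domain training data
        return self.trainer.train(model, dataset)
```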
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-7.
10. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any one of claims 1-7.
CN202110874667.4A 2021-07-30 2021-07-30 Voice recognition model training method and system Active CN113593531B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110874667.4A CN113593531B (en) 2021-07-30 2021-07-30 Voice recognition model training method and system

Publications (2)

Publication Number Publication Date
CN113593531A (en) 2021-11-02
CN113593531B (en) 2024-05-03

Family

ID=78252916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110874667.4A Active CN113593531B (en) 2021-07-30 2021-07-30 Voice recognition model training method and system

Country Status (1)

Country Link
CN (1) CN113593531B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764281A (en) * 2018-04-18 2018-11-06 华南理工大学 An image classification method for cross-task deep networks based on semi-supervised self-paced learning
CN109637525A (en) * 2019-01-25 2019-04-16 百度在线网络技术(北京)有限公司 Method and apparatus for generating vehicle-mounted acoustic model
CN110852075A (en) * 2019-10-08 2020-02-28 厦门快商通科技股份有限公司 Voice transcription method and device for automatically adding punctuation marks and readable storage medium
CN111369978A (en) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN112163634A (en) * 2020-10-14 2021-01-01 平安科技(深圳)有限公司 Example segmentation model sample screening method and device, computer equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11011162B2 (en) * 2018-06-01 2021-05-18 Soundhound, Inc. Custom acoustic models

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Semi-supervised Data Selection Based on Multiple Decoding Candidates for Speech Recognition"; Wang Xilou et al.; Pattern Recognition and Artificial Intelligence; Vol. 31, No. 7; pp. 662-667 *
Qiu Xipeng et al. Neural Networks and Deep Learning. China Machine Press, 2020, pp. 240-241. *

Similar Documents

Publication Publication Date Title
CN107978311B (en) Voice data processing method and device and voice interaction equipment
CN108877791B (en) Voice interaction method, device, server, terminal and medium based on view
US10643632B2 (en) Automated voice assistant personality selector
JP2019079034A (en) Dialog system with self-learning natural language understanding
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111261151B (en) Voice processing method and device, electronic equipment and storage medium
US20200125967A1 (en) Electronic device and method for controlling the electronic device
CN110222827A (en) The training method of text based depression judgement network model
CN105869633A (en) Cross-lingual initialization of language models
CN110489521B (en) Text type detection method and device, electronic equipment and computer readable medium
CN107515857B (en) Semantic understanding method and system based on customization technology
JP6728319B2 (en) Service providing method and system using a plurality of wake words in an artificial intelligence device
US11315547B2 (en) Method and system for generating speech recognition training data
CN111179915A (en) Age identification method and device based on voice
CN110992937B (en) Language off-line identification method, terminal and readable storage medium
CN111312233A (en) Voice data identification method, device and system
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN109271503A (en) Intelligent answer method, apparatus, equipment and storage medium
CN115798518A (en) Model training method, device, equipment and medium
CN108322770A (en) Video frequency program recognition methods, relevant apparatus, equipment and system
CN112837683B (en) Voice service method and device
CN112784024B (en) Man-machine conversation method, device, equipment and storage medium
CN113593531B (en) Voice recognition model training method and system
CN116052646B (en) Speech recognition method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant