CN111798853A - Method, device, equipment and computer readable medium for speech recognition - Google Patents


Info

Publication number
CN111798853A
CN111798853A (application CN202010230681.6A)
Authority
CN
China
Prior art keywords
text
classifier
voice
user
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010230681.6A
Other languages
Chinese (zh)
Inventor
蒋丽娟 (Jiang Lijuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010230681.6A priority Critical patent/CN111798853A/en
Publication of CN111798853A publication Critical patent/CN111798853A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method, an apparatus, a device, and a computer-readable medium for speech recognition, and relates to the field of computer technology. One embodiment of the method comprises: receiving input speech of a user; converting the input speech of the user into speech text of the user; and inputting the speech text of the user into a speech text classifier and receiving the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts. This embodiment can improve the accuracy of speech recognition.

Description

Method, device, equipment and computer readable medium for speech recognition
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a computer-readable medium for speech recognition.
Background
In recent years, speech recognition technology has matured, enabling people to interact with machines through natural language, for example by controlling household appliances and automobiles with voice commands, which greatly facilitates daily life.
In the process of implementing the invention, the inventor found that the prior art has at least the following problem: recognition accuracy remains low for the specific speech expressions used by users.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a device, and a computer-readable medium for speech recognition, which can improve accuracy of speech recognition.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of speech recognition, including:
receiving input speech of a user;
converting the input speech of the user into speech text of the user;
inputting the speech text of the user into a speech text classifier, and receiving the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts.
Before inputting the speech text of the user into the speech text classifier, the method further includes:
training an original classifier with a test set comprising user speech texts to obtain a classifier that passes the test;
and training the classifier that passes the test with the practice speech texts; if the text recognition parameters of the practice speech texts output by the classifier that passes the test are consistent with the text training parameters of the practice speech texts, the speech text classifier is obtained.
Training the original classifier with a test set comprising a plurality of user speech texts to obtain a classifier that passes the test includes the following steps:
after training the original classifier with a training set comprising user speech texts of the same category, obtaining the F1 value of the original classifier trained on that category of user speech texts, where the F1 value is used to measure the classification effect of the classifier;
and if the weighted F1 value of the original classifier trained on the various categories of user speech texts is greater than the weighted F1 value of the speech text classifier, taking the original classifier trained on all categories of user speech texts as the classifier that passes the test.
The practice speech texts are constructed from resource names combined with common on-demand sentence patterns.
The practice speech texts include on-demand user speech texts in the music domain and on-demand user speech texts in the audio book domain.
The categories of the text recognition parameters include one or more of the following: domain, intent, and slot value.
The practice speech texts include one or more of the following: high-frequency user speech texts, on-demand user speech texts, and historical-problem user speech texts.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for speech recognition, including:
a receiving module, configured to receive input speech of a user;
a conversion module, configured to convert the input speech of the user into speech text of the user;
and a processing module, configured to input the speech text of the user into a speech text classifier and receive the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts.
The processing module is further configured to train an original classifier with a test set comprising user speech texts to obtain a classifier that passes the test;
and to train the classifier that passes the test with the practice speech texts; if the text recognition parameters of the practice speech texts output by the classifier that passes the test are consistent with the text training parameters of the practice speech texts, the speech text classifier is obtained.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for speech recognition, including:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method as described above.
One embodiment of the above invention has the following advantage or benefit: input speech of a user is received; the input speech is converted into speech text of the user; the speech text is input into a speech text classifier, and the text recognition parameters of the speech text output by the speech text classifier are received so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts. Training of the classifier is thus based not only on text training parameters but also on practice speech texts, which cover the specific speech texts used by users. Therefore, in addition to improving the overall accuracy of speech recognition, the specific speech used by users can be recognized accurately.
Further effects of the above non-conventional alternatives are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a method of speech recognition according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application scenario of a method for speech recognition according to an embodiment of the present invention;
FIG. 3 is a schematic flow diagram of training a classifier according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating training of an original classifier to obtain a classifier that passes a test according to an embodiment of the present invention;
FIG. 5 is a schematic flow diagram for training a classifier using a practice speech text according to an embodiment of the present invention;
fig. 6 is a schematic diagram of the main structure of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The process of speech recognition involves a classifier. The role of the classifier is to train on data with given category labels, learn domain classification rules, and then assign the corresponding category to a given input. The given input may be text converted from the user's speech by speech recognition technology.
The classification effect of a classifier is typically measured by its F1 value, which is defined per category in terms of precision and recall. That is, the F1 value is used to measure the classification effect of the classifier. The F1 value is generally used to evaluate the overall merits of different algorithms, and is calculated as follows: F1 value = 2 × precision × recall / (precision + recall).
Precision refers to the proportion of individuals predicted to belong to a certain category that actually belong to it. As an example, there are 100 balls: 50 red balls and 50 green balls. The task is to pick out red balls, and 30 balls are taken out of the 100, of which 20 are red. The precision of picking out red balls is then 20/30 ≈ 0.667, i.e., the number of correctly extracted samples / the total number of extracted samples.
Recall refers to the proportion of individuals in a data set that belong to a certain category and are correctly predicted. Continuing the example: of the 50 red balls, 20 were among the 30 balls taken out, so the recall of picking out red balls is 20/50 = 0.4, i.e., the number of correctly extracted samples / the total number of samples with the target feature.
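As a minimal sketch of these definitions (the numbers mirror the ball example above, and all names are illustrative assumptions), in Python:

def f1_score(precision, recall):
    # F1 value = 2 * precision * recall / (precision + recall)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 100 balls: 50 red, 50 green. 30 balls are extracted, 20 of them red.
extracted_correct = 20   # correctly extracted samples (red balls taken out)
extracted_total = 30     # all extracted samples
relevant_total = 50      # all samples with the target feature (red balls)

precision = extracted_correct / extracted_total   # 20/30 ≈ 0.667
recall = extracted_correct / relevant_total       # 20/50 = 0.4
print(f1_score(precision, recall))                # ≈ 0.5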
Using the F1 value on a fixed test data set, the effect of a classifier iteration can be measured in the overall statistical dimension, but the classification effect for an individual cannot be judged. As an example, after one tuning of the classifier, the classification of individual A is reasonable; after the next tuning, the F1 value rises, yet the classification effect for individual A may well become worse. When individual classification effects do not matter, the F1 value is an effective measure. But in other cases, especially where individuals are not equally important, for example when some individuals matter more and their classification effect must be preserved in subsequent iterations of the classifier, a rise in the overall F1 value accompanied by a worse classification of such individuals means the test cannot be passed.
In the process of speech recognition, even though the F1 value of the classifier improves, recognition accuracy remains low for the specific speech used by users.
To solve the problem that recognition accuracy remains low for the specific speech used by users, the following technical scheme in the embodiments of the present invention may be adopted.
Referring to fig. 1, fig. 1 is a schematic diagram of the main flow of a speech recognition method according to an embodiment of the present invention: a speech text classifier is obtained by training with text training parameters and practice speech texts, and the text recognition parameters output by the speech text classifier are received so as to execute the corresponding operation. As shown in fig. 1, the method specifically comprises the following steps:
referring to fig. 2, fig. 2 is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present invention, where fig. 2 includes a user, a terminal, and a server. The user inputs voice to the terminal, and the terminal can forward the received input voice of the user to the server, and the server performs voice recognition. And after the voice recognition is successful, the server feeds back the operation corresponding to the input voice to the terminal. It is understood that the server in fig. 2 is an execution subject that executes the steps in fig. 1.
S101, receiving input speech of a user.
The user can input speech to the server through the terminal. As one example, an application (APP) is preinstalled in the terminal. After the user opens the APP, speech may be input to the APP. After receiving the user's input speech, the APP can send it to the server through the network, and the server receives and recognizes the input speech.
It is understood that the input speech of the user may be a real-time recording or an existing recording. As an example, when a user inputs speech to the APP in the terminal, the speech may be speech uttered by the user at that moment, i.e., a real-time recording; or it may be a recording the user selects from those stored in the terminal, i.e., a recording made earlier.
And S102, converting the input speech of the user into speech text of the user.
Automatic Speech Recognition (ASR) technology aims to convert the lexical content of speech into machine-readable input, such as key presses, binary codes, or character sequences.
Speech recognition typically includes preprocessing, feature extraction, pattern matching, and so on. First, the input speech is preprocessed, where preprocessing includes framing, windowing, pre-emphasis, and the like. Second, features are extracted, so the selection of suitable feature parameters is particularly important. Commonly used feature parameters include: pitch period, formants, short-term average energy or amplitude, linear prediction coefficients (LPC), perceptual linear prediction (PLP) coefficients, short-term average zero-crossing rate, linear prediction cepstral coefficients (LPCC), autocorrelation functions, Mel-frequency cepstral coefficients (MFCC), wavelet transform coefficients, empirical mode decomposition (EMD) coefficients, gammatone filter cepstral coefficients (GFCC), and the like. During actual recognition, a template is generated for the test speech following the same process as in training, and recognition is finally performed according to a distortion decision criterion. Common distortion decision criteria include Euclidean distance, covariance matrix, Bayesian distance, and the like.
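As an illustration of the preprocessing and feature-extraction steps described above, the following sketch computes MFCC features. The patent does not prescribe any toolkit, so the use of librosa and every parameter value here (sample rate, frame length, 13 coefficients) are assumptions:

import librosa

def extract_mfcc(path):
    y, sr = librosa.load(path, sr=16000)   # load audio, resample to 16 kHz
    y = librosa.effects.preemphasis(y)     # pre-emphasis (preprocessing)
    # Framing and windowing are handled inside the MFCC computation:
    # 25 ms frames (n_fft=400) with a 10 ms hop (hop_length=160).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)
    return mfcc                            # shape: (13, number_of_frames)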
In the embodiment of the invention, the input speech of the user can be converted into the speech text of the user through speech recognition technology. The speech text is the textual content of the user's input speech that expresses the user's need.
As an example, the user's speech text is: I want to drink a cup of coffee. As another example, the user's speech text is: play a song by singer A. It can be understood that the speech text contains the user's need, and the user's need can be obtained from the speech text.
In the embodiment of the present invention, for a terminal based on speech recognition technology, such as a smart speaker, the speech text of a user interacting with the terminal may be referred to as user speech text.
S103, inputting the speech text of the user into a speech text classifier, and receiving the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts.
In the embodiment of the invention, the classifier provides an interface whose input is the speech text of the user and whose output is the text recognition parameters of the speech text. The text recognition parameters are parameters that characterize the speech text. As one example, the categories of text recognition parameters may include one or more of: domain, intent, and slot value.
The domain is a parameter characterizing the scope to which the speech text belongs. The intent is a parameter characterizing the purpose of the speech text. The slot value is a parameter characterizing key features of the speech text. As an example, for the speech text "play a song by singer A", the domain is: music; the intent is: play general music; the slot value is: singer: singer A.
After the classifier outputs the text recognition parameters of the speech text, the corresponding operation can be executed according to those parameters. As an example, for the speech text "play a song by singer A", the corresponding operation is executed according to the slot value, i.e., playing the songs of singer A.
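The text recognition parameters can be pictured as a small record holding domain, intent, and slot values. The following minimal sketch is an assumption for illustration, not the patent's actual interface:

from dataclasses import dataclass, field

@dataclass
class TextRecognitionParams:
    domain: str              # e.g. "music"
    intent: str              # e.g. "play general music"
    slots: dict = field(default_factory=dict)  # e.g. {"singer": "singer A"}

params = TextRecognitionParams(
    domain="music",
    intent="play general music",
    slots={"singer": "singer A"},
)
# The corresponding operation is then executed from these parameters,
# e.g. playing the songs of params.slots["singer"].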
The classifier has a corresponding training set and test set. Both consist of speech texts annotated by human annotators, i.e., the domains, intents, and slot values are labeled manually. The training set is used for the classifier's learning and training; the test set measures the learning effect of the classifier by comparing actual results with expected results. It should be noted that the speech texts in the training set are different from those in the test set.
Referring to Table 1, Table 1 shows example speech texts input into the classifier together with the correct text training parameters to be output.
TABLE 1
[Table 1 is rendered as an image in the original publication; it lists example speech texts with their annotated domain, intent, and slot-value training parameters.]
It is understood that the contents of Table 1 belong to the training set. A speech text is input into the classifier, and if the text recognition parameters output by the classifier are consistent with the text training parameters in Table 1, training can be considered complete. It should be noted that the categories of the text training parameters match the categories of the text recognition parameters.
As an example, the speech text "continue playing light music" is input into the classifier; if the text recognition parameters output by the classifier are consistent with the following text training parameters, the classifier has completed training. The text training parameters include: expected domain: music; expected intent: continue; expected slot value: continue playing <MUSIC_TAG>light music</MUSIC_TAG>, where MUSIC_TAG is a tag marking the music to be played.
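The training-completion criterion just described, namely that the classifier's output must match the annotated training parameters, can be sketched as follows. Here classifier is a hypothetical object exposing a predict() method; the whole interface is an assumption:

def training_complete(classifier, training_set):
    # training_set: iterable of (speech_text, expected_params) pairs,
    # where expected_params carries the annotated domain/intent/slots.
    for speech_text, expected in training_set:
        predicted = classifier.predict(speech_text)
        matches = (predicted.domain == expected.domain
                   and predicted.intent == expected.intent
                   and predicted.slots == expected.slots)
        if not matches:
            return False
    return True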
In the embodiments of the present invention, whether the classifier passes the test depends not only on the F1 value but also on other conditions. The F1 value certainly matters and needs to be higher than before to indicate that the overall effect has improved. However, an increased F1 value does not guarantee accurate recognition of the specific speech used by users. A good classification effect on the specific speech used by users is also a necessary condition for the classifier to pass the test.
In the embodiments of the invention, the speech text classifier is obtained by training with text training parameters and practice speech texts. The practice speech texts are the speech texts involved in the actual application of the classifier. Therefore, the classifier can be trained with practice speech texts in addition to the training set, which improves speech recognition accuracy.
Referring to fig. 3, fig. 3 is a schematic flowchart of training a classifier according to an embodiment of the present invention, which specifically includes:
S301, training an original classifier with a test set comprising user speech texts to obtain a classifier that passes the test.
In the embodiment of the invention, training against the test set improves the overall classification effect, while recognition accuracy for specific speech still needs to be improved separately. Therefore, after the F1 value of the classifier has improved, the classifier must additionally be trained with practice speech texts.
Referring to fig. 4, fig. 4 is a schematic flowchart of training an original classifier to obtain a classifier that passes the test according to an embodiment of the present invention, which specifically includes:
S401, after training the original classifier with a training set comprising user speech texts of the same category, obtaining the F1 value of the original classifier trained on that category of user speech texts, where the F1 value is used to measure the classification effect of the classifier.
In the embodiment of the present invention, each category has a corresponding classifier. As an example, the categories include music and audio books; the music category corresponds to classifier 1, and the audio book category corresponds to classifier 2.
The original classifier can then be trained with a training set of user speech texts of one category, and the F1 value of the original classifier trained on that category can be obtained. Such a training set comprises manually annotated user speech texts belonging to the same category; as an example, it may include more than 20,000 user speech texts of one category.
For the categories music and audio books, classifier 1 may be trained with a training set of music-related user speech texts, and classifier 2 with a training set of audio-book-related user speech texts. Both classifier 1 and classifier 2 are original classifiers.
After the original classifiers are trained, the F1 value of each original classifier on its category of user speech texts can be obtained. The original classifiers trained on the various categories of user speech texts each have their own F1 value, from which the weighted F1 value over all categories is obtained. The weight of each category may be preset.
The weighted F1 value is computed as:

weighted F1 value = F1_1 × Ratio_1 + F1_2 × Ratio_2 + ... + F1_m × Ratio_m

where m is the number of categories, F1_i is the F1 value of the original classifier trained on the i-th category of user speech texts, and Ratio_i is the preset weight of the i-th category.
S402, if the weighted F1 value of the original classifier trained on the various categories of user speech texts is greater than the weighted F1 value of the speech text classifier, take the original classifier trained on all categories of user speech texts as the classifier that passes the test.
To ensure that the overall classification effect improves, the weighted F1 value needs to be greater than the weighted F1 value of the previous speech text classifier, i.e., the classifier currently deployed in the speech recognition process.
If the weighted F1 performance is not ideal, the per-category F1 values of the original classifier can be used for data analysis to locate problems. Only when the weighted F1 value exceeds that of the current speech text classifier is the classification effect better than before. Of course, as the weighted F1 value gradually approaches the ideal value, the per-category F1 values of the original classifier can be determined anew by updating the user speech texts in the test set.
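A minimal sketch of the weighted F1 computation and the pass/fail comparison just described; the category names and weights are illustrative assumptions:

def weighted_f1(f1_by_category, weight_by_category):
    # weighted F1 value = sum over i of F1_i * Ratio_i
    return sum(f1 * weight_by_category[c] for c, f1 in f1_by_category.items())

f1_new = weighted_f1({"music": 0.92, "audio_book": 0.88},
                     {"music": 0.6, "audio_book": 0.4})   # 0.904
f1_current = 0.90   # weighted F1 of the currently deployed classifier
passes = f1_new > f1_current   # True: the overall effect has improved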
In the embodiment of fig. 4, the original classifier is trained with the test set, so that the overall classification effect improves and a classifier that passes the test is obtained.
S302, training the classifier that passes the test with the practice speech texts; if the text recognition parameters of the practice speech texts output by the tested classifier are consistent with the text training parameters of the practice speech texts, the speech text classifier is obtained.
In the embodiment of the invention, after the original classifier is trained with the test set to obtain a classifier that passes the test, the classifier is further trained with the practice speech texts.
When the text recognition parameters of the practice speech texts output by the tested classifier are consistent with the text training parameters of the practice speech texts, the classifier has completed training, and the speech text classifier is finally obtained.
In the embodiments of the present invention, the practice speech texts include one or more of the following: high-frequency user speech texts, on-demand user speech texts, and historical-problem user speech texts.
High-frequency user speech texts are the speech texts used most frequently by users. On-demand user speech texts are speech texts requesting resources, such as requesting songs, movies, or variety shows. Historical-problem user speech texts are speech texts that the classifier once failed to classify correctly.
In the embodiment of the invention, the text training parameters of the high-frequency user speech texts are annotated manually, and the classifier's output must be consistent with the manual annotations, so the accuracy can reach 100 percent. As one example, the high-frequency user speech texts comprise the 1000 speech texts most frequently input by users.
Illustratively, the high-frequency user speech texts include basic-function speech texts and resource-related speech texts.
As one example, the basic functions include control instructions, such as: pause, continue playing, next, set an alarm clock, and so on. The speech texts involved in basic functions change little over time.
Resource-related speech texts are strongly tied to current social or network hotspots, such as viral songs from short videos, hot movies, or trending news events, and are therefore clearly time-sensitive. So when newly trending resource-related speech texts are detected, they can be added to the high-frequency user speech texts.
Referring to fig. 5, fig. 5 is a flowchart illustrating training a classifier with practice speech texts according to an embodiment of the present invention. To ensure that the classifier can correctly output text recognition parameters for different practice speech texts, the classifier that passes the test may be trained with practice speech texts. The method specifically comprises the following steps:
S501, constructing practice user speech texts from resource names combined with common on-demand sentence patterns.
In one embodiment of the invention, practice user speech texts may be constructed in advance to train the classifier. It will be appreciated that the practice speech texts are constructed from resource names combined with common on-demand sentence patterns.
The following takes practice user speech texts that are specifically on-demand user speech texts as an illustrative example.
On-demand user speech texts may be divided by domain. Illustratively, they include on-demand user speech texts in the music domain and in the audio book domain. As one example, the audio book domain includes one or more of jokes, voice content, audio books, and radio stations.
First, the common on-demand sentence patterns in a domain are selected. As examples: "I want to listen to a song", "play music", "I want to listen to song B by singer B", "play some rock", and so on. "I want to listen to song B by singer B" and "I want to listen to song A by singer A" are considered the same sentence pattern.
Second, the same resource name is substituted into the different on-demand sentence patterns, so as to remove the influence of the resource-name variable on the classifier's result, because the on-demand speech texts are mainly used to verify whether the classifier can handle different sentence patterns, not different resources.
As an example, "singer B" is used uniformly for singer names, "song B" for song names, "rock" for song styles, "artist C" for artists, and so on. Applied to the speech texts, original texts such as "I want to listen to singer A" and "I want to listen to singer C" are uniformly replaced with "I want to listen to singer B".
It will be appreciated that on-demand user speech texts can thus be constructed from resource names combined with common on-demand sentence patterns.
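The construction of on-demand user speech texts from common sentence patterns with uniform resource names can be sketched as follows; the pattern strings and placeholder names are illustrative assumptions:

PATTERNS = [
    "I want to listen to a song",
    "play music",
    "I want to listen to {song} by {singer}",
    "play some {style}",
]

UNIFORM_RESOURCES = {"singer": "singer B", "song": "song B", "style": "rock"}

def build_on_demand_texts(patterns, resources):
    # Every pattern receives the same resource names, so the classifier
    # is tested on the sentence pattern itself, not on the resource variable.
    return [p.format(**resources) for p in patterns]

for text in build_on_demand_texts(PATTERNS, UNIFORM_RESOURCES):
    print(text)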
S502, training the classifier that passes the test with the practice speech texts; if the text recognition parameters of the practice speech texts output by the tested classifier are consistent with the text training parameters of the practice speech texts, the speech text classifier is obtained.
The tested classifier can be trained with the on-demand speech texts, and the speech text classifier is obtained when the text recognition parameters of the on-demand speech texts output by the tested classifier are consistent with the text training parameters of the on-demand speech texts.
The classification accuracy on the different on-demand speech texts must reach 100 percent, i.e., the text recognition parameters of the on-demand speech texts output by the tested classifier must be consistent with their text training parameters. The influence of resource-name variables such as singers, song names, artists, and stories is eliminated, and only the accuracy on the on-demand sentence patterns is considered. This has two benefits: first, it evaluates the coverage of on-demand sentence patterns; second, during troubleshooting, it helps determine at the outset whether a problem lies in the on-demand sentence pattern or in the resource name.
In the embodiment of FIG. 5, the classifier that passes the test is trained with practice speech texts to ensure that it can correctly output the text recognition parameters for different practice user speech texts.
In the embodiments of the present invention, during iteration of the classifier it must be ensured that all historical problems remain fixed: the updated classifier must not again fail to correctly output the text recognition parameters of a historical-problem user speech text.
Specifically, user speech texts whose text recognition parameters historically could not be output correctly can be collected into the historical-problem user speech texts. The classifier that passes the test is trained with these texts, and the speech text classifier is obtained once the tested classifier can correctly output their text recognition parameters.
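The regression requirement above, that every historical problem stays fixed in later iterations, can be expressed by reusing the comparison from the earlier training_complete() sketch; the function name is an assumption:

def passes_regression(classifier, historical_problem_set):
    # Every historical-problem speech text must still be classified
    # correctly by the updated classifier, or the test is not passed.
    return training_complete(classifier, historical_problem_set)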
In the embodiments of the present invention, when the practice speech texts include multiple kinds of user speech texts, the classifier may be trained with all of them; the order in which the different kinds are used for training does not matter.
In the technical scheme of the embodiments of the present invention, input speech of a user is received; the input speech is converted into speech text of the user; the speech text is input into a speech text classifier, and the text recognition parameters output by the speech text classifier are received so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts. Training of the classifier is thus based not only on text training parameters but also on practice speech texts, which cover the specific speech texts used by users. Therefore, in addition to improving the overall accuracy of speech recognition, the specific speech used by users can be recognized accurately.
In practice, without the technical scheme of the embodiments of the invention, the iteration cycle of the classifier exceeded 2 weeks, and the classification effect on high-frequency user speech texts could not be guaranteed. With the technical scheme of the embodiments of the invention, the average iteration cycle of the classifier is 1 day, the classifier iterates up to 12 times a week, and no functional online failures have occurred.
These technical effects were obtained under the guidance of a four-point testing principle:
1. Customer first: measure test results from the user's perspective, and improve the effect from the user's perspective.
2. Speak with numbers: every result is measured numerically.
3. Independent measurement: all requirements are independent and separate, which makes it easy to judge the effect of each.
4. Set a baseline: each iteration has a baseline value for comparison and must outperform the baseline.
After the testing principle is determined, an efficient automated testing tool is required for support; it must be easy to use, fast, free of learning cost, and easy to spread. The test content mainly consists of determining the requirements on the online classifier and selecting a suitable test set for each requirement.
Referring to fig. 6, fig. 6 is a schematic diagram of the main structure of a speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus can implement the speech recognition method; as shown in fig. 6, it specifically includes:
a receiving module 601, configured to receive input speech of a user;
a conversion module 602, configured to convert the input speech of the user into speech text of the user;
a processing module 603, configured to input the speech text of the user into a speech text classifier and receive the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts.
In an embodiment of the present invention, the processing module 603 is specifically configured to train an original classifier with a test set comprising user speech texts to obtain a classifier that passes the test;
and to train the tested classifier with the practice speech texts; if the text recognition parameters of the practice speech texts output by the tested classifier are consistent with the text training parameters of the practice speech texts, the speech text classifier is obtained.
In an embodiment of the present invention, the processing module 603 is specifically configured to obtain, after training the original classifier with a training set comprising user speech texts of the same category, the F1 value of the original classifier trained on that category, where the F1 value is used to measure the classification effect of the classifier;
and, if the weighted F1 value of the original classifier trained on the various categories of user speech texts is greater than the weighted F1 value of the speech text classifier, to take the original classifier trained on all categories of user speech texts as the classifier that passes the test.
In one embodiment of the invention, the practice speech texts are constructed from resource names combined with common on-demand sentence patterns.
In one embodiment of the invention, the practice user speech texts include on-demand user speech texts in the music domain and on-demand user speech texts in the audio book domain.
In one embodiment of the invention, the text recognition parameters include one or more of the following: domain, intent, and slot value.
In one embodiment of the invention, the practice speech texts include one or more of the following: high-frequency user speech texts, on-demand user speech texts, and historical-problem user speech texts.
Fig. 7 shows an exemplary system architecture 700 to which the method of speech recognition or the apparatus of speech recognition of an embodiment of the present invention may be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 701, 702, 703 to interact with a server 705 over a network 704, to receive or send messages or the like. The terminal devices 701, 702, 703 may have installed thereon various communication client applications, such as a shopping-like application, a web browser application, a search-like application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only).
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server providing various services, such as a background management server (merely an example) that supports shopping websites browsed by users with the terminal devices 701, 702, 703. The background management server may analyze and otherwise process received data such as a product information query request, and feed the processing result (for example, target push information or product information, merely as examples) back to the terminal device.
It should be noted that the method for speech recognition provided by the embodiment of the present invention is generally executed by the server 705, and accordingly, the apparatus for speech recognition is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 810 as necessary, so that a computer program read from it can be installed into the storage section 808 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a transmitting unit, an obtaining unit, a determining unit, and a first processing unit. The names of these units do not in some cases constitute a limitation to the unit itself, and for example, the sending unit may also be described as a "unit sending a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise:
receiving input speech of a user;
converting the input speech of the user into speech text of the user;
inputting the speech text of the user into a speech text classifier, and receiving the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts.
According to the technical scheme of the embodiments of the present invention, input speech of a user is received; the input speech is converted into speech text of the user; the speech text is input into a speech text classifier, and the text recognition parameters output by the speech text classifier are received so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts. Training of the classifier is thus based not only on text training parameters but also on practice speech texts, which cover the specific speech texts used by users. Therefore, in addition to improving the overall accuracy of speech recognition, the specific speech used by users can be recognized accurately.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method of speech recognition, comprising:
receiving input speech of a user;
converting the input speech of the user into speech text of the user;
inputting the speech text of the user into a speech text classifier, and receiving the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts.
2. The method of speech recognition according to claim 1, wherein before inputting the user's speech text into the speech text classifier, the method further comprises:
training an original classifier with a test set comprising user speech texts to obtain a classifier that passes the test;
and training the classifier that passes the test with the practice speech texts; if the text recognition parameters of the practice speech texts output by the classifier that passes the test are consistent with the text training parameters of the practice speech texts, the speech text classifier is obtained.
3. The method of speech recognition according to claim 2, wherein training the original classifier with a test set comprising a plurality of user speech texts to obtain a classifier that passes the test comprises:
after training the original classifier with a training set comprising user speech texts of the same category, obtaining the F1 value of the original classifier trained on that category of user speech texts, wherein the F1 value is used to measure the classification effect of the classifier;
and if the weighted F1 value of the original classifier trained on the various categories of user speech texts is greater than the weighted F1 value of the speech text classifier, taking the original classifier trained on all categories of user speech texts as the classifier that passes the test.
4. The method of claim 2, wherein the practice speech texts are constructed from resource names combined with common on-demand sentence patterns.
5. The method of speech recognition according to claim 4, wherein the practice speech texts comprise on-demand user speech texts in the music domain and on-demand user speech texts in the audio book domain.
6. The method of speech recognition according to claim 1, wherein the categories of the text recognition parameters include one or more of: domain, intent, and slot value.
7. The method of speech recognition according to claim 1, wherein the practice speech texts comprise one or more of: high-frequency user speech texts, on-demand user speech texts, and historical-problem user speech texts.
8. An apparatus for speech recognition, comprising:
a receiving module, configured to receive input speech of a user;
a conversion module, configured to convert the input speech of the user into speech text of the user;
and a processing module, configured to input the speech text of the user into a speech text classifier and receive the text recognition parameters of the speech text output by the speech text classifier so as to execute the corresponding operation, wherein the speech text classifier is obtained by training with text training parameters and practice speech texts.
9. The speech recognition apparatus of claim 8, wherein the processing module is further configured to train an original classifier with a test set comprising user speech texts to obtain a classifier that passes the test;
and to train the classifier that passes the test with the practice speech texts; if the text recognition parameters of the practice speech texts output by the classifier that passes the test are consistent with the preset parameters of the practice speech texts, the speech text classifier is obtained.
10. An electronic device for speech recognition, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
11. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010230681.6A 2020-03-27 2020-03-27 Method, device, equipment and computer readable medium for speech recognition Pending CN111798853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010230681.6A CN111798853A (en) 2020-03-27 2020-03-27 Method, device, equipment and computer readable medium for speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010230681.6A CN111798853A (en) 2020-03-27 2020-03-27 Method, device, equipment and computer readable medium for speech recognition

Publications (1)

Publication Number Publication Date
CN111798853A true CN111798853A (en) 2020-10-20

Family

ID=72806637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010230681.6A Pending CN111798853A (en) 2020-03-27 2020-03-27 Method, device, equipment and computer readable medium for speech recognition

Country Status (1)

Country Link
CN (1) CN111798853A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040215445A1 (en) * 1999-09-27 2004-10-28 Akitoshi Kojima Pronunciation evaluation system
CN107247768A (en) * 2017-06-05 2017-10-13 北京智能管家科技有限公司 Method for ordering song by voice, device, terminal and storage medium
CN108596470A (en) * 2018-04-19 2018-09-28 浙江大学 A kind of power equipments defect text handling method based on TensorFlow frames
CN109241261A (en) * 2018-08-30 2019-01-18 武汉斗鱼网络科技有限公司 User's intension recognizing method, device, mobile terminal and storage medium
CN109902173A (en) * 2019-01-31 2019-06-18 青岛科技大学 A kind of Chinese Text Categorization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113630613A (en) * 2021-07-30 2021-11-09 出门问问信息科技有限公司 Information processing method, device and storage medium
CN113630613B (en) * 2021-07-30 2023-11-10 出门问问信息科技有限公司 Information processing method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination