CN112767923B - Voice recognition method and device - Google Patents

Voice recognition method and device

Info

Publication number
CN112767923B
CN112767923B (application CN202110008353.6A)
Authority
CN
China
Prior art keywords
data
text
pinyin
unvoiced
preset database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110008353.6A
Other languages
Chinese (zh)
Other versions
CN112767923A (en)
Inventor
张伟涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Weimeng Enterprise Development Co ltd
Original Assignee
Shanghai Weimeng Enterprise Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Weimeng Enterprise Development Co ltd
Priority to CN202110008353.6A
Publication of CN112767923A
Application granted
Publication of CN112767923B

Classifications

    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L2015/086 Recognition of spelled words
    • G06F16/00 Information retrieval; database structures therefor; file system structures therefor
    • G06F16/60 Information retrieval of audio data
    • G06F16/63 Querying
    • G06F16/683 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/685 Retrieval using an automatically derived transcript of audio data, e.g. lyrics

Abstract

The invention discloses a voice recognition method and a voice recognition device. A learned model first converts the voice to be recognized into its corresponding unvoiced pinyin data (pinyin without tone marks), and the matching text is then retrieved from a preset database according to that unvoiced pinyin data to obtain the recognition result. Compared with existing methods that directly learn to output the characters corresponding to the voice, this improves the accuracy of recognizing the voice to be recognized.

Description

Voice recognition method and device
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus.
Background
In the prior art, voice recognition methods are mainly designed for general scenes. In special fields such as catering, however, these methods recognize domain-specific proper nouns with low accuracy, and in natural scenes the recognition rate is further reduced by environmental noise and other interference.
Disclosure of Invention
In view of the foregoing, it is an object of the present invention to provide a speech recognition method and apparatus capable of improving recognition accuracy.
To achieve the above object, the invention provides the following technical solution:
a speech recognition method, comprising:
acquiring voice data to be recognized;
according to the voice data to be recognized, a first detection model is used for obtaining unvoiced pinyin data corresponding to the voice data to be recognized;
and searching a text matched with the unvoiced pinyin data from a preset database according to the obtained unvoiced pinyin data, and outputting the obtained text.
Preferably, retrieving a text matching the unvoiced pinyin data from a preset database according to the obtained unvoiced pinyin data includes:
according to the obtained unvoiced pinyin data, if a text with unvoiced pinyin consistent with the unvoiced pinyin data is not retrieved from the preset database, character data corresponding to the voice data to be recognized are obtained by using a second detection model according to the obtained unvoiced pinyin data;
and searching a text matched with the unvoiced pinyin data or the character data from the preset database according to the obtained unvoiced pinyin data or the character data, and outputting the obtained text.
Preferably, retrieving a text matching the unvoiced pinyin data from a preset database according to the obtained unvoiced pinyin data includes:
and according to the obtained unvoiced pinyin data, if a text with unvoiced pinyin consistent with the unvoiced pinyin data is retrieved from the preset database, outputting the obtained text.
Preferably, the retrieving, from the preset database, a text matched with the unvoiced pinyin data or the text data according to the obtained unvoiced pinyin data or the text data includes:
according to the obtained character data, if no text consistent with the character data is retrieved from the preset database, retrieving from the preset database, according to the obtained unvoiced pinyin data, a text whose unvoiced pinyin has a first similarity with the unvoiced pinyin data that meets a requirement, retrieving from the preset database, according to the obtained character data, a text whose second similarity with the character data meets a requirement, and outputting the obtained text.
Preferably, the method specifically comprises the following steps: retrieving from the preset database, according to the obtained unvoiced pinyin data, texts whose first similarity with the unvoiced pinyin data meets the requirement, retrieving from the preset database, according to the obtained character data, texts whose second similarity with the character data meets the requirement, and merging and deduplicating the two sets of texts.
Preferably, the method specifically comprises the following steps: screening the qualifying texts from the texts retrieved from the preset database according to the first similarity between a retrieved text's unvoiced pinyin and the obtained unvoiced pinyin data, the second similarity between the retrieved text and the obtained character data, and the common-character ratio between the retrieved text and the obtained character data.
Preferably, the method specifically comprises the following steps: summing, for each retrieved text, the first similarity between its unvoiced pinyin and the obtained unvoiced pinyin data, the second similarity between the text and the obtained character data, and the common-character ratio between the text and the obtained character data, and screening the qualifying texts from the retrieved texts according to the summation result.
Preferably, according to the obtained unvoiced pinyin data or the obtained text data, retrieving a text matching the unvoiced pinyin data or the text data from the preset database includes:
and according to the obtained character data, if a text consistent with the character data is retrieved from the preset database, outputting the obtained text.
Preferably, the first detection model and the second detection model are obtained by training on a data set that includes voice data, the character data corresponding to the voice, and the pinyin data corresponding to the voice; the first detection model uses the unvoiced pinyin as its label, and the second detection model uses the characters as its label.
A speech recognition apparatus for performing the speech recognition method described above.
According to the technical scheme above, the voice recognition method and device provided by the invention first acquire the voice data to be recognized, then use the first detection model to obtain the unvoiced pinyin data corresponding to the voice data to be recognized, and further retrieve from a preset database a text matching the unvoiced pinyin data, outputting the obtained text. By learning the voice to be recognized into its corresponding unvoiced pinyin data and retrieving the matching text from the preset database, the method and device improve the accuracy of recognizing the voice to be recognized.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the following drawings show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a speech recognition method according to another embodiment of the present invention;
fig. 3 is a flowchart of a method for retrieving a text matching the unvoiced pinyin data or text data from a predetermined database according to the obtained unvoiced pinyin data or text data according to an embodiment of the present invention.
Detailed Description
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments derived by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention. The method includes the following steps:
S10: Acquire voice data to be recognized.
The voice data to be recognized is voice data obtained through a voice acquisition device.
S11: and according to the voice data to be recognized, using a first detection model to obtain unvoiced pinyin data corresponding to the voice data to be recognized.
The first detection model takes voice data as input data and takes unvoiced pinyin data as output data. The first detection model obtains unvoiced pinyin data corresponding to the voice data by extracting and learning features from the input voice data.
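The label design can be made concrete with a small sketch. This is not part of the patent: it assumes the training transcripts carry tone-numbered pinyin such as "hong2 shao1", whereas the description only states that the first model is labeled with unvoiced pinyin. Under that assumption, the labels are obtained by stripping the tone digits:

```python
import re

def to_unvoiced(pinyin: str) -> str:
    """Strip tone digits from tone-numbered pinyin, e.g. 'hong2 shao1' -> 'hong shao'.

    Assumes tone-numbered transcripts; the patent only specifies that the
    first detection model uses unvoiced (toneless) pinyin as its label.
    """
    return re.sub(r"(?<=[a-z])[1-5]", "", pinyin.lower())
```

Dropping the tones collapses the roughly 1,300 toned Mandarin syllables to about 400 toneless ones, which is the label-count reduction the description later credits for the smaller model.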
S12: and searching a text matched with the unvoiced pinyin data from a preset database according to the obtained unvoiced pinyin data, and outputting the obtained text.
The preset database includes text for matching. And obtaining a recognition result of the voice data to be recognized by retrieving the text matched with the voice data to be recognized from a preset database. In practical application, a corresponding preset database can be established according to a practical application scene.
The text matching the unvoiced pinyin data means that the unvoiced pinyin for the text is at least partially identical to the unvoiced pinyin data. And searching out a text matched with the soundless tone pinyin data from a preset database according to the soundless tone pinyin data corresponding to the obtained speech data to be recognized, and obtaining a recognition result of the speech to be recognized.
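A minimal sketch of the exact-match retrieval in S12, assuming the preset database is an in-memory dictionary keyed by unvoiced pinyin. The `PINYIN` mapping is a hypothetical stand-in for a pinyin library (e.g. pypinyin) that would derive each entry's unvoiced pinyin in practice:

```python
# Hypothetical pinyin mapping, invented for illustration.
PINYIN = {
    "红烧茄子": "hong shao qie zi",
    "红烧肘子": "hong shao zhou zi",
}

def build_index(texts):
    """Index each database text by its unvoiced pinyin (several texts may share a key)."""
    index = {}
    for text in texts:
        index.setdefault(PINYIN[text], []).append(text)
    return index

def exact_match(index, unvoiced_pinyin):
    """Step S12's exact lookup: all texts whose unvoiced pinyin equals the query's."""
    return index.get(unvoiced_pinyin, [])
```

An empty result here is what triggers the fallback to the second detection model in the embodiment of fig. 2.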
By obtaining the unvoiced pinyin data corresponding to the voice to be recognized, the method of this embodiment improves the accuracy of learning the voice, and the matching text retrieved from the preset database according to that unvoiced pinyin data gives the recognition result.
Referring to fig. 2, fig. 2 is a flowchart of a speech recognition method according to another embodiment of the present invention. The method includes the following steps:
S20: Acquire voice data to be recognized.
The voice data to be recognized is obtained through a voice acquisition device, including but not limited to a microphone.
S21: and according to the voice data to be recognized, using a first detection model to obtain unvoiced pinyin data corresponding to the voice data to be recognized.
S22: and searching a text with the unvoiced pinyin consistent with the unvoiced pinyin data from a preset database according to the obtained unvoiced pinyin data.
And searching the text with the silent pinyin of the text consistent with the silent pinyin data from a preset database according to the silent pinyin data corresponding to the voice data to be recognized, which is obtained through the first detection model.
S23: and according to the obtained unvoiced pinyin data, if a text with unvoiced pinyin consistent with the unvoiced pinyin data is retrieved from the preset database, outputting the obtained text. Thereby obtaining a recognition result for the voice data to be recognized.
S24: and according to the obtained unvoiced pinyin data, if a text with unvoiced pinyin consistent with the unvoiced pinyin data is not retrieved from the preset database, using a second detection model to obtain character data corresponding to the voice data to be recognized according to the obtained unvoiced pinyin data.
The second detection model takes the silent pinyin data as input data and takes the text data as output data. The second detection model converts the unvoiced pinyin data into corresponding text data by extracting and learning features from the input unvoiced pinyin data.
And if the text with the unvoiced pinyin consistent with the unvoiced pinyin data corresponding to the voice data to be recognized is not searched from the preset database, inputting the unvoiced pinyin data corresponding to the voice data to be recognized into the second detection model to obtain the character data corresponding to the voice data to be recognized.
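The patent's second detection model is a learned sequence model labeled with characters. As a hypothetical stand-in for illustration only, a greedy longest-match lexicon lookup shows the shape of the pinyin-to-character conversion step (the `LEXICON` entries are invented for this sketch):

```python
# Hypothetical homophone lexicon, invented for this sketch; a real second
# detection model is trained rather than table-driven.
LEXICON = {
    "hong shao": "红烧",
    "qi zi": "妻子",
    "qie zi": "茄子",
}

def pinyin_to_chars(unvoiced: str) -> str:
    """Convert unvoiced pinyin to characters by greedy longest-match over syllables."""
    syllables = unvoiced.split()
    out, i = [], 0
    while i < len(syllables):
        for j in range(len(syllables), i, -1):  # try the longest span first
            key = " ".join(syllables[i:j])
            if key in LEXICON:
                out.append(LEXICON[key])
                i = j
                break
        else:
            out.append(syllables[i])  # no entry: keep the raw syllable
            i += 1
    return "".join(out)
```

Note the ambiguity this step must resolve: "qi zi" maps to 妻子 here, which is exactly the kind of output the later fuzzy retrieval corrects against the preset database.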
S25: and searching a text matched with the unvoiced pinyin data or the character data from the preset database according to the obtained unvoiced pinyin data or the character data, and outputting the obtained text.
The text matching the character data means that the text is at least partially identical to the character data. And retrieving a text matched with the phonetic data of the silent tone from a preset database according to the phonetic data of the silent tone corresponding to the phonetic data to be recognized, or/and retrieving a text matched with the character data from the preset database according to the character data corresponding to the obtained phonetic data to be recognized, so as to obtain a recognition result of the phonetic data to be recognized.
Preferably, referring to fig. 3, the step of retrieving the text matching the unvoiced pinyin data or the text data from the preset database according to the obtained unvoiced pinyin data or the text data may specifically include the following steps:
s250: and retrieving a text consistent with the character data from the preset database according to the obtained character data.
And retrieving a text consistent with the character data from a preset database according to the character data corresponding to the voice data to be recognized, which is obtained through the second detection model.
S251: and according to the obtained character data, if a text consistent with the character data is retrieved from the preset database, outputting the obtained text. A recognition result for the speech data to be recognized is obtained.
S252: according to the obtained character data, if a text which is consistent with the character data is not retrieved from the preset database, retrieving a text which meets the requirement of first similarity between the silent pinyin and the silent pinyin data from the preset database according to the obtained silent pinyin data, retrieving a text which meets the requirement of second similarity between the silent pinyin and the character data from the preset database according to the obtained character data, and outputting the obtained text.
The first similarity represents the similarity between two pinyin data, and the second similarity represents the similarity between two text data.
If no text consistent with the character data obtained by the second detection model is found in the preset database, texts matching the unvoiced pinyin data are retrieved, their first similarity to the unvoiced pinyin data is computed, and the texts meeting the requirement are selected and output. Likewise, texts matching the character data are retrieved, their second similarity to the character data is computed, and the texts meeting the requirement are selected and output.
In practical application, the texts whose first similarity with the unvoiced pinyin data meets the requirement and the texts whose second similarity with the character data meets the requirement can be merged and deduplicated to obtain candidate texts, from which the results are further screened.
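The merge-and-deduplicate step above can be sketched as an order-preserving union of the two retrieved lists (a minimal sketch; it assumes the order within each list reflects retrieval relevance, which the patent does not specify):

```python
def merge_candidates(by_pinyin, by_text):
    """Union of the two retrieved lists; the first occurrence of a text wins."""
    seen, merged = set(), []
    for t in by_pinyin + by_text:
        if t not in seen:
            seen.add(t)
            merged.append(t)
    return merged
```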
Optionally, the qualifying texts are screened from the retrieved texts according to three measures: the first similarity between a retrieved text's unvoiced pinyin and the obtained unvoiced pinyin data, the second similarity between the retrieved text and the obtained character data, and the common-character ratio between the retrieved text and the obtained character data. The screened texts are output as the recognition result for the voice data to be recognized.
Further preferably, these three measures may be summed for each retrieved text, and the qualifying texts screened from the retrieved texts according to the summation result.
In practical application, the retrieved matching texts can be sorted by this sum, and the texts with the larger sums selected and output.
Optionally, the first similarity may be computed over the pinyin characters; the second similarity may be the cosine similarity between vector representations of the texts; and the common-character ratio may use the Jaccard coefficient, i.e. the ratio of the characters shared by two character strings to their total distinct characters.
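Under these choices, the three measures and their sum can be sketched as follows. Two details are assumptions not fixed by the patent: the first similarity is taken as a character-level match ratio (`difflib.SequenceMatcher`), and the text vectors for the cosine are simple character counts rather than learned embeddings:

```python
from collections import Counter
from difflib import SequenceMatcher
import math

def pinyin_similarity(a: str, b: str) -> float:
    """First similarity: character-level match ratio between two pinyin strings."""
    return SequenceMatcher(None, a, b).ratio()

def cosine_similarity(a: str, b: str) -> float:
    """Second similarity: cosine between character-count vectors of the two texts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def common_char_ratio(a: str, b: str) -> float:
    """Jaccard coefficient: shared characters over total distinct characters."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def score(cand_text: str, cand_pinyin: str, query_text: str, query_pinyin: str) -> float:
    """Sum of the three measures, used to rank candidate texts."""
    return (pinyin_similarity(cand_pinyin, query_pinyin)
            + cosine_similarity(cand_text, query_text)
            + common_char_ratio(cand_text, query_text))
```

A perfect candidate scores 3.0 (each measure maxes at 1.0), so "the texts with the larger sums" are simply those closest to that ceiling.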
The first detection model and the second detection model are obtained by pre-training on a data set that includes voice data, the character data corresponding to the voice, and the pinyin data corresponding to the voice; the first detection model uses the unvoiced pinyin as its label, and the second detection model uses the characters as its label.
The models may be trained on data common to the target application scene, so the data set may consist of such scene-specific data. In practice, public speech data sets may be used when no domain corpus is available.
The method can be applied to the catering field, where the preset database is a dish knowledge base. In one embodiment, the unvoiced pinyin obtained by inputting the voice to be recognized into the first detection model is "hong shao qi zi", and no fully consistent text can be retrieved from the dish knowledge base. "hong shao qi zi" is then input into the second detection model, which outputs the characters 红烧妻子 (literally "braised wife", a mishearing of 红烧茄子, braised eggplant). No fully consistent text can be retrieved for these characters either, so matching texts are retrieved from the dish knowledge base according to both "hong shao qi zi" and 红烧妻子, yielding the top three results 红烧茄子 (braised eggplant), 红烧肘子 (braised pork knuckle) and 红烧丸子 (braised meatballs), which are returned as recognition results for the user to choose from. When the result is empty or all scores after sorting are low, the query may or may not be a new dish name; whether it is a dish name can be judged by a language model trained on the dish knowledge base.
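The walk-through above can be reproduced end to end with a toy dish knowledge base. Everything here is illustrative: the `KB` pinyin values are assumptions, and `SequenceMatcher` stands in for both similarity measures, which the patent leaves open:

```python
from difflib import SequenceMatcher

# Toy dish knowledge base: text -> assumed unvoiced pinyin (in practice a
# pinyin library would derive these values).
KB = {
    "红烧茄子": "hong shao qie zi",   # braised eggplant
    "红烧肘子": "hong shao zhou zi",  # braised pork knuckle
    "红烧丸子": "hong shao wan zi",   # braised meatballs
    "宫保鸡丁": "gong bao ji ding",   # kung pao chicken
}

def rank(query_pinyin: str, query_text: str, top_k: int = 3):
    """Fallback fuzzy retrieval: sum pinyin similarity, text similarity and
    common-character ratio, then return the best-scoring dish names."""
    def jaccard(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb)
    scored = []
    for text, py in KB.items():
        s = (SequenceMatcher(None, py, query_pinyin).ratio()
             + SequenceMatcher(None, text, query_text).ratio()
             + jaccard(text, query_text))
        scored.append((s, text))
    return [t for _, t in sorted(scored, reverse=True)[:top_k]]

# Misheard query: "qi zi" (妻子, wife) instead of "qie zi" (茄子, eggplant).
candidates = rank("hong shao qi zi", "红烧妻子")
```

With these inputs the intended dish 红烧茄子 ranks first, and the unrelated 宫保鸡丁 falls outside the returned top three, matching the behavior the embodiment describes.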
By obtaining the unvoiced pinyin data corresponding to the voice to be recognized through the first detection model, the method of this embodiment uses far fewer labels than methods that learn voice data with characters as labels, so the model needs fewer parameters during training and can reach higher accuracy.
In addition, existing methods that learn to recognize voice data with characters as labels require a large data set of domain-specific data to train for a specialized field, and their results are hard to control.
Correspondingly, the embodiment of the invention also provides a voice recognition device, which is used for executing the voice recognition method.
The speech recognition device of this embodiment first acquires the voice data to be recognized, then uses the first detection model to obtain the corresponding unvoiced pinyin data, and further retrieves from a preset database a text matching the unvoiced pinyin data, outputting the obtained text. By learning the voice into its unvoiced pinyin data and retrieving the matching text from the preset database, the device improves recognition accuracy compared with existing methods that directly learn the characters corresponding to the voice.
The voice recognition method and device provided by the invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (9)

1. A speech recognition method, comprising:
acquiring voice data to be recognized;
according to the voice data to be recognized, a first detection model is used for obtaining unvoiced pinyin data corresponding to the voice data to be recognized;
searching a text matched with the unvoiced pinyin data from a preset database according to the obtained unvoiced pinyin data, and outputting the obtained text;
the retrieving, from a preset database, a text matching the unvoiced pinyin data according to the obtained unvoiced pinyin data includes:
according to the obtained unvoiced pinyin data, if a text with unvoiced pinyin consistent with the unvoiced pinyin data is not retrieved from the preset database, character data corresponding to the voice data to be recognized are obtained by using a second detection model according to the obtained unvoiced pinyin data;
and according to the obtained character data, retrieving a text matched with the character data from the preset database, and outputting the obtained text.
2. The speech recognition method according to claim 1, wherein retrieving, from a preset database, text that matches the unvoiced pinyin data based on the obtained unvoiced pinyin data comprises:
and according to the obtained unvoiced pinyin data, if a text with unvoiced pinyin consistent with the unvoiced pinyin data is retrieved from the preset database, outputting the obtained text.
3. The speech recognition method of claim 1, wherein retrieving, from the pre-set database, text matching the text data based on the obtained text data comprises:
according to the obtained character data, if no text consistent with the character data is retrieved from the preset database, retrieving from the preset database, according to the obtained unvoiced pinyin data, a text whose unvoiced pinyin has a first similarity with the unvoiced pinyin data that meets a requirement, retrieving from the preset database, according to the obtained character data, a text whose second similarity with the character data meets a requirement, and outputting the obtained text.
4. The speech recognition method of claim 3, comprising in particular: and searching a text with the first similarity of the unvoiced pinyin and the unvoiced pinyin data meeting the requirement from the preset database according to the obtained unvoiced pinyin data, searching a text with the second similarity of the character data meeting the requirement from the preset database according to the obtained character data, and merging and de-duplicating the two parts of texts.
5. The speech recognition method according to claim 3, comprising in particular: and screening out a text meeting the requirement from the texts retrieved from the preset database according to a first similarity between the unvoiced pinyin of the text retrieved from the preset database and the obtained unvoiced pinyin data, a second similarity between the text retrieved from the preset database and the obtained character data and a common character ratio between the text retrieved from the preset database and the obtained character data.
6. The speech recognition method of claim 3, comprising in particular: summing a first similarity between the unvoiced pinyin of the text retrieved from the preset database and the obtained unvoiced pinyin data, a second similarity between the text retrieved from the preset database and the obtained text data, and a ratio of common characters between the text retrieved from the preset database and the obtained text data, and screening out a text meeting requirements from the text retrieved from the preset database according to a summation result.
7. The speech recognition method of claim 1, wherein retrieving, from the pre-set database, text matching the text data based on the obtained text data comprises:
and according to the obtained character data, if a text consistent with the character data is retrieved from the preset database, outputting the obtained text.
8. The speech recognition method of claim 1, wherein the first detection model and the second detection model are obtained by training using a data set, the data set comprises voice data, character data corresponding to the voice, and pinyin data corresponding to the voice, the first detection model uses the unvoiced pinyin as its label, and the second detection model uses the characters as its label.
9. A speech recognition apparatus for performing the speech recognition method of any one of claims 1-8.
CN202110008353.6A 2021-01-05 2021-01-05 Voice recognition method and device Active CN112767923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110008353.6A CN112767923B (en) 2021-01-05 2021-01-05 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110008353.6A CN112767923B (en) 2021-01-05 2021-01-05 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN112767923A (en) 2021-05-07
CN112767923B (en) 2022-12-23

Family

ID=75699340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110008353.6A Active CN112767923B (en) 2021-01-05 2021-01-05 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN112767923B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514387A (en) * 2002-12-31 2004-07-21 中国科学院计算技术研究所 Sound distinguishing method in speech sound inquiry
CN101825953A (en) * 2010-04-06 2010-09-08 朱建政 Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN111681669A (en) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 Neural network-based voice data identification method and equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002229590A (en) * 2001-02-01 2002-08-16 Atr Onsei Gengo Tsushin Kenkyusho:Kk Speech recognition system
CN101000766B (en) * 2007-01-09 2011-02-02 黑龙江大学 Chinese intonation base frequency contour generating method based on intonation model
US8977535B2 (en) * 2011-04-06 2015-03-10 Pierre-Henry DE BRUYN Transliterating methods between character-based and phonetic symbol-based writing systems
US8521539B1 (en) * 2012-03-26 2013-08-27 Nuance Communications, Inc. Method for chinese point-of-interest search
CN105389326B (en) * 2015-09-16 2018-08-31 中国科学院计算技术研究所 Image labeling method based on weak matching probability typical relevancy models
JP6708035B2 (en) * 2016-07-19 2020-06-10 株式会社デンソー Utterance content recognition device
CN108682423A (en) * 2018-05-24 2018-10-19 北京奔流网络信息技术有限公司 A kind of audio recognition method and device
CN110164435A (en) * 2019-04-26 2019-08-23 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN111739514B (en) * 2019-07-31 2023-11-14 北京京东尚科信息技术有限公司 Voice recognition method, device, equipment and medium
CN110853629A (en) * 2019-11-21 2020-02-28 中科智云科技有限公司 Speech recognition digital method based on deep learning
CN111312255A (en) * 2020-04-24 2020-06-19 郑州迈拓信息技术有限公司 Pronunciation self-correcting device for word and pinyin tones based on voice recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An improved incremental speech corpus selection algorithm; Ning Zhenjiang et al.; Journal of the Graduate School of the Chinese Academy of Sciences; 2005-03-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN112767923A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
US10403282B2 (en) Method and apparatus for providing voice service
KR101309042B1 (en) Apparatus for multi domain sound communication and method for multi domain sound communication using the same
CN107016994B (en) Voice recognition method and device
CN105931644B (en) A kind of audio recognition method and mobile terminal
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
CN108428446A (en) Audio recognition method and device
CN106486121B (en) Voice optimization method and device applied to intelligent robot
CN105869640B (en) Method and device for recognizing voice control instruction aiming at entity in current page
JP2019061662A (en) Method and apparatus for extracting information
CN105956053B (en) A kind of searching method and device based on the network information
JP7266683B2 (en) Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction
CN109448704A (en) Construction method, device, server and the storage medium of tone decoding figure
CN109920409B (en) Sound retrieval method, device, system and storage medium
CN110334110A (en) Natural language classification method, device, computer equipment and storage medium
CN111951779A (en) Front-end processing method for speech synthesis and related equipment
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN110019741A (en) Request-answer system answer matching process, device, equipment and readable storage medium storing program for executing
KR20060070605A (en) Using domain dialogue model and language model in intelligent robot speech recognition service device and method
CN110675866A (en) Method, apparatus and computer-readable recording medium for improving at least one semantic unit set
JPWO2016178337A1 (en) Information processing apparatus, information processing method, and computer program
KR101677859B1 (en) Method for generating system response using knowledgy base and apparatus for performing the method
Chakraborty et al. Knowledge-based framework for intelligent emotion recognition in spontaneous speech
KR20190059185A (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN112767923B (en) Voice recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant