CN113889115A - Dialect paraphrasing method based on a speech model and related device - Google Patents


Info

Publication number
CN113889115A
CN113889115A (application CN202111151990.5A)
Authority
CN
China
Prior art keywords
text
dialect
voice
mandarin
transcription
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111151990.5A
Other languages
Chinese (zh)
Inventor
乔文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Easy City Square Network Technology Co ltd
Original Assignee
Easy City Square Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Easy City Square Network Technology Co ltd
Priority to CN202111151990.5A
Publication of CN113889115A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding

Abstract

The application discloses a dialect paraphrasing method based on a speech model and a related device. The method includes: acquiring a speech text, and determining the paraphrase intention corresponding to the speech text through a pre-trained intention recognition model; when the paraphrase intention is to paraphrase into a dialect, determining the dialect area and the paraphrase text corresponding to the speech text through a pre-trained entity recognition model; and searching a preset database for the dialect text corresponding to the paraphrase text based on the dialect area and the paraphrase text, and converting the speech text into dialect speech based on the dialect text. The preset database aggregates common expressions for each dialect area; the intention recognition model and the entity recognition model determine the sentence to be paraphrased into a dialect and its target dialect area; and the corresponding dialect speech is then selected from the preset database. This improves the accuracy of dialect paraphrasing and makes the assistant more convenient to use.

Description

Dialect paraphrasing method based on a speech model and related device
Technical Field
The present application relates to the field of computer technology, and in particular to a dialect paraphrasing method based on a speech model and a related apparatus.
Background
Existing voice assistants respond to common questions in a fairly natural way, but their expressive ability is largely limited by the content of the training corpus and the generalization ability of the model, so they usually appear less intelligent on questions that are uncommon or have no standard answer. For example, when asked how a sentence is said in a particular dialect, they often answer incorrectly or not at all. This limits the usefulness of the voice assistant on the one hand and is inconvenient for the user on the other.
Thus, the prior art still awaits improvement and development.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a dialect paraphrasing method based on a speech model and a related apparatus, aiming at the deficiencies of the prior art.
To solve the above technical problem, a first aspect of the embodiments of the present application provides a dialect paraphrasing method based on a speech model, where the method includes:
acquiring a speech text, and determining the paraphrase intention corresponding to the speech text through a pre-trained intention recognition model;
when the paraphrase intention is to paraphrase into a dialect, determining the dialect area and the paraphrase text corresponding to the speech text through a pre-trained entity recognition model;
and searching a preset database for the dialect speech corresponding to the paraphrase text based on the dialect area and the paraphrase text, so as to convert the speech text into the dialect speech.
Before acquiring the speech text and determining the paraphrase intention corresponding to the speech text through the pre-trained intention recognition model, the method further includes:
constructing a preset database, where the preset database includes several data groups, and each data group includes a dialect area, a dialect text belonging to the dialect area, the dialect speech corresponding to the dialect text, and the Mandarin text corresponding to the dialect text.
In the dialect paraphrasing method based on the speech model, the intention recognition model and the entity recognition model are both pre-trained BERT models.
In the dialect paraphrasing method based on the speech model, searching the preset database for the dialect text corresponding to the paraphrase text based on the dialect area and the paraphrase text specifically includes:
searching the preset database for all reference data groups corresponding to the dialect area;
and searching all the found reference data groups for the Mandarin text matching the paraphrase text, and taking the dialect text corresponding to that Mandarin text as the dialect text corresponding to the paraphrase text.
In the dialect paraphrasing method based on the speech model, searching all the found reference data groups for the Mandarin text matching the paraphrase text specifically includes:
searching all reference data groups for a target Mandarin text identical in content to the paraphrase text;
if the target Mandarin text is found, taking the target Mandarin text as the Mandarin text matching the paraphrase text;
if the target Mandarin text is not found, determining the similarity between the Mandarin text in each reference data group and the paraphrase text through a pre-trained BERT model, and determining the Mandarin text matching the paraphrase text based on the similarity.
In the dialect paraphrasing method based on the speech model, determining the Mandarin text matching the paraphrase text based on the similarity specifically includes:
selecting from all reference data groups the candidate data groups whose similarity is greater than a preset similarity threshold;
if candidate data groups are selected, taking the Mandarin text in the candidate data group with the greatest similarity as the Mandarin text matching the paraphrase text;
and if no candidate data group is selected, taking a default text as the Mandarin text matching the paraphrase text.
In the dialect paraphrasing method based on the speech model, after searching the preset database for the dialect speech corresponding to the paraphrase text based on the dialect area and the paraphrase text so as to convert the speech text into the dialect speech, the method further includes:
playing the dialect speech through a voice playing device.
A second aspect of the embodiments of the present application provides a dialect paraphrasing apparatus based on a speech model, where the apparatus includes:
an obtaining module, configured to obtain a speech text and determine the paraphrase intention corresponding to the speech text through a pre-trained intention recognition model;
a determining module, configured to determine the dialect area and the paraphrase text corresponding to the speech text through a pre-trained entity recognition model when the paraphrase intention is to paraphrase into a dialect;
and a conversion module, configured to search a preset database for the dialect speech corresponding to the paraphrase text based on the dialect area and the paraphrase text, so as to convert the speech text into the dialect speech.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in any of the above dialect paraphrasing methods based on a speech model.
A fourth aspect of the embodiments of the present application provides a terminal device, including: a processor, a memory, and a communication bus, where the memory stores a computer-readable program executable by the processor;
the communication bus realizes connection and communication between the processor and the memory;
and the processor, when executing the computer-readable program, implements the steps in any of the above dialect paraphrasing methods based on a speech model.
Beneficial effects: compared with the prior art, the present application provides a dialect paraphrasing method based on a speech model and a related device, where the method includes: acquiring a speech text, and determining the paraphrase intention corresponding to the speech text through a pre-trained intention recognition model; when the paraphrase intention is to paraphrase into a dialect, determining the dialect area and the paraphrase text corresponding to the speech text through a pre-trained entity recognition model; and searching a preset database for the dialect speech corresponding to the paraphrase text based on the dialect area and the paraphrase text, so as to convert the speech text into the dialect speech. The preset database aggregates common expressions for each dialect area; the intention recognition model and the entity recognition model determine the sentence to be paraphrased into a dialect and its target dialect area; and the corresponding dialect speech is then selected from the preset database, which improves the accuracy of dialect paraphrasing and makes the assistant more convenient to use.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive work.
FIG. 1 is a flow chart of the dialect paraphrasing method based on a speech model provided in the present application.
FIG. 2 is a flowchart illustrating an example of the dialect paraphrasing method based on a speech model provided in the present application.
Fig. 3 is a schematic diagram of the model structure of the entity recognition model in the dialect paraphrasing method based on a speech model provided in the present application.
Fig. 4 is a schematic structural diagram of the dialect paraphrasing apparatus based on a speech model provided in the present application.
Fig. 5 is a schematic structural diagram of a terminal device provided in the present application.
Detailed Description
The present application provides a dialect paraphrasing method based on a speech model and a related apparatus. To make the purpose, technical solution, and effect of the present application clearer, the application is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order; the execution order of each process is determined by its function and inherent logic, and should not limit the implementation of this embodiment in any way.
The inventor finds that existing voice assistants respond to common questions in a fairly natural way, but their expressive ability is largely limited by the content of the training corpus and the generalization ability of the model, so they usually appear less intelligent on questions that are uncommon or have no standard answer. For example, when asked how a sentence is said in a particular dialect, they often answer incorrectly or not at all. This limits the usefulness of the voice assistant on the one hand and is inconvenient for the user on the other.
To solve the above problem, in the embodiment of the present application, a speech text is acquired, and the paraphrase intention corresponding to the speech text is determined through a pre-trained intention recognition model; when the paraphrase intention is to paraphrase into a dialect, the dialect area and the paraphrase text corresponding to the speech text are determined through a pre-trained entity recognition model; and the dialect speech corresponding to the paraphrase text is searched for in a preset database based on the dialect area and the paraphrase text, so as to convert the speech text into the dialect speech. The preset database aggregates common expressions for each dialect area; the intention recognition model and the entity recognition model determine the sentence to be paraphrased into a dialect and its target dialect area; and the corresponding dialect speech is then selected from the preset database. This improves the accuracy of dialect paraphrasing and makes the assistant more convenient to use.
The following further describes the content of the application by describing the embodiments with reference to the attached drawings.
This embodiment provides a dialect paraphrasing method based on a speech model; as shown in fig. 1 and fig. 2, the method includes:
S10, acquiring a speech text, and determining the paraphrase intention corresponding to the speech text through a pre-trained intention recognition model.
Specifically, the intention recognition model is a trained neural network model used to recognize the paraphrase intention in the speech text, where the paraphrase intention is either to paraphrase into a dialect or not. It will be appreciated that the input of the intention recognition model is the speech text and the output is the paraphrase intention, that is, whether or not to paraphrase into a dialect. The speech text is a Mandarin speech text, that is, speech information spoken in Mandarin. The speech text may be picked up by the electronic device running the dialect paraphrasing method provided in this embodiment, sent to the electronic device by an external device, obtained from the cloud, and so on. In a typical implementation of this embodiment, the speech text is picked up by the electronic device running the method; the electronic device is configured with a voice pickup device for picking up the speech text to be paraphrased, and a voice broadcast device for playing voice information.
In one implementation of this embodiment, the intention recognition model may be built on a BERT model: the intention recognition model consists of a BERT model connected to a binary classification output layer, and the paraphrase intention is output through the classification layer. Since the BERT model is pre-trained, the intention recognition model can be obtained by fine-tuning it after construction, which speeds up training. The training samples used for fine-tuning may include a Mandarin corpus sentence and the paraphrase intention labeled for it; for example, for the sentence "how do I say 'I like you' in Cantonese" the labeled intention is to paraphrase into a dialect, while for the sentence "I like you very much" the labeled intention is not to paraphrase.
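The input/output contract of the intention recognition model described above can be sketched as follows. This is a hypothetical stand-in: the patent's model is a fine-tuned BERT encoder with a binary classification head, while the pattern rule below only illustrates the same interface (speech text in, paraphrase-intention flag out).

```python
import re

# Hypothetical stand-in for the fine-tuned BERT intention recognition model.
# The surface pattern is illustrative only; a real system would classify the
# utterance with the trained model rather than a regex.
PARAPHRASE_PATTERN = re.compile(r"how (?:do i|to) say .+ in \w+")

def recognize_intention(speech_text: str) -> bool:
    """Return True if the utterance asks to paraphrase into a dialect."""
    return bool(PARAPHRASE_PATTERN.search(speech_text.lower()))
```

In the method's flow, a True result routes the speech text on to the entity recognition step (S20), and a False result leaves it to ordinary assistant handling.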
S20, when the paraphrase intention is to paraphrase into a dialect, determining the dialect area and the paraphrase text corresponding to the speech text through a pre-trained entity recognition model.
Specifically, the entity recognition model is a trained neural network model used to recognize the dialect area and the paraphrase text in the speech text; that is, the input of the entity recognition model is the speech text, and the outputs are the dialect area and the paraphrase text contained in the speech text. For example, if the speech text is "how do I say 'I like you' in Cantonese", then after the speech text is input into the entity recognition model, the model outputs the dialect area corresponding to Cantonese and the paraphrase text "I like you".
In one implementation of this embodiment, as shown in fig. 3, the entity recognition model includes a pre-trained BERT module, a BiLSTM module, and a CRF module, where the BERT module is a pre-trained BERT model through which the speech text is encoded. The training samples for the entity recognition model may include a training corpus sentence together with its labeled dialect area and labeled paraphrase text. For example, for the corpus sentence "how do I say 'I like you' in Cantonese", the labeled dialect area is the one corresponding to Cantonese and the labeled paraphrase text is "I like you".
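The output contract of the entity recognition step can be sketched as below. This is a hedged stand-in: the patent's BERT+BiLSTM+CRF model tags tokens in the utterance, whereas the regex here only illustrates what the step produces, namely a (dialect area, paraphrase text) pair, or nothing when the utterance does not fit.

```python
import re
from typing import Optional, Tuple

# Hypothetical stand-in for the BERT+BiLSTM+CRF entity recognition model.
# Named groups illustrate the two entities the real model would extract.
UTTERANCE = re.compile(r"how do i say (?P<text>.+?) in (?P<area>\w+)\s*$")

def recognize_entities(speech_text: str) -> Optional[Tuple[str, str]]:
    """Return (dialect_area, paraphrase_text) or None if no entities found."""
    m = UTTERANCE.search(speech_text.lower())
    if not m:
        return None
    return m.group("area"), m.group("text").strip("'\"")
```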
S30, searching a preset database for the dialect speech corresponding to the paraphrase text based on the dialect area and the paraphrase text, so as to convert the speech text into the dialect speech.
Specifically, the preset database is pre-established and stores several data groups; each data group includes a dialect area, the dialect speech belonging to that dialect area, and the Mandarin text corresponding to the dialect speech, so that the dialect speech corresponding to the paraphrase text can be determined from the preset database and the speech text converted into the dialect speech. In addition, to help the user understand the dialect speech, each data group may further include the dialect text, which can be displayed synchronously while the dialect speech is played.
Based on this, in one implementation of this embodiment, before acquiring the speech text and determining the paraphrase intention corresponding to the speech text through the pre-trained intention recognition model, the method further includes:
S0, constructing the preset database.
Specifically, the preset database contains at least two data groups whose dialect areas differ. That is, when the data groups in the preset database are partitioned by dialect area, at least two data sets are obtained, each containing at least one data group. In a typical implementation, the preset database includes several data sets, each data set includes several data groups, the dialect areas of the data groups within one data set are the same, and the dialect areas of data groups in different data sets are different. For example, suppose the preset database includes data set A, data set B, and data set C; data set A includes data groups a1, a2, and a3; data set B includes data groups b1, b2, and b3; and data set C includes data groups c1, c2, and c3. Then the dialect area in data group a1 differs from that in data group b1, and the dialect area in data group b1 differs from that in data group c1.
In one implementation of this embodiment, dialect areas may be divided by province or by city; for example, the Cantonese dialect and the Sichuan dialect are divided by province, while the Shenzhen dialect and the Chengdu dialect are divided by city. Each dialect area in the preset database may be a commonly used dialect area, for example the Sichuan dialect area, the Cantonese dialect area, the Hunan dialect area, and so on, and the data set corresponding to each dialect area contains common expressions of that dialect, such as "have you eaten" and "I like you very much".
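One "data group" of the preset database described above can be sketched as a small record. The field names, sample rows, and audio paths below are illustrative assumptions; the patent specifies the four pieces of content (dialect area, dialect text, dialect speech, Mandarin text) but not a concrete schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema for one data group in the preset database.
@dataclass
class DataGroup:
    dialect_area: str      # e.g. "Cantonese"
    dialect_text: str      # the expression written in the dialect
    dialect_speech: str    # illustrative path/ID of the recorded dialect audio
    mandarin_text: str     # the same expression in Mandarin

# Illustrative contents; real entries would cover common expressions per area.
PRESET_DATABASE: List[DataGroup] = [
    DataGroup("Cantonese", "我好中意你", "audio/yue_0001.wav", "我很喜欢你"),
    DataGroup("Sichuanese", "巴适得很", "audio/sc_0001.wav", "非常舒服"),
]

def reference_groups(area: str) -> List[DataGroup]:
    """All data groups belonging to the requested dialect area."""
    return [g for g in PRESET_DATABASE if g.dialect_area == area]
```

`reference_groups` corresponds to the first lookup step of S30: restricting the search to the data groups whose dialect area matches the one recognized from the speech text.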
In one implementation of this embodiment, searching the preset database for the dialect text corresponding to the paraphrase text based on the dialect area and the paraphrase text specifically includes:
searching the preset database for all reference data groups corresponding to the dialect area;
and searching all the found reference data groups for the Mandarin text matching the paraphrase text, and taking the dialect text corresponding to that Mandarin text as the dialect text corresponding to the paraphrase text.
Specifically, each of the reference data groups is included in the preset database; the dialect area in each reference data group is the same as the dialect area corresponding to the speech text, while the dialect area of every other data group in the preset database differs from the dialect area corresponding to the speech text. The Mandarin text matching the paraphrase text is the Mandarin text contained in one of the reference data groups whose text content matches that of the paraphrase text; here, matching means either that the text content of the Mandarin text is identical to that of the paraphrase text, or that the similarity between the two is greater than a preset similarity threshold.
In one implementation of this embodiment, searching all the found reference data groups for the Mandarin text matching the paraphrase text specifically includes:
searching all reference data groups for a target Mandarin text identical in content to the paraphrase text;
if the target Mandarin text is found, taking the target Mandarin text as the Mandarin text matching the paraphrase text;
if the target Mandarin text is not found, determining the similarity between the Mandarin text in each reference data group and the paraphrase text through a pre-trained BERT model, and determining the Mandarin text matching the paraphrase text based on the similarity.
Specifically, a target Mandarin text identical in content to the paraphrase text means that the text content of the target Mandarin text is the same as that of the paraphrase text; for example, if the paraphrase text is "I like you very much", then a target Mandarin text whose content is "I like you very much" is identical to it. The similarity reflects how well the text content of the paraphrase text matches that of a Mandarin text in a reference data group: the higher the similarity, the better the match, and vice versa. Thus the Mandarin text corresponding to the paraphrase text can be determined based on the similarity. Correspondingly, in one implementation of this embodiment, determining the Mandarin text matching the paraphrase text based on the similarity specifically includes:
selecting from all reference data groups the candidate data groups whose similarity is greater than a preset similarity threshold;
if candidate data groups are selected, taking the Mandarin text in the candidate data group with the greatest similarity as the Mandarin text matching the paraphrase text;
and if no candidate data group is selected, taking a default text as the Mandarin text matching the paraphrase text.
Specifically, the preset similarity threshold is set in advance and is the criterion for deciding whether a Mandarin text matches the paraphrase text: when the similarity between the text content of a Mandarin text in a reference data group and that of the paraphrase text is greater than the preset similarity threshold, that Mandarin text can serve as the Mandarin text matching the paraphrase text. In practice, several candidate data groups may exceed the preset similarity threshold; in that case the Mandarin text in the candidate data group with the greatest similarity is used, which improves the accuracy of the converted dialect speech. Of course, in other implementations, one candidate data group may be selected at random and its Mandarin text used as the match.
The default text is set in advance and is used as the Mandarin text matching the paraphrase text when no candidate data group is selected. Its content may be a fallback such as "unknown" or "cannot be translated", so that the user is informed through the default text that the speech text cannot be converted into dialect speech, avoiding paraphrase errors.
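The three-step matching logic above (exact match, similarity fallback above a threshold, default text) can be sketched as follows. This is a hedged sketch: `difflib.SequenceMatcher` stands in for the pre-trained BERT similarity model of the patent, and the threshold value is an illustrative assumption.

```python
import difflib
from typing import List, Optional, Tuple

DEFAULT_TEXT = "cannot be translated"  # fallback text; value is illustrative
SIMILARITY_THRESHOLD = 0.6             # preset similarity threshold (assumed)

def match_mandarin(paraphrase_text: str,
                   reference_groups: List[Tuple[str, str]]) -> str:
    """reference_groups: (mandarin_text, dialect_text) pairs for one dialect area.
    Returns the dialect text for the matching Mandarin text, or DEFAULT_TEXT."""
    # Step 1: exact content match against each reference Mandarin text.
    for mandarin, dialect in reference_groups:
        if mandarin == paraphrase_text:
            return dialect
    # Step 2: similarity fallback (stand-in for the pre-trained BERT model);
    # keep the candidate with the greatest similarity above the threshold.
    best: Optional[Tuple[float, str]] = None
    for mandarin, dialect in reference_groups:
        sim = difflib.SequenceMatcher(None, mandarin, paraphrase_text).ratio()
        if sim > SIMILARITY_THRESHOLD and (best is None or sim > best[0]):
            best = (sim, dialect)
    # Step 3: default text when no candidate exceeds the threshold.
    return best[1] if best else DEFAULT_TEXT
```

Using the greatest-similarity candidate rather than the first one above the threshold matches the embodiment's stated design choice for improving the accuracy of the converted dialect speech.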
In one implementation of this embodiment, after the dialect speech is obtained, it may be played through a voice playing device so that the user hears the dialect speech corresponding to the speech text. The dialect speech is stored in the data group containing the Mandarin text corresponding to the paraphrase text; that data group also contains the dialect text, which may be displayed synchronously.
In summary, this embodiment provides a dialect paraphrasing method based on a speech model, where the method includes: acquiring a speech text, and determining the paraphrase intention corresponding to the speech text through a pre-trained intention recognition model; when the paraphrase intention is to paraphrase into a dialect, determining the dialect area and the paraphrase text corresponding to the speech text through a pre-trained entity recognition model; and searching a preset database for the dialect speech corresponding to the paraphrase text based on the dialect area and the paraphrase text, so as to convert the speech text into the dialect speech. The preset database aggregates common expressions for each dialect area; the intention recognition model and the entity recognition model determine the sentence to be paraphrased into a dialect and its target dialect area; and the corresponding dialect speech is then selected from the preset database, which improves the accuracy of dialect paraphrasing and makes the assistant more convenient to use.
Based on the above dialect transcription method based on a speech model, the present application further provides a dialect transcription apparatus based on a speech model. As shown in fig. 4, the apparatus includes:
an obtaining module 100, configured to obtain a speech text and determine a transcription intention corresponding to the speech text through a pre-trained intention recognition model;
a determining module 200, configured to determine, through a pre-trained entity recognition model, a dialect region and a transcription text corresponding to the speech text when the transcription intention is to transcribe into dialect; and
a conversion module 300, configured to search a preset database, based on the dialect region and the transcription text, for the dialect speech corresponding to the transcription text, so as to convert the speech text into the dialect speech.
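The preset database that the conversion module queries can be pictured as a collection of data groups, each holding the four elements described in this embodiment. The sketch below is only an illustration; the field names and the flat in-memory list are assumptions rather than the patent's actual storage scheme.

```python
from dataclasses import dataclass

@dataclass
class DataGroup:
    dialect_region: str    # e.g. "Cantonese"
    dialect_text: str      # the sentence as written in the dialect
    dialect_speech: bytes  # recorded or synthesized dialect audio
    mandarin_text: str     # the Mandarin text corresponding to the dialect text

def reference_groups(database, region):
    # All reference data groups belonging to one dialect region.
    return [g for g in database if g.dialect_region == region]
```

Filtering by region first, as `reference_groups` does, restricts the subsequent Mandarin-text matching to a single dialect's entries.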
Based on the foregoing dialect transcription method based on a speech model, this embodiment provides a computer-readable storage medium storing one or more programs that can be executed by one or more processors to implement the steps of the dialect transcription method based on a speech model according to the foregoing embodiment.
Based on the above dialect transcription method based on a speech model, the present application further provides a terminal device. As shown in fig. 5, the terminal device includes at least one processor (processor) 20; a display screen 21; a memory (memory) 22; and may further include a communication interface (Communications Interface) 23 and a bus 24. The processor 20, the display screen 21, the memory 22, and the communication interface 23 can communicate with one another through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium.
The memory 22, as a computer-readable storage medium, may be configured to store software programs and computer-executable programs, such as the program instructions or modules corresponding to the methods in the embodiments of the present disclosure. By running the software programs, instructions, or modules stored in the memory 22, the processor 20 executes functional applications and data processing, i.e., implements the methods in the above embodiments.
The memory 22 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high-speed random access memory and may also include a non-volatile memory, for example, any of a variety of media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk; it may also be a transient storage medium.
In addition, the specific working processes of the above apparatus and storage medium, as well as the specific processes loaded and executed by the plurality of instruction processors in the terminal device, have been described in detail in the above method and are not restated herein.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method for dialect transcription based on a speech model, the method comprising:
acquiring a speech text, and determining a transcription intention corresponding to the speech text through a pre-trained intention recognition model;
when the transcription intention is to transcribe into dialect, determining a dialect region and a transcription text corresponding to the speech text through a pre-trained entity recognition model; and
searching a preset database, based on the dialect region and the transcription text, for dialect speech corresponding to the transcription text, so as to convert the speech text into the dialect speech.
2. The method of claim 1, wherein before the acquiring a speech text and determining a transcription intention corresponding to the speech text through a pre-trained intention recognition model, the method further comprises:
constructing the preset database, wherein the preset database comprises a plurality of data groups, and each of the data groups comprises a dialect region, a dialect text belonging to the dialect region, dialect speech corresponding to the dialect text, and a Mandarin text corresponding to the dialect text.
3. The method of claim 2, wherein the intention recognition model and the entity recognition model are pre-trained BERT models.
4. The dialect transcription method based on a speech model of claim 1, wherein the searching a preset database, based on the dialect region and the transcription text, for dialect speech corresponding to the transcription text specifically comprises:
searching the preset database for all reference data groups corresponding to the dialect region; and
searching the found reference data groups for a Mandarin text matched with the transcription text, and taking the dialect speech corresponding to the Mandarin text as the dialect speech corresponding to the transcription text.
5. The method of claim 4, wherein the searching the found reference data groups for a Mandarin text matched with the transcription text specifically comprises:
searching all the reference data groups for a target Mandarin text identical in content to the transcription text;
if the target Mandarin text is found, taking the target Mandarin text as the Mandarin text matched with the transcription text; and
if the target Mandarin text is not found, determining, through a pre-trained BERT model, a similarity between the Mandarin text in each reference data group and the transcription text, and determining the Mandarin text matched with the transcription text based on the similarity.
6. The method of claim 5, wherein the determining the Mandarin text matched with the transcription text based on the similarity specifically comprises:
selecting, from all the reference data groups, candidate data groups whose similarity is greater than a preset similarity threshold;
if a candidate data group is selected, taking the Mandarin text in the candidate data group with the maximum similarity as the Mandarin text matched with the transcription text; and
if no candidate data group is selected, taking a default text as the Mandarin text matched with the transcription text.
7. The method of claim 1, wherein after the searching a preset database, based on the dialect region and the transcription text, for dialect speech corresponding to the transcription text to convert the speech text into the dialect speech, the method further comprises:
playing the dialect speech through a speech playing device.
8. An apparatus for dialect transcription based on a speech model, the apparatus comprising:
an obtaining module, configured to obtain a speech text and determine a transcription intention corresponding to the speech text through a pre-trained intention recognition model;
a determining module, configured to determine, through a pre-trained entity recognition model, a dialect region and a transcription text corresponding to the speech text when the transcription intention is to transcribe into dialect; and
a conversion module, configured to search a preset database, based on the dialect region and the transcription text, for dialect speech corresponding to the transcription text, so as to convert the speech text into the dialect speech.
9. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the dialect transcription method based on a speech model of any one of claims 1-7.
10. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus implements connection and communication between the processor and the memory; and
the processor, when executing the computer readable program, implements the steps of the dialect transcription method based on a speech model of any one of claims 1-7.
CN202111151990.5A 2021-09-29 2021-09-29 Dialect commentary method based on voice model and related device Pending CN113889115A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111151990.5A CN113889115A (en) 2021-09-29 2021-09-29 Dialect commentary method based on voice model and related device

Publications (1)

Publication Number Publication Date
CN113889115A true CN113889115A (en) 2022-01-04

Family

ID=79008084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111151990.5A Pending CN113889115A (en) 2021-09-29 2021-09-29 Dialect commentary method based on voice model and related device

Country Status (1)

Country Link
CN (1) CN113889115A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115827815A (en) * 2022-11-17 2023-03-21 西安电子科技大学广州研究院 Keyword extraction method and device based on small sample learning
CN115827815B (en) * 2022-11-17 2023-12-29 西安电子科技大学广州研究院 Keyword extraction method and device based on small sample learning

Similar Documents

Publication Publication Date Title
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
JP5901001B1 (en) Method and device for acoustic language model training
CN110364171A (en) A kind of audio recognition method, speech recognition system and storage medium
CN111090727B (en) Language conversion processing method and device and dialect voice interaction system
CN109920414A Man-machine interrogation method, apparatus, equipment and storage medium
WO2021103712A1 (en) Neural network-based voice keyword detection method and device, and system
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
US20210350073A1 (en) Method and system for processing user inputs using natural language processing
WO2023045186A1 (en) Intention recognition method and apparatus, and electronic device and storage medium
CN111192572A (en) Semantic recognition method, device and system
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
CN105895076B (en) A kind of phoneme synthesizing method and system
CN113889115A (en) Dialect commentary method based on voice model and related device
CN113268989A (en) Polyphone processing method and device
CN110969005A (en) Method and device for determining similarity between entity corpora
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN115860015A (en) Translation memory-based transcribed text translation method and computer equipment
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
CN111090720B (en) Hot word adding method and device
CN114420159A (en) Audio evaluation method and device and non-transient storage medium
CN110809796A (en) Speech recognition system and method with decoupled wake phrases
CN113160804A (en) Hybrid voice recognition method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination