WO2023226767A1

WO2023226767A1 - Model training method and apparatus, and speech meaning understanding method and apparatus

Info

Publication number: WO2023226767A1
Application number: PCT/CN2023/093289
Authority: WO
Inventors: 薛兰青; 应缜哲; 林金镇; 吴晓烽
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2022-05-23
Filing date: 2023-05-10
Publication date: 2023-11-30
Also published as: CN115019786A

Abstract

A fuzzy sound recognition model training method and apparatus, a speech meaning understanding method and apparatus, a computing device, and a computer readable storage medium. The fuzzy sound recognition model training method comprises: obtaining a sample text with semantic meaning comprising a plurality of characters (101); for each character in the sample text, acquiring pinyin of the character (103); on the basis of the pinyin of each character in the sample text, obtaining a fuzzy sound corresponding to each character (105); and, using the sample text, the fuzzy sound corresponding to each character in the sample text, and labels of the sample text, training a fuzzy sound recognition model (107). The present method enables speech meaning to be understood more accurately.

Description

Model training method and device and speech meaning understanding method and device

Technical field

The present application relates to electronic information technology, and in particular to methods and devices for training fuzzy sound recognition models, and methods and devices for understanding speech meaning.

Background art

Currently, speech recognition technology is widely used. When applying speech recognition technology, the speech spoken by the user is usually recognized first, converted from speech to text, and then the meaning of the text is understood to obtain the meaning of the speech and perform related processing.

However, the current speech recognition technology is not yet mature. When converting from speech to text, recognition errors often occur. Based on the wrong text, the meaning of the speech cannot be accurately understood.

Summary of the invention

One or more embodiments of this specification describe methods and devices for training fuzzy sound recognition models and methods and devices for understanding speech meaning, which can more accurately understand the meaning of speech.

According to a first aspect, a method for training a fuzzy sound recognition model is provided, including: obtaining a sample text that has semantics and includes multiple characters; for each character in the sample text, obtaining the pinyin of the character; obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text; and training the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.

Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text includes: for the pinyin of each character in the sample text, judging whether the pinyin includes a first pronunciation, where the first pronunciation satisfies: the pronunciation of a second pronunciation can be confused with the pronunciation of the first pronunciation; if not, using the pinyin directly as the fuzzy sound corresponding to the character; if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.

Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text includes: splitting the pinyin of each character in the sample text into an initial and a final; for the split initial, judging whether the initial includes the first pronunciation; if not, using the initial directly as the fuzzy sound corresponding to the initial of the character; if so, replacing the first pronunciation in the initial with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the initial of the character; for the split final, judging whether the final includes the first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.

For the split final, before judging whether the final includes the first pronunciation, the method further includes: judging whether the final includes a rhyme head (medial), and if so, deleting the rhyme head from the final. Judging whether the final includes the first pronunciation then includes: judging whether the final with the rhyme head deleted includes the first pronunciation.

Training the fuzzy sound recognition model includes: for each character in the sample text, generating a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial of the character, and the fuzzy sound corresponding to the final of the character; and, in the order of the characters in the sample text, inputting the triplet corresponding to each character, together with the label, into the fuzzy sound recognition model to be trained.

The labels of the sample text include: labels given from at least one dimension among the emotion dimension, the domain dimension, the subject matter dimension, and the text meaning dimension.

According to a second aspect, a method for understanding the meaning of speech is provided, including: obtaining a first text, the first text being generated by performing speech recognition on speech; for each character in the first text, obtaining the pinyin of the character; obtaining the fuzzy sound corresponding to each character according to the pinyin of each character; inputting the first text and the fuzzy sound corresponding to each character in the first text into a fuzzy sound recognition model, and obtaining a second text output by the fuzzy sound recognition model; and understanding the second text to obtain the meaning of the speech.

Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the first text includes: for the pinyin of each character in the first text, judging whether the pinyin includes a first pronunciation, where the first pronunciation satisfies: the pronunciation of a second pronunciation can be confused with the pronunciation of the first pronunciation; if not, using the pinyin directly as the fuzzy sound corresponding to the character; if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.

Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character includes: splitting the pinyin of each character in the first text into an initial and a final; for the split initial, judging whether the initial includes the first pronunciation; if not, using the initial directly as the fuzzy sound corresponding to the initial of the character; if so, replacing the first pronunciation in the initial with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the initial of the character; for the split final, judging whether the final includes the first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.

For the split final, before judging whether the final includes the first pronunciation, the method further includes: judging whether the final includes a rhyme head (medial), and if so, deleting the rhyme head from the final. Judging whether the final includes the first pronunciation then includes: judging whether the final with the rhyme head deleted includes the first pronunciation.

Inputting the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model includes: for each character in the first text, generating a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial of the character, and the fuzzy sound corresponding to the final of the character; and, in the order of the characters in the first text, inputting the triplet corresponding to each character into the fuzzy sound recognition model in turn.

According to a third aspect, a training device for a fuzzy sound recognition model is provided, including: a sample text acquisition module configured to obtain a sample text that has semantics and includes multiple characters; a pinyin acquisition module configured to obtain, for each character in the sample text, the pinyin of the character; a fuzzy sound generation module configured to obtain the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text; and a training execution module configured to train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.

According to a fourth aspect, a device for understanding speech meaning is provided, including: a speech recognition result receiving module configured to obtain a first text, the first text being generated by performing speech recognition on speech; a character pinyin generation module configured to obtain the pinyin of each character in the first text; a character fuzzy sound generation module configured to obtain the fuzzy sound corresponding to each character according to the pinyin of each character; an input module configured to input the first text and the fuzzy sound corresponding to each character in the first text into a fuzzy sound recognition model, to obtain a second text output by the fuzzy sound recognition model; and a speech meaning understanding module configured to understand the second text and obtain the meaning of the speech.

According to a fifth aspect, a computing device is provided, including a memory and a processor. The memory stores executable code, and when the processor executes the executable code, the method described in any embodiment of this specification is implemented.

The embodiments of this specification can train a fuzzy sound recognition model that can correct text errors in speech recognition, and based on this model, the meaning of speech can be more accurately understood.

Description of the drawings

In order to more clearly illustrate the technical solutions in the embodiments of this specification, the drawings needed to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are some embodiments of this specification. Those of ordinary skill in the art can also obtain other drawings based on these drawings without exerting creative efforts.

Figure 1 is a flow chart of a training method for a fuzzy sound recognition model in one embodiment of this specification.

Figure 2 is a flow chart of a method for understanding speech meaning in one embodiment of this specification.

Figure 3 is a schematic structural diagram of a training device for a fuzzy sound recognition model in one embodiment of this specification.

Figure 4 is a schematic structural diagram of a device for understanding speech meaning in an embodiment of this specification.

Detailed description

As mentioned above, recognition errors often occur when converting speech to text, and the meaning of the speech cannot be accurately understood from erroneous text. For example, a machine-implemented intelligent customer service system asks the user: "Are you buying physical or virtual items?" The user answers by voice and intends to say "physical items", but because the user speaks a dialect, speech recognition fails and the recognized text is "four or five". The intelligent customer service system cannot understand the meaning of the user's speech from the incorrectly recognized text, which results in a business error.

The solutions provided in this specification will be described below in conjunction with the accompanying drawings.

An analysis of the speech recognition process shows that an important cause of speech recognition errors is non-standard pronunciation by the user. People in different regions may speak different dialects and confuse one pronunciation with another. For example, some speakers mix flat-tongue sounds with curled-tongue (retroflex) sounds and pronounce "z, c, s" in pinyin as "zh, ch, sh", so that "bicycle" ("zi xing che") is pronounced "zhi xing che". As another example, people in some places pronounce "j, q, x" in pinyin as "z, c, s", for instance pronouncing "learning" as "zin siu". As yet another example, people in some places confuse "f" and "h" in pinyin, for instance pronouncing "objection" as "huǎn dui".

Therefore, if the pronunciation of each character in the speech-recognized text can be corrected using the characteristics of pronunciation mixing, it can effectively solve the problem of speech recognition errors caused by users' non-standard pronunciation.

To correct speech recognition errors by exploiting these patterns of pronunciation confusion, in one embodiment of this specification a fuzzy sound recognition model can be trained in advance. In actual business applications, the fuzzy sound recognition model can then be used to correct erroneous speech-recognized text, so that the meaning of the speech is understood more accurately.

The methods in the embodiments of this specification can be applied to various speech recognition scenarios, for example Scenarios 1 to 3 below.

Scenario 1. Intelligent customer service system

After the user inputs a piece of speech by phone or over the Internet, the intelligent customer service system (such as the machine customer service of the Alipay platform) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the method for understanding speech meaning provided by the embodiments of this specification, errors in the speech-recognized text can be corrected to obtain text that better matches the user's intention, so that the machine understands the meaning of the user's speech more correctly, such as whether the user is purchasing physical or virtual items, or needs a return or exchange.

Scenario 2. Artificial Intelligence System

After the user utters a piece of speech in a live conversation, by phone, or over the Internet, the artificial intelligence system (such as a robot) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the method for understanding speech meaning provided by the embodiments of this specification, the artificial intelligence system can correct errors in the speech-recognized text and obtain text that better matches the user's intention, so that the machine understands the meaning of the user's speech more correctly, such as a command for the robot to change its walking route.

Scenario 3. Smart home system based on the Internet of Things

After the user utters a piece of speech in a live conversation, by phone, or over a network, the smart home system (such as a smart TV) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the method for understanding speech meaning provided by the embodiments of this specification, the smart home system can correct errors in the speech-recognized text and obtain text that better matches the user's intention, so that the machine understands the meaning of the user's speech more correctly, such as a command for the smart TV to start recording a TV show at a certain time.

The following will describe the implementation of the embodiments of this specification in two aspects. The first aspect describes the training method of the fuzzy sound recognition model, and the second aspect describes the method of understanding the meaning of speech.

First, in the first aspect, the training method of the fuzzy sound recognition model is explained.

Figure 1 is a flow chart of a training method for a fuzzy sound recognition model in one embodiment of this specification. The method is executed by the training device of the fuzzy sound recognition model. It can be understood that the method can also be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to Figure 1, the method includes: Step 101: obtain a sample text that has semantics and includes multiple characters; Step 103: for each character in the sample text, obtain the pinyin of the character; Step 105: according to the pinyin of each character in the sample text, obtain the fuzzy sound corresponding to each character; Step 107: train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.

In the process shown in Figure 1, training takes into account that a user's pronunciation may be non-standard, so that the pronunciation of one character, such as a Chinese character, is confused with that of another. The concept of a fuzzy sound is therefore introduced: no matter what accent or manner of pronunciation the user has, two characters that are confused because of differing pronunciations are unified into the same fuzzy sound. The fuzzy sound recognition model thus learns the pronunciation of the confused characters and, combined with the context of the sample text, obtains the correct characters, so that the model can subsequently be used to correct errors in speech-recognized text.

Each step shown in Figure 1 will be described below with specific examples.

First, step 101: obtain a sample text that has semantics and includes multiple characters.

To train the fuzzy sound recognition model, sample text is required. The sample text can be any type of text with semantics, such as an article, a user complaint, or a product description. So that the fuzzy sound recognition model can learn the various pronunciation errors caused by users' accents and the corresponding character errors, the sample text should include at least one character produced by non-standard pronunciation (a user's accent or dialect). For example, the sample text includes "...至行车...,...发路". Because of the user's accent, the "zi" in "bicycle" (自行车, "zi xing che") is pronounced "zhi", and "至" ("zhi") is the character produced by that non-standard pronunciation. Likewise, the "nu" in "anger" (发怒, "fa nu") is pronounced "lu", and "路" ("lu") is the character produced by that non-standard pronunciation.

In the embodiment of this specification, the characters may include: at least one of Chinese characters, English letters, and punctuation marks.

In the embodiments of this specification, the sample text has a label, which may be given from at least one of the emotion dimension, the domain dimension, the subject matter dimension, and the text meaning dimension. For example, a label may indicate that the emotion expressed in the sample text is anger, that the sample text belongs to the field of user complaints, or that the meaning of the sample text is that the user purchased physical items. Based on the label, the fuzzy sound recognition model can determine whether each character in the sample text and its fuzzy sound have been learned correctly.

Next, step 103: for each character in the sample text, obtain the pinyin of the character.

Here, the standard pinyin of each character can be obtained from the dictionary.

For example, the sample text includes "...至行车...,...发路". In step 103, the pinyin "zhi" is obtained for the character "至", "xing" for the character "行", "che" for the character "车", and "lu" for the character "路".

Next, step 105: According to the pinyin of each character in the sample text, obtain the fuzzy pronunciation corresponding to each character.

The embodiments of this specification introduce the concept of a fuzzy sound. The fuzzy sound corresponding to the pinyin of a character satisfies the following: when the pinyin does not include an easily confused first pronunciation, the fuzzy sound is identical to the pinyin; when the pinyin includes a first pronunciation, the fuzzy sound is a pronunciation with which the pinyin is confused. In this way, no matter what accent or manner of pronunciation the user has, two characters that are confused because of differing pronunciations are unified into the same fuzzy sound, which allows the fuzzy sound recognition model to learn the pronunciation of the confused characters.

As mentioned above, because of an accent or dialect, users often confuse one pronunciation with another. For example, "z, c, s" are confused with "zh, ch, sh" respectively, "ing" is confused with "in", and "f" is confused with "h". The correspondence between first pronunciations and second pronunciations that are easily confused with each other can therefore be set in advance, to make clear which pairs of pronunciations are easily confused. For example, the correspondence between the first pronunciation and the second pronunciation is recorded in Table 1 below.

Table 1

In the above Table 1, the first pronunciation is usually the pronunciation of users with accents or dialects, and the second pronunciation is the original pronunciation of the characters. It can be understood that the above Table 1 is only schematic. In actual services, different correspondences between the first pronunciation and the second pronunciation can be set according to different application locations, that is, different accent characteristics of users. After setting the corresponding relationship such as that shown in Table 1, the corresponding relationship can be used to replace the pinyin of the characters in the sample text with fuzzy sounds.

Step 105 can be implemented in the following two ways. Method 1: replace pinyin with fuzzy sounds taking a whole pinyin syllable as the unit. Method 2: replace pinyin with fuzzy sounds taking the initial and the final as separate units.

Method 1 is described first. In one embodiment of this specification, the implementation of step 105 based on Method 1 includes: Step 1051A: for the pinyin of each character in the sample text, determine whether the pinyin includes a first pronunciation, where the first pronunciation satisfies: the pronunciation of a second pronunciation can be confused with the pronunciation of the first pronunciation. If not, step 1053A is executed; if so, step 1055A is executed.

Step 1053A: Use the pinyin directly as the fuzzy sound corresponding to the character.

Step 1055A: Replace the first pronunciation in the pinyin with the second pronunciation, and the resulting pinyin after replacement is used as the fuzzy pronunciation corresponding to the character.

The process from step 1051A to step 1055A is illustrated with an example. In step 103 above, for the characters in the sample text "...至行车...,...发路", the pinyin obtained is "zhi", "xing", "che", and "lu" respectively. In steps 1051A to 1055A, first, the pinyin "zhi" of the character "至" includes the first pronunciation "zh" in Table 1, so "zh" is replaced with its corresponding second pronunciation "z", and the resulting pinyin "zi" is used as the fuzzy sound corresponding to the character "至". Next, the pinyin "xing" of the character "行" does not include any first pronunciation in Table 1, so "xing" is used directly as the fuzzy sound corresponding to the character "行". By analogy, the pinyin "lu" of the character "路" includes the first pronunciation "l" in Table 1, so "l" is replaced with its corresponding second pronunciation "n", and the resulting pinyin "nu" is used as the fuzzy sound corresponding to the character "路".

In this way, after the processing from step 1051A to step 1055A, the fuzzy sounds corresponding to each character obtained include: "...zi xing che,...fa nu".
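Steps 1051A to 1055A can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the table `FIRST_TO_SECOND` contains only the two pairs that the worked example mentions ("zh" to "z", "l" to "n"), the function name `fuzzy_pinyin` is hypothetical, and the sketch assumes that a first pronunciation occurs at the start of the syllable, which holds for these two pairs.

```python
# Hypothetical excerpt of the first -> second pronunciation table.
# Only the two pairs from the worked example are listed; a real table
# would be configured for the accents of the target region.
FIRST_TO_SECOND = {"zh": "z", "l": "n"}


def fuzzy_pinyin(pinyin: str, table: dict = FIRST_TO_SECOND) -> str:
    """Return the fuzzy sound for one pinyin syllable (Method 1)."""
    # Try longer keys first so that a two-letter key such as "zh" is
    # checked before any one-letter key.
    for first in sorted(table, key=len, reverse=True):
        if pinyin.startswith(first):
            # Step 1055A: replace the first pronunciation with the second.
            return table[first] + pinyin[len(first):]
    # Step 1053A: no first pronunciation, the pinyin is its own fuzzy sound.
    return pinyin


pinyins = ["zhi", "xing", "che", "fa", "lu"]
fuzzy = [fuzzy_pinyin(p) for p in pinyins]
print(fuzzy)  # ['zi', 'xing', 'che', 'fa', 'nu']
```

The output matches the fuzzy sounds of the example, "zi xing che" and "fa nu". Supporting first pronunciations that occur inside a final would need an additional suffix check, which is omitted here.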

Method 2 is described below. To improve training efficiency and reduce training difficulty, in one embodiment of this specification Method 2 can be adopted: the pinyin is split into an initial and a final, the initial and the final are each checked for the first pronunciation, and each is then replaced with a fuzzy sound separately. In this case, step 105 includes: Step 1051B: split the pinyin of each character in the sample text into an initial and a final; Step 1053B: for the split initial, determine whether the initial includes the first pronunciation; if not, use the initial directly as the fuzzy sound corresponding to the initial of the character; if so, replace the first pronunciation in the initial with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the initial of the character; Step 1055B: for the split final, determine whether the final includes the first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.
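A minimal Python sketch of steps 1051B to 1055B is given here under stated assumptions. The list `INITIALS` is the commonly counted set of 23 pinyin initials (including "y" and "w"); `INITIAL_TABLE` holds only the two pairs from the worked example; `FINAL_TABLE` holds a single placeholder pair that is purely hypothetical, because the content of Table 1 is not reproduced in this text. The function names are illustrative, and "includes the first pronunciation" is simplified to an exact match on the whole initial or final.

```python
# 23 pinyin initials; two-letter initials come first so that "zh", "ch"
# and "sh" are matched before "z", "c" and "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

INITIAL_TABLE = {"zh": "z", "l": "n"}  # pairs from the worked example
FINAL_TABLE = {"eng": "en"}            # hypothetical placeholder pair


def split_pinyin(pinyin: str) -> tuple:
    """Step 1051B: split a syllable into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    return "", pinyin  # syllable without an initial, e.g. "an"


def fuzzy_parts(pinyin: str) -> tuple:
    """Steps 1053B and 1055B: fuzzy sounds of the initial and the final."""
    initial, final = split_pinyin(pinyin)
    # dict.get falls back to the unchanged part when no pair applies.
    return INITIAL_TABLE.get(initial, initial), FINAL_TABLE.get(final, final)


print(fuzzy_parts("zhi"))   # ('z', 'i')
print(fuzzy_parts("xing"))  # ('x', 'ing')
print(fuzzy_parts("lu"))    # ('n', 'u')
```

Keeping the two tables separate means that an initial pair and a final pair can be configured independently for each region.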

In one embodiment of this specification, to further improve training efficiency and reduce training difficulty, the final can be simplified further by deleting its rhyme head (medial). For example, the final of the pinyin "guang" is "uang". The rhyme head "u" contributes relatively little to the pronunciation of "guang", whereas "ang" contributes relatively much and is the key part of the final's pronunciation. To improve efficiency, the influence of the rhyme head on pronunciation can therefore be ignored during training. Accordingly, in step 1055B, before determining whether the final includes the first pronunciation, the method further includes: determining whether the final includes a rhyme head, and if so, deleting the rhyme head from the final. Step 1055B then determines whether the final with the rhyme head deleted includes the first pronunciation; if not, the final with the rhyme head deleted is used as the fuzzy sound corresponding to the final of the character; if so, the first pronunciation in the final with the rhyme head deleted is replaced with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.
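The simplification of the final can be sketched as follows. The rule used here is an assumption made for illustration: a leading "i", "u", or "ü" (with "v" as a common ASCII stand-in for "ü") is treated as the part to delete only when it is followed by "a", "o", or "e". The text does not state how the embodiment detects that part, and contracted spellings such as "ui" and "iu" are left unchanged by this simplified rule. The name `strip_medial` is hypothetical.

```python
MEDIALS = ("i", "u", "ü", "v")  # "v" is an ASCII stand-in for "ü"


def strip_medial(final: str) -> str:
    """Delete a leading glide from a final, e.g. "uang" -> "ang"."""
    # Assumed rule: the leading letter counts as deletable only when a
    # main vowel (a, o, e) follows it, so "ing" and "u" stay intact.
    if len(final) > 1 and final[0] in MEDIALS and final[1] in "aoe":
        return final[1:]
    return final


print(strip_medial("uang"))  # ang
print(strip_medial("ian"))   # an
print(strip_medial("ing"))   # ing
```

After this step, the replacement of step 1055B is applied to the shortened final rather than to the original one.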

Next, step 107: use the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text to train the fuzzy sound recognition model.

Continuing the above example, if step 105 adopts Method 1, the information input to the fuzzy sound recognition model in step 107 includes "...至(zi)行(xing)车(che)...,...发(fa)路(nu)" and the label of the sample text, such as "traffic accident dispute".
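One possible way to assemble such an input is sketched below. The serialization "character(fuzzy sound)" mirrors the notation of the example, but the actual encoding fed to the model is not specified in this text, so the function `annotate`, the dictionary layout of `sample`, and the characters "至行车" (an assumed rendering of the running example) are illustrative assumptions.

```python
def annotate(chars: str, fuzzy_sounds: list) -> str:
    """Pair each character with its fuzzy sound as "char(sound)"."""
    if len(chars) != len(fuzzy_sounds):
        raise ValueError("each character needs exactly one fuzzy sound")
    return "".join(f"{ch}({fs})" for ch, fs in zip(chars, fuzzy_sounds))


# Hypothetical training sample for Method 1: annotated text plus label.
sample = {
    "input": annotate("至行车", ["zi", "xing", "che"]),
    "label": "traffic accident dispute",
}
print(sample["input"])  # 至(zi)行(xing)车(che)
```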

If step 105 adopts Method 2, the implementation of step 107 includes: Step 1071: for each character in the sample text, generate a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial of the character, and the fuzzy sound corresponding to the final of the character; Step 1073: in the order of the characters in the sample text, input the triplet corresponding to each character, together with the label, into the fuzzy sound recognition model to be trained.
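Step 1071 can be sketched as follows. Everything named here is an assumption for illustration: `PINYIN` is a tiny stand-in for a real dictionary lookup, the characters are an assumed rendering of the running example, `INITIAL_TABLE` lists only the pairs from the worked example, and `FINAL_TABLE` is left empty because no final pair with a known direction is given in this text. Deletion of the rhyme head is omitted for brevity.

```python
# Hypothetical mini dictionary standing in for a real pinyin lookup.
PINYIN = {"至": "zhi", "行": "xing", "车": "che", "发": "fa", "路": "lu"}

# Two-letter initials first so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

INITIAL_TABLE = {"zh": "z", "l": "n"}  # pairs from the worked example
FINAL_TABLE = {}                       # no final pairs assumed here


def make_triplets(text: str) -> list:
    """Build (character, fuzzy initial, fuzzy final) in text order."""
    triplets = []
    for ch in text:
        pinyin = PINYIN[ch]  # raises KeyError for unknown characters
        initial = next((i for i in INITIALS if pinyin.startswith(i)), "")
        final = pinyin[len(initial):]
        triplets.append((ch,
                         INITIAL_TABLE.get(initial, initial),
                         FINAL_TABLE.get(final, final)))
    return triplets


triplets = make_triplets("至行车发路")
for t in triplets:
    print(t)
# ('至', 'z', 'i')
# ('行', 'x', 'ing')
# ('车', 'ch', 'e')
# ('发', 'f', 'a')
# ('路', 'n', 'u')
```

The list preserves the character order of the text, which is what step 1073 relies on when the triplets are fed to the model one after another.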

Comparing Method 1 and Method 2: there are normally 23 initials and 24 finals. With Method 1, the fuzzy sound recognition model has to learn a total of 23*24 = 552 unknowns in order to learn the fuzzy sounds. With Method 2, it has to learn only 23+24 = 47 unknowns. Method 2 can therefore greatly improve the training efficiency of the fuzzy sound recognition model and reduce the training difficulty.

Regardless of whether Method 1 or Method 2 is adopted, the fuzzy sound recognition model has learned from other characters during training that the pronunciation of "自" is "zi" and the pronunciation of "怒" is "nu". Combined with the context and the label of the sample text, the model can therefore correct the sample text based on the input information, for example to "...自行车...,...发怒" ("...bicycle...,...anger"), and so obtain the correct meaning.

It can be understood that the training of the fuzzy sound recognition model will be conducted for multiple rounds, using multiple sample texts for training. Refer to the above embodiment for the training process of each round until the fuzzy sound recognition model converges.

After training the fuzzy sound recognition model, the fuzzy sound recognition model can be used to understand the meaning of the speech.

The following describes the second aspect, the method of understanding the meaning of speech.

Figure 2 is a flow chart of a method for understanding speech meaning in one embodiment of this specification. The method is executed by a device for understanding speech meaning. It can be understood that the method can also be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to Figure 2, the method includes steps 201 to 209.

Step 201: Obtain the first text; the first text is generated by performing speech recognition on speech.

Step 203: For each character in the first text, obtain the pinyin of the character.

Step 205: According to the pinyin of each character in the first text, obtain the fuzzy pronunciation corresponding to each character in the first text.

Step 207: Input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the fuzzy sound recognition model.

Step 209: Understand the second text and obtain the speech meaning.

It can be seen that in the process shown in Figure 2 above, no matter what accent or pronunciation habit the user has, the fuzzy sound unifies two characters that are confused because of different pronunciations into the same fuzzy sound. The fuzzy sound recognition model, combined with the context of the first text, can then recover the characters the user actually intended, that is, obtain a second text that reflects the real semantics, thereby correcting the speech recognition errors in the first text. Based on the second text, the machine can understand the meaning of the user's speech more accurately.
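The flow of steps 201 to 209 can be sketched as a small pipeline in which each step is a pluggable function. Everything here is illustrative: `understand_speech` and its parameter names are hypothetical, and the dictionary, fuzzy rule, correction table, and meaning table in the usage part are toy stand-ins for the real components (in particular, the lookup table stands in for the trained fuzzy sound recognition model).

```python
from typing import Callable, List


def understand_speech(
    first_text: str,
    get_pinyin: Callable[[str], str],
    get_fuzzy: Callable[[str], str],
    correct_text: Callable[[str, List[str]], str],
    understand: Callable[[str], str],
) -> str:
    # Step 201 happens before this call: first_text is the speech recognition output.
    pinyins = [get_pinyin(ch) for ch in first_text]        # step 203
    fuzzy_sounds = [get_fuzzy(p) for p in pinyins]         # step 205
    second_text = correct_text(first_text, fuzzy_sounds)   # step 207
    return understand(second_text)                         # step 209


# Toy stand-ins (all hypothetical).
PINYIN = {"知": "zhi", "自": "zi", "行": "xing", "车": "che"}
KNOWN = {("zi", "xing", "che"): "自行车"}

result = understand_speech(
    "知行车",  # hypothetical misrecognition of "自行车"
    get_pinyin=PINYIN.__getitem__,
    get_fuzzy=lambda p: "z" + p[2:] if p.startswith("zh") else p,
    correct_text=lambda text, fuzzy: KNOWN.get(tuple(fuzzy), text),
    understand=lambda text: {"自行车": "bicycle"}.get(text, "unknown"),
)
print(result)  # bicycle
```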

Each step in Figure 2 is explained below.

First, step 201: obtain the first text; the first text is generated after speech recognition of speech.

Here, the first text is generated after speech recognition of the user's voice in an actual application scenario. For example, the user inputs a piece of speech to the intelligent customer service system, and the speech recognition system recognizes the speech, thereby obtaining the first text.

Next, step 203: for each character in the first text, obtain the pinyin of the character.

Here, the pinyin of each character can be looked up in a dictionary.
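A dictionary lookup of this kind might look like the following sketch. The three-entry dictionary and the name `lookup_pinyin` are hypothetical; a real system would need a full lexicon and a way to choose among the readings of polyphonic characters, which this sketch does not attempt.

```python
from typing import Dict, List, Optional

# Tiny hand-written dictionary, for illustration only.
PINYIN_DICT: Dict[str, str] = {"自": "zi", "行": "xing", "车": "che"}


def lookup_pinyin(text: str, dictionary: Dict[str, str]) -> List[Optional[str]]:
    """Return the pinyin of each character, or None when the dictionary has no entry."""
    return [dictionary.get(ch) for ch in text]


print(lookup_pinyin("自行车", PINYIN_DICT))
```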

Next, step 205: According to the pinyin of each character in the first text, obtain the fuzzy pronunciation corresponding to each character in the first text.

Step 205 can also be implemented using Method 1 or Method 2 described above.

When using Method 1, in one embodiment of this specification, the implementation of step 205 includes: for the pinyin of each character in the first text, determine whether the pinyin includes a first pronunciation, where the first pronunciation satisfies: the pronunciation of a second pronunciation will be confused as the pronunciation of the first pronunciation; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.
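The whole-pinyin replacement of the first method can be sketched as below. The pairs in `INITIAL_PAIRS` and `FINAL_PAIRS` are assumed examples of (first pronunciation, second pronunciation) pairs and are not taken from the text; the function name `fuzzy_pinyin` is likewise hypothetical.

```python
from typing import Dict

# Hypothetical (first pronunciation -> second pronunciation) pairs.
INITIAL_PAIRS: Dict[str, str] = {"zh": "z", "ch": "c", "sh": "s", "l": "n"}
FINAL_PAIRS: Dict[str, str] = {"ang": "an", "eng": "en", "ing": "in"}


def fuzzy_pinyin(pinyin: str) -> str:
    """Replace any first pronunciation found in the whole pinyin with its second pronunciation."""
    result = pinyin
    # A first pronunciation at the start of the syllable.
    for first, second in INITIAL_PAIRS.items():
        if result.startswith(first):
            result = second + result[len(first):]
            break
    # A first pronunciation at the end of the syllable.
    for first, second in FINAL_PAIRS.items():
        if result.endswith(first):
            result = result[: -len(first)] + second
            break
    return result


# Two syllables that are easily confused map to the same fuzzy sound.
print(fuzzy_pinyin("zhi"), fuzzy_pinyin("zi"))  # zi zi
```

If no first pronunciation is found, the pinyin is returned unchanged, which corresponds to the "if not" branch.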

When using Method 2, in one embodiment of this specification, the implementation of step 205 includes: splitting the pinyin of each character in the first text into an initial consonant and a final; for the split initial consonant, determine whether the initial consonant includes a first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; for the split final, determine whether the final includes a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the final of the character.
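The split-and-replace procedure of the second method can be sketched as follows. The confusion pairs are assumed examples, the helper names are hypothetical, and the sketch simplifies "includes a first pronunciation" to an exact match on the whole initial or final.

```python
from typing import Dict, List, Tuple

# The 23 initials; two-letter initials come first so prefix matching picks them.
INITIALS: List[str] = [
    "zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
    "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w",
]

# Hypothetical (first pronunciation -> second pronunciation) pairs.
INITIAL_PAIRS: Dict[str, str] = {"zh": "z", "ch": "c", "sh": "s", "l": "n"}
FINAL_PAIRS: Dict[str, str] = {"ang": "an", "eng": "en", "ing": "in"}


def split_pinyin(pinyin: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    return "", pinyin  # syllable without an initial, such as "an"


def fuzzy_parts(pinyin: str) -> Tuple[str, str]:
    """Return (fuzzy initial, fuzzy final) for one syllable."""
    initial, final = split_pinyin(pinyin)
    # Unmatched parts are kept as-is, matching the "if not" branches.
    return INITIAL_PAIRS.get(initial, initial), FINAL_PAIRS.get(final, final)


print(fuzzy_parts("zhang"))  # ('z', 'an')
```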

When using Method 2, for the split final, before determining whether the final includes a first pronunciation, the method further includes: determining whether the final includes a rhyme head, and if so, deleting the rhyme head from the final. Accordingly, determining whether the final includes a first pronunciation means: determining whether the final, after the rhyme head is deleted, includes a first pronunciation.
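This pre-processing of the final can be sketched as below, assuming that the element deleted from the final is the rhyme head (the medial "i" or "u" glide before the main vowel). The table `MEDIAL_FINALS` is an illustrative subset rather than a complete list, and the confusion pairs and the name `fuzzy_final` are hypothetical.

```python
from typing import Dict

# Illustrative subset: final with a rhyme head -> final with the rhyme head deleted.
MEDIAL_FINALS: Dict[str, str] = {
    "ia": "a", "iao": "ao", "ian": "an", "iang": "ang",
    "ua": "a", "uai": "ai", "uan": "an", "uang": "ang",
}

# Hypothetical (first pronunciation -> second pronunciation) pairs for finals.
FINAL_PAIRS: Dict[str, str] = {"ang": "an", "eng": "en", "ing": "in"}


def fuzzy_final(final: str) -> str:
    """Delete the rhyme head if present, then apply the first-to-second replacement."""
    stripped = MEDIAL_FINALS.get(final, final)
    return FINAL_PAIRS.get(stripped, stripped)


print(fuzzy_final("iang"), fuzzy_final("uan"))  # an an
```

Under this reading, deleting the rhyme head first means fewer distinct finals reach the replacement step.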

For the specific implementation of step 205, refer to the relevant description of step 105 above; the processing ideas are the same.

Next, step 207: input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the fuzzy sound recognition model.

When step 205 is implemented using Method 2, step 207 includes: for each character in the first text, generate a triplet corresponding to the character, the triplet including: the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; following the order of the characters in the first text, input the triplet corresponding to each character into the fuzzy sound recognition model in turn.

For the description of step 207, refer to the above description of step 107; the processing ideas are the same.

In one embodiment of this specification, a training device for a fuzzy sound recognition model is provided. Referring to Figure 3, the device includes: a sample text acquisition module 301 configured to obtain a sample text with semantic meaning that includes multiple characters; a pinyin acquisition module 302 configured to obtain the pinyin of each character in the sample text; a fuzzy sound generation module 303 configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character in the sample text; and a training execution module 304 configured to train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.

In one embodiment of the device of this specification, the fuzzy sound generation module 303 is configured to perform the following operations: for the pinyin of each character in the sample text, determine whether the pinyin includes a first pronunciation, where the first pronunciation satisfies: the pronunciation of a second pronunciation will be confused as the pronunciation of the first pronunciation; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.

In another embodiment of the device of this specification, the fuzzy sound generation module 303 is configured to perform the following operations: split the pinyin of each character in the sample text into an initial consonant and a final; for the split initial consonant, determine whether the initial consonant includes a first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; for the split final, determine whether the final includes a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the final of the character.

In one embodiment of the device of this specification, the fuzzy sound generation module 303 is configured to perform the following operations: for the split final, before determining whether the final includes a first pronunciation, determine whether the final includes a rhyme head, and if so, delete the rhyme head from the final; then determine whether the final, after the rhyme head is deleted, includes a first pronunciation.

In one embodiment of the device of this specification, the training execution module 304 is configured to perform the following operations: for each character in the sample text, generate a triplet corresponding to the character, the triplet including: the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; following the order of the characters in the sample text, input the triplet corresponding to each character, together with the label, into the fuzzy sound recognition model to be trained.

In one embodiment of the apparatus of this specification, the labels of the sample text include: labels given from at least one dimension among the emotion dimension, the domain dimension, the subject matter dimension, and the text meaning dimension.

In one embodiment of this specification, a device for understanding speech meaning is provided. Referring to Figure 4, the device includes: a speech recognition result receiving module 401 configured to obtain a first text, the first text being generated by performing speech recognition on speech; a character pinyin generation module 402 configured to obtain the pinyin of each character in the first text; a character fuzzy sound generation module 403 configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character; an input module 404 configured to input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the fuzzy sound recognition model; and a speech meaning understanding module 405 configured to understand the second text to obtain the meaning of the speech.

In the speech meaning understanding device of one embodiment of this specification, the character fuzzy sound generation module 403 is configured to perform the following operations: for the pinyin of each character in the first text, determine whether the pinyin includes a first pronunciation, where the first pronunciation satisfies: the pronunciation of a second pronunciation will be confused as the pronunciation of the first pronunciation; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.

In the speech meaning understanding device of another embodiment of this specification, the character fuzzy sound generation module 403 is configured to perform the following operations: split the pinyin of each character in the first text into an initial consonant and a final; for the split initial consonant, determine whether the initial consonant includes a first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; for the split final, determine whether the final includes a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the final of the character.

In the speech meaning understanding device of one embodiment of this specification, the character fuzzy sound generation module 403 is configured to perform the following operations: for the split final, before determining whether the final includes a first pronunciation, determine whether the final includes a rhyme head, and if so, delete the rhyme head from the final; then determine whether the final, after the rhyme head is deleted, includes a first pronunciation.

In the speech meaning understanding device of one embodiment of this specification, the input module 404 is configured to perform the following operations: for each character in the first text, generate a triplet corresponding to the character, the triplet including: the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; following the order of the characters in the first text, input the triplet corresponding to each character into the fuzzy sound recognition model in turn.

One embodiment of the present specification provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed in a computer, the computer is caused to execute the method in any embodiment of the specification.

One embodiment of this specification provides a computing device, including a memory and a processor. The memory stores executable code. When the processor executes the executable code, it implements the method in any embodiment of the specification.

It can be understood that the structures illustrated in the embodiments of this specification do not constitute specific limitations on the devices of the embodiments of this specification. In other embodiments of the specification, the above-mentioned device may include more or less components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently. The components illustrated may be implemented in hardware, software, or a combination of software and hardware.

The information interaction, execution processes, and the like between the above-mentioned devices and the modules in the system are based on the same concept as the method embodiments in this specification. For details, refer to the description of the method embodiments in this specification; they are not described again here.

Each embodiment in this specification is described in a progressive manner. For the same or similar parts among the embodiments, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding description of the method embodiments.

Those skilled in the art should realize that in one or more of the above examples, the functions described in this disclosure can be implemented using hardware, software, plugins, or any combination thereof. When implemented using software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above specific embodiments further describe the purpose, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the above are only specific embodiments of the present disclosure and are not intended to limit its scope of protection. Any modifications, equivalent substitutions, improvements, etc. made on the basis of the technical solution of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims

A training method for a fuzzy sound recognition model, including:

Obtain a semantic sample text including multiple characters;

For each character in the sample text, obtain the pinyin of the character;

According to the pinyin of each character in the sample text, the fuzzy sound corresponding to each character is obtained;

The fuzzy sound recognition model is trained using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
The method according to claim 1, wherein obtaining the fuzzy pronunciation corresponding to each character according to the pinyin of each character in the sample text includes:

For the pinyin of each character in the sample text, determine whether the pinyin includes the first pronunciation; the first pronunciation meets: the pronunciation of a second pronunciation will be confused as the pronunciation of the first pronunciation;

If not, use the pinyin directly as the fuzzy sound corresponding to the character;

If so, the first pronunciation in the pinyin is replaced with the second pronunciation, and the resulting pinyin is used as the fuzzy pronunciation corresponding to the character.
The method according to claim 1, wherein obtaining the fuzzy pronunciation corresponding to each character according to the pinyin of each character in the sample text includes:

Split the pinyin of each character in the sample text into initial consonants and finals;

For the split initial consonant, determine whether the initial consonant includes the first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character;

For the split final, determine whether the final includes the first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the final of the character.
The method according to claim 3, wherein, for the split final, before determining whether the final includes the first pronunciation, the method further comprises: determining whether the final includes a rhyme head, and if so, deleting the rhyme head from the final;

Determining whether the final includes the first pronunciation includes: determining whether the final after the rhyme head is deleted includes the first pronunciation.
The method according to claim 3, wherein training the fuzzy sound recognition model includes:

For each character in the sample text, a triplet corresponding to the character is generated. The triplet includes: the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character;

According to the order of each character in the sample text, the triplet corresponding to each character and the label are sequentially input into the fuzzy sound recognition model to be trained.
The method according to claim 1, wherein the label of the sample text includes: a label given from at least one of the emotion dimension, the domain dimension, the subject dimension, and the text meaning dimension.
A method of understanding the meaning of speech, including:

Obtain the first text; the first text is generated after performing speech recognition on the speech;

For each character in the first text, obtain the pinyin of the character;

According to the pinyin of each character, the fuzzy sound corresponding to each character is obtained;

Input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model, and obtain the second text output by the fuzzy sound recognition model;

The second text is understood to obtain the meaning of the speech.
The method according to claim 7, wherein obtaining the fuzzy pronunciation corresponding to each character according to the pinyin of each character in the first text includes:

For the pinyin of each character in the first text, determine whether the pinyin includes the first pronunciation; the first pronunciation meets: the pronunciation of a second pronunciation will be confused as the pronunciation of the first pronunciation;

If not, use the pinyin directly as the fuzzy sound corresponding to the character;

If so, the first pronunciation in the pinyin is replaced with the second pronunciation, and the resulting pinyin is used as the fuzzy pronunciation corresponding to the character.
The method according to claim 7, wherein obtaining the fuzzy pronunciation corresponding to each character according to the pinyin of each character includes:

Split the pinyin of each character in the first text into initial consonants and finals;

For the split initial consonant, determine whether the initial consonant includes the first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character;

For the split final, determine whether the final includes the first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation to obtain the fuzzy sound corresponding to the final of the character.
The method according to claim 9, wherein, for the split final, before determining whether the final includes the first pronunciation, the method further comprises: determining whether the final includes a rhyme head, and if so, deleting the rhyme head from the final;

Determining whether the final includes the first pronunciation includes: determining whether the final after the rhyme head is deleted includes the first pronunciation.
The method according to claim 9, wherein said inputting the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model includes:

For each character in the first text, a triplet corresponding to the character is generated. The triplet includes: the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character;

According to the order of each character in the first text, the triplet corresponding to each character is input into the fuzzy sound recognition model in turn.
A training device for fuzzy sound recognition model, including:

A sample text acquisition module configured to obtain a semantic sample text including multiple characters;

The pinyin acquisition module is configured to obtain the pinyin of each character in the sample text;

The fuzzy sound generation module is configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character in the sample text;

The training execution module is configured to use the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text to train the fuzzy sound recognition model.
A device for understanding speech meaning, including:

A speech recognition result receiving module configured to obtain a first text; the first text is generated after speech recognition of speech;

The character pinyin generation module is configured to obtain the pinyin of each character in the first text;

The character fuzzy sound generation module is configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character;

The input module is configured to input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model, and obtain the second text output by the fuzzy sound recognition model;

The speech meaning understanding module is configured to understand the second text and obtain the meaning of the speech.
A computer-readable storage medium on which a computer program is stored. When the computer program is executed in a computer, the computer is caused to execute the method described in any one of claims 1-11.
A computing device includes a memory and a processor. The memory stores executable code. When the processor executes the executable code, the method according to any one of claims 1-11 is implemented.