WO2023226767A1 - Model training method and apparatus, and speech meaning understanding method and apparatus - Google Patents

Model training method and apparatus, and speech meaning understanding method and apparatus

Info

Publication number
WO2023226767A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
pronunciation
text
pinyin
fuzzy
Prior art date
Application number
PCT/CN2023/093289
Other languages
French (fr)
Chinese (zh)
Inventor
薛兰青
应缜哲
林金镇
吴晓烽
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司 filed Critical 支付宝(杭州)信息技术有限公司
Publication of WO2023226767A1 publication Critical patent/WO2023226767A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0631Creating reference templates; Clustering

Definitions

  • the present application relates to electronic information technology, and in particular to methods and devices for training fuzzy sound recognition models, and methods and devices for understanding speech meaning.
  • speech recognition technology is widely used.
  • In speech recognition technology, the speech spoken by the user is usually first recognized and converted into text, and the meaning of the text is then understood to obtain the meaning of the speech and perform related processing.
  • One or more embodiments of this specification describe methods and devices for training fuzzy sound recognition models and methods and devices for understanding speech meaning, which can more accurately understand the meaning of speech.
  • A training method for a fuzzy sound recognition model, including: obtaining a sample text with semantics that includes multiple characters; for each character in the sample text, obtaining the pinyin of the character; obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text; and training the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
  • Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text includes: judging whether the pinyin of each character in the sample text includes a first pronunciation, the first pronunciation being one that a second pronunciation is liable to be confused with; if not, using the pinyin directly as the fuzzy sound corresponding to the character; if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after replacement as the fuzzy sound corresponding to the character.
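  • Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text includes: splitting the pinyin of each character in the sample text into an initial consonant and a final; for the split initial consonant, judging whether it includes the first pronunciation; if not, using the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replacing the first pronunciation in the initial consonant with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; for the split final, judging whether it includes the first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the final of the character.
  • Before determining whether the final includes the first pronunciation, the method further includes: determining whether the final includes a rhyme head (medial) and a rhyme tail, and if so, deleting the rhyme head from the final; determining whether the final includes the first pronunciation then includes: determining whether the final with the rhyme head deleted includes the first pronunciation.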
  • Training the fuzzy sound recognition model includes: for each character in the sample text, generating a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; and, following the order of the characters in the sample text, inputting the triplet corresponding to each character, together with the label, into the fuzzy sound recognition model to be trained.
  • the labels of the sample text include: labels given from at least one dimension among the emotion dimension, the domain dimension, the subject matter dimension, and the text meaning dimension.
  • A method for understanding the meaning of speech, including: obtaining a first text, the first text being generated by speech recognition of the speech; for each character in the first text, obtaining the pinyin of the character; obtaining the fuzzy sound corresponding to each character according to the pinyin of each character; inputting the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model, and obtaining a second text output by the fuzzy sound recognition model; and understanding the second text to obtain the meaning of the speech.
  • Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the first text includes: judging whether the pinyin of each character in the first text includes a first pronunciation, the first pronunciation being one that a second pronunciation is liable to be confused with; if not, using the pinyin directly as the fuzzy sound corresponding to the character; if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after replacement as the fuzzy sound corresponding to the character.
  • Obtaining the fuzzy sound corresponding to each character based on the pinyin of each character includes: splitting the pinyin of each character in the first text into an initial consonant and a final; for the split initial consonant, judging whether it includes the first pronunciation; if not, using the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replacing the first pronunciation in the initial consonant with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; for the split final, judging whether it includes the first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the final of the character.
  • Before determining whether the final includes the first pronunciation, the method further includes: determining whether the final includes a rhyme head (medial) and a rhyme tail, and if so, deleting the rhyme head from the final; determining whether the final includes the first pronunciation then includes: determining whether the final with the rhyme head deleted includes the first pronunciation.
  • Inputting the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model includes: for each character in the first text, generating a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; and, following the order of the characters in the first text, inputting the triplet corresponding to each character into the fuzzy sound recognition model in turn.
  • A training device for a fuzzy sound recognition model, including: a sample text acquisition module configured to obtain a sample text with semantics that includes multiple characters; a pinyin acquisition module configured to obtain, for each character in the sample text, the pinyin of the character; a fuzzy sound generation module configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character in the sample text; and a training execution module configured to train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
  • A device for understanding speech meaning, including: a speech recognition result receiving module configured to obtain a first text, the first text being generated by speech recognition of the speech; a character pinyin generation module configured to obtain the pinyin of each character in the first text; a character fuzzy sound generation module configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character; an input module configured to input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the fuzzy sound recognition model; and a speech meaning understanding module configured to understand the second text and obtain the meaning of the speech.
  • A computing device including a memory and a processor. The memory stores executable code, and when the processor executes the executable code, it implements the method described in any embodiment of this specification.
  • the embodiments of this specification can train a fuzzy sound recognition model that can correct text errors in speech recognition, and based on this model, the meaning of speech can be more accurately understood.
  • Figure 1 is a flow chart of a training method for a fuzzy sound recognition model in one embodiment of this specification.
  • Figure 2 is a flow chart of a method for understanding speech meaning in one embodiment of this specification.
  • Figure 3 is a schematic structural diagram of a training device for a fuzzy sound recognition model in one embodiment of this specification.
  • Figure 4 is a schematic structural diagram of a device for understanding speech meaning in an embodiment of this specification.
  • Recognition errors often occur when converting speech to text, and the meaning of the speech cannot be accurately understood from the wrong text. For example, a machine-implemented intelligent customer service system asks the user: "Are you buying physical or virtual items?" The user answers by voice and means to say "physical items", but because the user speaks a dialect, speech recognition goes wrong and the recognized text is "four or five". The intelligent customer service system then cannot understand the meaning of the user's speech from the incorrect text, resulting in business errors.
  • A fuzzy sound recognition model can be pre-trained. The model can then be used to correct text that speech recognition got wrong, so that the meaning of the speech is understood more accurately.
  • The methods in the embodiments of this specification can be applied to various speech recognition application scenarios, for example the following scenarios one to three.
  • Scenario one: an intelligent customer service system (such as the machine customer service of the Alipay platform) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the method for understanding speech meaning provided by the embodiments of this specification, the system can correct errors in the recognized text and obtain text that better matches the user's intention, so that the machine can more correctly understand the meaning of the user's speech, such as whether the user is purchasing physical or virtual items, or needs a return or exchange.
  • Scenario two: an artificial intelligence system (such as a robot) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the method for understanding speech meaning provided by the embodiments of this specification, the system can correct errors in the recognized text and obtain text that better matches the user's intention, so that the machine can more correctly understand the meaning of the user's speech, such as a command for the robot to change its walking route.
  • Scenario three: a smart home system (such as a smart TV) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the method for understanding speech meaning provided by the embodiments of this specification, the system can correct errors in the recognized text and obtain text that better matches the user's intention, so that the machine can more correctly understand the meaning of the user's speech, such as a command for the smart TV to start recording a TV show at a certain time.
  • the first aspect describes the training method of the fuzzy sound recognition model
  • the second aspect describes the method of understanding the meaning of speech.
  • Figure 1 is a flow chart of a training method for a fuzzy sound recognition model in one embodiment of this specification.
  • The execution subject of this method is the training device of the fuzzy sound recognition model. It can be understood that this method can also be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities.
  • The method includes: Step 101: obtain a sample text with semantics that includes multiple characters; Step 103: for each character in the sample text, obtain the pinyin of the character; Step 105: according to the pinyin of each character in the sample text, obtain the fuzzy sound corresponding to each character; Step 107: use the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text to train the fuzzy sound recognition model.
  • For step 101: obtain a sample text with semantics that includes multiple characters.
  • Training the fuzzy sound recognition model requires sample text.
  • the sample text can be any type of semantic text, such as an article, a user complaint text, a product description text, etc.
  • the sample text should include at least one character formed by non-standard pronunciation (user's accent or dialect).
  • For example, the sample text includes "…之行车…, …发路". Because of the user's accent, the "zi" of 自行车 (bicycle) is pronounced "zhi", so "之" (zhi) is a character formed by non-standard pronunciation. Likewise, the "nu" of 发怒 (to get angry) is pronounced "lu", so "路" (lu, "road") is a character formed by non-standard pronunciation.
  • the characters may include: at least one of Chinese characters, English letters, and punctuation marks.
  • the sample text has a label, which may be a label given from at least one dimension among the emotion dimension, the domain dimension, the subject matter dimension, and the text meaning dimension.
  • For example, the label indicates that the emotion expressed in the sample text is anger, that the sample text belongs to the field of user complaints, or that the meaning of the sample text is that the user purchased physical items. Based on the label, the fuzzy sound recognition model can judge whether each character of the sample text and its fuzzy sound have been learned correctly.
  • For step 103: for each character in the sample text, obtain the pinyin of the character.
  • The standard pinyin of each character can be obtained from a dictionary.
  • For example, the sample text includes "…之行车…, …发路".
  • For the character "之", the pinyin obtained is "zhi"; for the character "行", the pinyin obtained is "xing"; for the character "车", the pinyin obtained is "che"; for the character "路", the pinyin obtained is "lu".
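The dictionary lookup of step 103 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `PINYIN_DICT` and `get_pinyin` are hypothetical names, the five entries are only our reading of the example characters (pinyin "zhi", "xing", "che", "fa", "lu"), and tone marks are omitted as an assumption.

```python
from typing import List, Optional, Tuple

# Hypothetical miniature dictionary: character -> standard pinyin (tones omitted).
# A real system would load a full pronunciation dictionary instead.
PINYIN_DICT = {"之": "zhi", "行": "xing", "车": "che", "发": "fa", "路": "lu"}


def get_pinyin(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (character, pinyin) pairs in text order.

    Characters without a dictionary entry (punctuation, letters) map to None.
    """
    return [(ch, PINYIN_DICT.get(ch)) for ch in text]


pairs = get_pinyin("之行车,发路")
```

Returning `None` for characters without an entry keeps punctuation and English letters in the sequence, which the text allows as characters of the sample text.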
  • For step 105: according to the pinyin of each character in the sample text, obtain the fuzzy sound corresponding to each character.
  • The embodiments of this specification introduce the concept of a fuzzy sound.
  • The fuzzy sound corresponding to the pinyin of a character satisfies the following: when the pinyin does not include an easily confused first pronunciation, the fuzzy sound is the same as the pinyin; when the pinyin includes a first pronunciation, the pinyin and the fuzzy sound are pronunciations that get confused with each other. In this way, whatever accent or pronunciation the user has, two characters that are confused because of different pronunciations are unified into the same fuzzy sound, which lets the fuzzy sound recognition model learn the pronunciation of confusable characters.
  • For example, the pronunciations of "z, c, s" are confused with the pronunciations of "zh, ch, sh" respectively, the pronunciation of "ing" is confused with the pronunciation of "in", the pronunciation of "f" is confused with the pronunciation of "h", and so on. Therefore, the correspondence between first pronunciations and second pronunciations that are likely to be confused with each other can be set in advance, to make clear which pairs of pronunciations are confusable. For example, the correspondence between the first pronunciation and the second pronunciation is recorded in Table 1 below.
  • The first pronunciation is usually the pronunciation used by users with accents or dialects, and the second pronunciation is the original pronunciation of the character. It can be understood that Table 1 is only schematic. In actual services, different correspondences between first and second pronunciations can be set for different application locations, that is, for the different accent characteristics of users. Once a correspondence such as the one in Table 1 has been set, it can be used to replace the pinyin of the characters in the sample text with fuzzy sounds.
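One way to hold such a correspondence is a plain mapping from first pronunciation to second pronunciation, with one mapping per application location. The sketch below is illustrative only: Table 1 is not reproduced here, so every entry, the region names, and the helper names `validate_table` and `table_for` are assumptions, including the direction of each pair.

```python
from typing import Dict

# Hypothetical stand-in for Table 1:
# first (accented) pronunciation -> second (original) pronunciation.
DEFAULT_TABLE: Dict[str, str] = {"zh": "z", "l": "n"}

# Hypothetical per-location tables reflecting different accent characteristics.
REGION_TABLES: Dict[str, Dict[str, str]] = {
    "region_a": {"zh": "z", "l": "n"},
    "region_b": {"h": "f"},
}


def validate_table(table: Dict[str, str]) -> None:
    """Reject tables in which replacement would not settle on one fuzzy sound."""
    for first, second in table.items():
        if first == second:
            raise ValueError(f"{first!r} maps to itself")
        if second in table:
            # e.g. {"l": "n", "n": "l"} would just swap the two pronunciations.
            raise ValueError(f"{second!r} is both a first and a second pronunciation")


def table_for(region: str) -> Dict[str, str]:
    """Pick the table for a location, falling back to the default table."""
    table = REGION_TABLES.get(region, DEFAULT_TABLE)
    validate_table(table)
    return table
```

The validation reflects the goal stated above: both members of a confusable pair should collapse to the same fuzzy sound, which requires that no second pronunciation is itself replaced again.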
  • Step 105 can be implemented in the following two ways. Method 1: perform fuzzy sound replacement with a whole pinyin as the unit. Method 2: perform fuzzy sound replacement with the initial consonant and the final as separate units.
  • Method 1 is described first. Step 1051A: for the pinyin of each character in the sample text, determine whether the pinyin includes a first pronunciation, the first pronunciation being one that a second pronunciation is liable to be confused with. If not, execute step 1053A; if so, execute step 1055A.
  • Step 1053A Use the pinyin directly as the fuzzy sound corresponding to the character.
  • Step 1055A Replace the first pronunciation in the pinyin with the second pronunciation, and the resulting pinyin after replacement is used as the fuzzy pronunciation corresponding to the character.
  • For example, in step 1051A, for the characters of the sample text "…之行车…, …发路", the pinyin obtained include "zhi", "xing", "che", and "lu".
  • The fuzzy sounds obtained for the characters are then: "…zi xing che…, …fa nu".
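Steps 1051A to 1055A can be sketched as below. This is a simplified illustration under stated assumptions: `FIRST_TO_SECOND` is a hypothetical two-entry table chosen only so that the output matches the example above, and the check "includes the first pronunciation" is narrowed to "starts with it", which avoids accidental matches inside a syllable (for instance the "h" inside "che").

```python
from typing import Dict, List

# Hypothetical correspondence: first pronunciation -> second pronunciation.
FIRST_TO_SECOND: Dict[str, str] = {"zh": "z", "l": "n"}


def fuzzy_pinyin(pinyin: str, table: Dict[str, str] = FIRST_TO_SECOND) -> str:
    """Method 1: treat the whole pinyin as the unit of replacement."""
    # Step 1051A: test longer first pronunciations before shorter ones.
    for first in sorted(table, key=len, reverse=True):
        if pinyin.startswith(first):
            # Step 1055A: replace the first pronunciation with the second one.
            return table[first] + pinyin[len(first):]
    # Step 1053A: no first pronunciation, so the pinyin is its own fuzzy sound.
    return pinyin


def fuzzy_sequence(pinyins: List[str]) -> List[str]:
    """Apply Method 1 to every pinyin of a text, preserving order."""
    return [fuzzy_pinyin(p) for p in pinyins]


fuzzy = fuzzy_sequence(["zhi", "xing", "che", "fa", "lu"])
```

With this table the five example syllables become "zi", "xing", "che", "fa", "nu"; a different table would of course give different fuzzy sounds.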
  • Method 2 is described below. To improve training efficiency and reduce training difficulty, one embodiment of this specification adopts Method 2: split the pinyin into an initial consonant and a final, determine separately whether the initial consonant and the final include a first pronunciation, and then perform fuzzy sound replacement on each.
  • In this case, step 105 specifically includes: Step 1051B: split the pinyin of each character in the sample text into an initial consonant and a final; Step 1053B: for the split initial consonant, determine whether it includes a first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; Step 1055B: for the split final, determine whether it includes a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the final of the character.
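A rough sketch of steps 1051B to 1055B follows. The names `split_pinyin` and `fuzzy_parts` and both replacement tables are assumptions made for illustration (the direction "in" to "ing" in particular is a guess), "y" and "w" are treated as initials for simplicity, and the test "includes the first pronunciation" is simplified to an exact match on the whole initial or final.

```python
from typing import Dict, Tuple

# Pinyin initials, two-letter ones first so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

# Hypothetical correspondences, kept separately for initials and finals.
INITIAL_MAP: Dict[str, str] = {"zh": "z", "l": "n"}
FINAL_MAP: Dict[str, str] = {"in": "ing"}


def split_pinyin(pinyin: str) -> Tuple[str, str]:
    """Step 1051B: split a toneless pinyin into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    return "", pinyin  # zero-initial syllable such as "an"


def fuzzy_parts(pinyin: str) -> Tuple[str, str]:
    """Steps 1053B and 1055B: replace initial and final independently."""
    initial, final = split_pinyin(pinyin)
    return INITIAL_MAP.get(initial, initial), FINAL_MAP.get(final, final)
```

Because initials and finals are looked up independently, the tables stay small: each entry covers every syllable that shares that initial or final.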
  • To further improve training efficiency and reduce training difficulty, the final can be simplified further, that is, the rhyme head (medial) in the final can be deleted.
  • For example, in the final "uang", the rhyme head "u" contributes relatively little to the pronunciation of the pinyin "guang", while "ang" contributes relatively much and is the key part of the final's pronunciation. Therefore, to improve efficiency, the influence of the rhyme head on pronunciation can be ignored during training.
  • Accordingly, in step 1055B, before judging whether the final includes a first pronunciation, the method further includes: judging whether the final includes a rhyme head and a rhyme tail, and if so, deleting the rhyme head from the final. Step 1055B then determines whether the final with the rhyme head deleted includes a first pronunciation; if not, the final with the rhyme head deleted is used as the fuzzy sound corresponding to the final of the character; if so, the first pronunciation in the final with the rhyme head deleted is replaced with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the final of the character.
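The simplification of the final might look like the sketch below. It rests on our reading of the translated text, namely that the leading medial is dropped only when the final has both a rhyme head and a rhyme tail; `strip_medial` is a hypothetical name, and writing "ü" as "v" is an assumed ASCII convention.

```python
# Letters that can act as a rhyme head (medial); "v" stands in for "ü".
MEDIALS = ("i", "u", "v")
# Main vowels that may follow a medial in written pinyin.
NUCLEI = ("a", "o", "e")


def strip_medial(final: str) -> str:
    """Drop the rhyme head when the final has both a rhyme head and a rhyme tail."""
    has_medial = len(final) >= 2 and final[0] in MEDIALS and final[1] in NUCLEI
    has_tail = len(final) >= 3  # something follows the main vowel
    if has_medial and has_tail:
        return final[1:]
    return final


simplified = strip_medial("uang")
```

In "ing" the "i" is the main vowel rather than a medial, so the final is left unchanged; "ua" has a medial but no tail, so under this reading it is left unchanged as well.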
  • step 107 use the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text to train the fuzzy sound recognition model.
  • When step 105 adopts Method 1 above, the information input to the fuzzy sound recognition model in step 107 includes: "…之(zi)行(xing)车(che)…, …发(fa)路(nu)" and the label of the sample text, such as "traffic accident dispute".
  • When step 105 adopts Method 2 above, the specific implementation of step 107 includes: Step 1071: for each character in the sample text, generate a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; Step 1073: following the order of the characters in the sample text, input the triplet corresponding to each character, together with the label, into the fuzzy sound recognition model to be trained.
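Steps 1071 and 1073 can be illustrated by assembling one training example as an ordered list of triplets plus a label. The sketch below stops short of the model itself, whose architecture the text does not specify. The mini-dictionary, the replacement tables, and the names `to_triplet` and `build_training_example` are all hypothetical.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (character, fuzzy initial, fuzzy final)

# Hypothetical mini-dictionary and correspondence tables for illustration.
PINYIN: Dict[str, str] = {"之": "zhi", "行": "xing", "车": "che",
                          "发": "fa", "路": "lu"}
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
INITIAL_MAP: Dict[str, str] = {"zh": "z", "l": "n"}
FINAL_MAP: Dict[str, str] = {}


def to_triplet(ch: str) -> Triplet:
    """Step 1071: build the triplet for one character."""
    pinyin = PINYIN.get(ch, "")  # characters without pinyin get empty sounds
    initial = next((i for i in INITIALS if pinyin.startswith(i)), "")
    final = pinyin[len(initial):]
    return ch, INITIAL_MAP.get(initial, initial), FINAL_MAP.get(final, final)


def build_training_example(text: str, label: str) -> Dict[str, object]:
    """Step 1073: keep triplets in text order and attach the sample label."""
    triplets: List[Triplet] = [to_triplet(ch) for ch in text]
    return {"triplets": triplets, "label": label}


example = build_training_example("之行车", "traffic accident dispute")
```

Keeping the original character in each triplet lets the model see both what was recognized and the accent-neutral sound, which is what allows it to propose a different character with the same fuzzy sound.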
  • The fuzzy sound recognition model can correct the sample text based on the input information, for example correcting it to "…自行车…, …发怒" ("…bicycle…, …angry"), and thereby obtain the correct meaning.
  • The fuzzy sound recognition model is trained for multiple rounds using multiple sample texts. Each round follows the training process of the above embodiment, until the fuzzy sound recognition model converges.
  • After the fuzzy sound recognition model has been trained, it can be used to understand the meaning of speech.
  • Figure 2 is a flow chart of a method for understanding speech meaning in one embodiment of this specification.
  • The execution subject of this method is a device for understanding speech meaning. It can be understood that this method can also be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to Figure 2, the method includes steps 201 to 209.
  • Step 201 Obtain the first text; the first text is generated after speech recognition.
  • Step 203 For each character in the first text, obtain the pinyin of the character.
  • Step 205 According to the pinyin of each character in the first text, obtain the fuzzy pronunciation corresponding to each character in the first text.
  • Step 207 Input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the fuzzy sound recognition model;
  • Step 209 Understand the second text and obtain the speech meaning.
  • The fuzzy sound can be used to unify two characters that are confused because of different pronunciations into the same fuzzy sound.
  • By combining the fuzzy sounds with the context of the first text, the fuzzy sound recognition model can obtain the characters the user actually intended, that is, a second text that reflects the real semantics. This corrects the speech recognition errors in the first text, and the second text allows the machine to understand the meaning of the user's speech more accurately.
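The flow of steps 201 to 209 can be sketched as a small pipeline in which the trained model and the downstream understanding step are passed in as callables. Everything below is a toy illustration: `understand_speech` and the three `toy_*` stand-ins are hypothetical, and the lookup-table "model" merely mimics what a trained fuzzy sound recognition model might output for one input.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (character, fuzzy initial, fuzzy final)


def understand_speech(first_text: str,
                      to_triplet: Callable[[str], Triplet],
                      correction_model: Callable[[List[Triplet]], str],
                      understand: Callable[[str], str]) -> str:
    """Steps 203-209, starting from the first text obtained in step 201."""
    triplets = [to_triplet(ch) for ch in first_text]  # steps 203 and 205
    second_text = correction_model(triplets)          # step 207
    return understand(second_text)                    # step 209


# Toy stand-ins (hypothetical) so that the pipeline can be exercised.
def toy_to_triplet(ch: str) -> Triplet:
    sounds = {"之": ("z", "i"), "行": ("x", "ing"), "车": ("ch", "e")}
    initial, final = sounds.get(ch, ("", ""))
    return ch, initial, final


def toy_model(triplets: List[Triplet]) -> str:
    key = " ".join(initial + final for _, initial, final in triplets)
    fallback = "".join(ch for ch, _, _ in triplets)
    return {"zi xing che": "自行车"}.get(key, fallback)


def toy_understand(text: str) -> str:
    return "bicycle" if text == "自行车" else "unknown"


meaning = understand_speech("之行车", toy_to_triplet, toy_model, toy_understand)
```

Passing the model and the understanding step as parameters keeps the pipeline independent of how either is implemented, which matches the text's separation between correcting the first text and understanding the second text.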
  • step 201 obtain the first text; the first text is generated after speech recognition of speech.
  • the first text is generated after speech recognition of the user's voice in an actual application scenario.
  • the user inputs a piece of speech to the intelligent customer service system, and the speech recognition system recognizes the speech, thereby obtaining the first text.
  • step 203 for each character in the first text, obtain the pinyin of the character.
  • step 205 According to the pinyin of each character in the first text, obtain the fuzzy pronunciation corresponding to each character in the first text.
  • This step 205 can also be implemented using the above-mentioned method 1 and method 2.
  • When Method 1 is used, the implementation of step 205 includes: for the pinyin of each character in the first text, determine whether the pinyin includes a first pronunciation, the first pronunciation being one that a second pronunciation is liable to be confused with; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the pinyin obtained after replacement as the fuzzy sound corresponding to the character.
  • When Method 2 is used, the implementation of step 205 includes: split the pinyin of each character in the first text into an initial consonant and a final; for the split initial consonant, determine whether it includes a first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; for the split final, determine whether it includes a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the final of the character.
  • When Method 2 is used, for the split final, before determining whether the final includes a first pronunciation, the method further includes: determining whether the final includes a rhyme head (medial) and a rhyme tail, and if so, deleting the rhyme head from the final. Determining whether the final includes a first pronunciation then includes: determining whether the final with the rhyme head deleted includes a first pronunciation.
  • For the specific implementation of step 205, refer to the description of step 105 above; the processing is the same.
  • For step 207: input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the fuzzy sound recognition model. When step 205 is implemented using Method 2, step 207 includes: for each character in the first text, generate a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; then, following the order of the characters in the first text, input the triplet corresponding to each character into the fuzzy sound recognition model.
  • For further description of step 207, refer to the description of step 107 above; the processing is the same.
  • A training device for a fuzzy sound recognition model is provided. Referring to Figure 3, it includes: a sample text acquisition module 301 configured to obtain a sample text with semantics that includes multiple characters;
  • the pinyin acquisition module 302 is configured to obtain the pinyin of each character in the sample text;
  • the fuzzy sound generation module 303 is configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character in the sample text;
  • the training execution module 304 is configured to use the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text to train the fuzzy sound recognition model.
  • In one embodiment, the fuzzy sound generation module 303 is configured to perform the following operations: for the pinyin of each character in the sample text, determine whether the pinyin includes a first pronunciation, the first pronunciation being one that a second pronunciation is liable to be confused with; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the pinyin obtained after replacement as the fuzzy sound corresponding to the character.
  • In one embodiment, the fuzzy sound generation module 303 is configured to perform the following operations: split the pinyin of each character in the sample text into an initial consonant and a final; for the split initial consonant, determine whether it includes a first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the initial consonant of the character; for the split final, determine whether it includes a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the final of the character.
  • In one embodiment, the fuzzy sound generation module 303 is configured to perform the following operations: for the split final, before determining whether the final includes a first pronunciation, determine whether the final includes a rhyme head (medial) and a rhyme tail; if so, delete the rhyme head from the final; then determine whether the final with the rhyme head deleted includes a first pronunciation.
  • In one embodiment, the training execution module 304 is configured to: for each character in the sample text, generate a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; and, following the order of the characters in the sample text, input the triplet corresponding to each character, together with the label, into the fuzzy sound recognition model to be trained.
  • the labels of the sample text include: labels given from at least one dimension among the emotion dimension, the domain dimension, the subject matter dimension, and the text meaning dimension.
  • A device for understanding speech meaning includes: a speech recognition result receiving module 401, configured to obtain a first text, the first text being generated by performing speech recognition on speech; a character pinyin generation module 402, configured to obtain the pinyin of each character in the first text; a character fuzzy sound generation module 403, configured to obtain the fuzzy sound corresponding to each character based on the pinyin of each character; an input module 404, configured to input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the model; and a speech meaning understanding module 405, configured to interpret the second text to obtain the meaning of the speech.
  • The character fuzzy sound generation module 403 is configured to perform the following operations: for the pinyin of each character in the first text, determine whether the pinyin includes a first pronunciation, where the first pronunciation satisfies the condition that a second pronunciation is liable to be confused as (mispronounced as) the first pronunciation; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.
  • The character fuzzy sound generation module 403 is configured to perform the following operations: split the pinyin of each character in the first text into an initial consonant and a final; for the initial consonant obtained by the split, determine whether it includes a first pronunciation; if not, use the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replace the first pronunciation in the initial consonant with the second pronunciation corresponding to that first pronunciation, to obtain the fuzzy sound corresponding to the initial consonant of the character; for the final obtained by the split, determine whether it includes a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the final of the character; if so, replace the first pronunciation in the final with the second pronunciation corresponding to that first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.
  • The character fuzzy sound generation module 403 is configured to perform the following operations: for the final obtained by the split, before determining whether the final includes a first pronunciation, determine whether the final includes both a medial (韵头) and a coda (韵尾); if so, delete the medial from the final, and then determine whether the final with the medial removed includes a first pronunciation.
  • The input module 404 is configured to perform the following operations: for each character in the first text, generate a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; then, in the order of the characters in the first text, input the triplets corresponding to the characters into the fuzzy sound recognition model one after another.
  • One embodiment of this specification provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed in a computer, the computer is caused to execute the method of any embodiment of this specification.
  • One embodiment of this specification provides a computing device, including a memory and a processor.
  • The memory stores executable code.
  • When the processor executes the executable code, it implements the method of any embodiment of this specification.
  • the structures illustrated in the embodiments of this specification do not constitute specific limitations on the devices of the embodiments of this specification.
  • The above-mentioned device may include more or fewer components than shown in the figures, or some components may be combined, some components may be split, or the components may be arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.

Abstract

A fuzzy sound recognition model training method and apparatus, a speech meaning understanding method and apparatus, a computing device, and a computer readable storage medium. The fuzzy sound recognition model training method comprises: obtaining a sample text with semantic meaning comprising a plurality of characters (101); for each character in the sample text, acquiring pinyin of the character (103); on the basis of the pinyin of each character in the sample text, obtaining a fuzzy sound corresponding to each character (105); and, using the sample text, the fuzzy sound corresponding to each character in the sample text, and labels of the sample text, training a fuzzy sound recognition model (107). The present method enables speech meaning to be understood more accurately.

Description

Model training method and apparatus, and speech meaning understanding method and apparatus
Technical field
The present application relates to electronic information technology, and in particular to a training method and device for a fuzzy sound recognition model and a method and device for understanding speech meaning.
Background
Speech recognition technology is now widely used. An application of speech recognition typically first recognizes the speech spoken by the user, converting it into text, and then interprets the text to obtain the meaning of the speech and perform the related processing.
However, current speech recognition technology is not yet mature. Recognition errors often occur when speech is converted into text, and the meaning of the speech cannot be understood accurately from erroneous text.
Summary of the invention
One or more embodiments of this specification describe a training method and device for a fuzzy sound recognition model and a method and device for understanding speech meaning, which make it possible to understand the meaning of speech more accurately.
According to a first aspect, a training method for a fuzzy sound recognition model is provided, including: obtaining a sample text that has semantic meaning and includes multiple characters; for each character in the sample text, obtaining the pinyin of the character; obtaining, according to the pinyin of each character in the sample text, the fuzzy sound corresponding to each character; and training the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text includes: for the pinyin of each character in the sample text, determining whether the pinyin includes a first pronunciation, where the first pronunciation satisfies the condition that a second pronunciation is liable to be confused as (mispronounced as) the first pronunciation; if not, using the pinyin directly as the fuzzy sound corresponding to the character; if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.
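The replacement rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of this specification: the tables `INITIAL_PAIRS` and `FINAL_PAIRS` and the function `fuzzy_pinyin` are hypothetical names, the confusion pairs are limited to those suggested by the examples in this description (the actual correspondence table may differ), and pinyin is assumed to be written without tone marks.

```python
# Hypothetical confusion tables: first pronunciation -> second pronunciation.
# The pairs are inferred from the examples in this description and are
# assumptions; the real correspondence table may contain other entries.
INITIAL_PAIRS = {"zh": "z", "ch": "c", "sh": "s", "h": "f", "l": "n"}
FINAL_PAIRS = {"in": "ing"}


def fuzzy_pinyin(pinyin: str) -> str:
    """Return the fuzzy sound of one toneless pinyin syllable."""
    # Check the start of the syllable; longer keys first so "zh" beats "h".
    for first in sorted(INITIAL_PAIRS, key=len, reverse=True):
        if pinyin.startswith(first):
            pinyin = INITIAL_PAIRS[first] + pinyin[len(first):]
            break
    # Check the end of the syllable for a confusable final.
    for first, second in FINAL_PAIRS.items():
        if pinyin.endswith(first):
            pinyin = pinyin[: -len(first)] + second
            break
    return pinyin


# The standard reading "zi" and the accented reading "zhi" collapse
# to the same fuzzy sound.
print(fuzzy_pinyin("zhi"), fuzzy_pinyin("zi"))  # prints: zi zi
```

Because both readings map to one fuzzy sound, a model that sees the fuzzy sound alongside the character no longer has to tell the two pronunciations apart.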
Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the sample text includes: splitting the pinyin of each character in the sample text into an initial consonant and a final; for the initial consonant obtained by the split, determining whether it includes a first pronunciation; if not, using the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replacing the first pronunciation in the initial consonant with the second pronunciation corresponding to that first pronunciation, to obtain the fuzzy sound corresponding to the initial consonant of the character; for the final obtained by the split, determining whether it includes a first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the second pronunciation corresponding to that first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.
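The split-then-replace variant above might look like the following Python sketch. All names (`INITIALS`, `split_pinyin`, `fuzzy_initial_and_final`) and the confusion pairs are assumptions made for illustration; in particular, treating "y" and "w" as initials is a simplification chosen here, and the pinyin is assumed to carry no tone marks.

```python
from typing import Tuple

# Pinyin initials, with two-letter initials first so they are matched first.
# Treating "y" and "w" as initials is a simplifying assumption of this sketch.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Hypothetical confusion tables: first pronunciation -> second pronunciation.
INITIAL_PAIRS = {"zh": "z", "ch": "c", "sh": "s", "h": "f", "l": "n"}
FINAL_PAIRS = {"in": "ing"}


def split_pinyin(pinyin: str) -> Tuple[str, str]:
    """Split a toneless syllable into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    return "", pinyin  # zero-initial syllable such as "an"


def fuzzy_initial_and_final(pinyin: str) -> Tuple[str, str]:
    """Return the fuzzy sounds of the initial and of the final."""
    initial, final = split_pinyin(pinyin)
    # A part that is not a first pronunciation is kept unchanged.
    return (INITIAL_PAIRS.get(initial, initial),
            FINAL_PAIRS.get(final, final))
```

Handling the two parts separately lets one syllable carry an initial confusion and a final confusion at once, for example `fuzzy_initial_and_final("shin")` would replace both parts under the assumed tables.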
For the final obtained by the split, before determining whether the final includes a first pronunciation, the method further includes: determining whether the final includes both a medial (韵头) and a coda (韵尾), and if so, deleting the medial from the final; determining whether the final includes a first pronunciation then includes: determining whether the final with the medial removed includes a first pronunciation.
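The medial-removal step can be illustrated with a short Python sketch. The text does not state how a medial and a coda are detected, so the rule below is an assumption based on the conventional medial + nucleus + coda structure of a pinyin final; the names `MEDIALS`, `NUCLEI`, `CODAS`, and `strip_medial` are hypothetical, and "v" is accepted as an ASCII stand-in for "ü".

```python
# Assumed building blocks of a written pinyin final (illustrative only).
MEDIALS = ("i", "u", "ü", "v")        # possible medials (韵头)
NUCLEI = ("a", "e", "o")              # main vowels that can follow a medial
CODAS = ("i", "o", "u", "n", "ng")    # possible codas (韵尾)


def strip_medial(final: str) -> str:
    """Drop the medial when the final has both a medial and a coda."""
    has_medial_and_coda = (
        len(final) >= 3
        and final[0] in MEDIALS
        and final[1] in NUCLEI
        and final[2:] in CODAS
    )
    return final[1:] if has_medial_and_coda else final
```

Under this rule "iang" becomes "ang" and "uai" becomes "ai", while "ia" (no coda) and "ing" (no medial) are left unchanged. Contracted spellings such as "iu" or "un" are also left unchanged, which is a limitation of the sketch.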
Training the fuzzy sound recognition model includes: for each character in the sample text, generating a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; and, in the order of the characters in the sample text, inputting the triplet corresponding to each character, together with the label, into the fuzzy sound recognition model to be trained.
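A possible way to assemble the triplets and the label into one training example is sketched below in Python. The input format of the model is not specified in the text, so the dictionary layout, the names `to_triplet` and `build_training_example`, the label value, and the confusion pairs are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (character, fuzzy initial, fuzzy final)

# Two-letter initials first; "y" and "w" as initials is a simplification.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
# Hypothetical confusion tables: first pronunciation -> second pronunciation.
INITIAL_PAIRS = {"zh": "z", "ch": "c", "sh": "s", "h": "f", "l": "n"}
FINAL_PAIRS = {"in": "ing"}


def to_triplet(char: str, pinyin: str) -> Triplet:
    """Build the triplet for one character from its toneless pinyin."""
    initial = next((i for i in INITIALS if pinyin.startswith(i)), "")
    final = pinyin[len(initial):]
    return (char,
            INITIAL_PAIRS.get(initial, initial),
            FINAL_PAIRS.get(final, final))


def build_training_example(chars_with_pinyin: List[Tuple[str, str]],
                           label: str) -> Dict[str, object]:
    """Keep the triplets in text order and attach the sample label."""
    triplets = [to_triplet(c, p) for c, p in chars_with_pinyin]
    return {"triplets": triplets, "label": label}


example = build_training_example(
    [("至", "zhi"), ("行", "xing"), ("车", "che")],
    label="user_complaint",  # hypothetical domain label
)
```

Keeping the list in text order preserves the context that the model is meant to use when deciding which character was intended.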
The label of the sample text includes: a label given in at least one of an emotion dimension, a domain dimension, a subject-matter dimension, and a text-meaning dimension.
According to a second aspect, a method for understanding speech meaning is provided, including: obtaining a first text, the first text being generated by performing speech recognition on speech; for each character in the first text, obtaining the pinyin of the character; obtaining, according to the pinyin of each character, the fuzzy sound corresponding to each character; inputting the first text and the fuzzy sound corresponding to each character in the first text into a fuzzy sound recognition model to obtain a second text output by the model; and interpreting the second text to obtain the meaning of the speech.
Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character in the first text includes: for the pinyin of each character in the first text, determining whether the pinyin includes a first pronunciation, where the first pronunciation satisfies the condition that a second pronunciation is liable to be confused as (mispronounced as) the first pronunciation; if not, using the pinyin directly as the fuzzy sound corresponding to the character; if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.
Obtaining the fuzzy sound corresponding to each character according to the pinyin of each character includes: splitting the pinyin of each character in the first text into an initial consonant and a final; for the initial consonant obtained by the split, determining whether it includes a first pronunciation; if not, using the initial consonant directly as the fuzzy sound corresponding to the initial consonant of the character; if so, replacing the first pronunciation in the initial consonant with the second pronunciation corresponding to that first pronunciation, to obtain the fuzzy sound corresponding to the initial consonant of the character; for the final obtained by the split, determining whether it includes a first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the second pronunciation corresponding to that first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.
For the final obtained by the split, before determining whether the final includes a first pronunciation, the method further includes: determining whether the final includes both a medial (韵头) and a coda (韵尾), and if so, deleting the medial from the final; determining whether the final includes a first pronunciation then includes: determining whether the final with the medial removed includes a first pronunciation.
Inputting the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model includes: for each character in the first text, generating a triplet corresponding to the character, the triplet including the character, the fuzzy sound corresponding to the initial consonant of the character, and the fuzzy sound corresponding to the final of the character; and, in the order of the characters in the first text, inputting the triplets corresponding to the characters into the fuzzy sound recognition model one after another.
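The data flow of the second-aspect method above can be sketched in Python as a function that chains the four stages. Everything here is hypothetical scaffolding: the function `understand_speech` and its parameters are invented for illustration, and the lookup tables merely stand in for a pronunciation dictionary, a trained fuzzy sound recognition model, and a downstream understanding component. The toy data reuses the customer-service example from the detailed description, in which 实物 (shi wu) is recognized as 四五 (si wu).

```python
from typing import Callable, List


def understand_speech(
    first_text: str,
    pinyin_of: Callable[[str], str],
    fuzzy_of: Callable[[str], str],
    model: Callable[[str, List[str]], str],
    understand: Callable[[str], str],
) -> str:
    """Chain the stages: pinyin, fuzzy sounds, correction, understanding."""
    pinyins = [pinyin_of(ch) for ch in first_text]
    fuzzy_sounds = [fuzzy_of(p) for p in pinyins]
    second_text = model(first_text, fuzzy_sounds)
    return understand(second_text)


# Toy stand-ins (all hypothetical, for demonstration only).
PINYIN = {"四": "si", "五": "wu"}
CORRECTIONS = {("si", "wu"): "实物"}    # stands in for the trained model
MEANINGS = {"实物": "physical item"}    # stands in for text understanding

result = understand_speech(
    "四五",
    pinyin_of=PINYIN.__getitem__,
    fuzzy_of=lambda p: p,  # "si" and "wu" need no replacement in this toy case
    model=lambda text, fuzzy: CORRECTIONS.get(tuple(fuzzy), text),
    understand=lambda text: MEANINGS.get(text, "unknown"),
)
```

A real model would use the surrounding context rather than a fixed lookup; the table is used here only so that the sketch runs end to end.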
According to a third aspect, a training device for a fuzzy sound recognition model is provided, including: a sample text acquisition module, configured to obtain a sample text that has semantic meaning and includes multiple characters; a pinyin acquisition module, configured to obtain, for each character in the sample text, the pinyin of the character; a fuzzy sound generation module, configured to obtain, according to the pinyin of each character in the sample text, the fuzzy sound corresponding to each character; and a training execution module, configured to train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
According to a fourth aspect, a device for understanding speech meaning is provided, including: a speech recognition result receiving module, configured to obtain a first text, the first text being generated by performing speech recognition on speech; a character pinyin generation module, configured to obtain, for each character in the sample text, the pinyin of the character; a character fuzzy sound generation module, configured to obtain, according to the pinyin of each character, the fuzzy sound corresponding to each character; an input module, configured to input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model to obtain the second text output by the model; and a speech meaning understanding module, configured to interpret the second text to obtain the meaning of the speech.
According to a fifth aspect, a computing device is provided, including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method of any embodiment of this specification.
The embodiments of this specification make it possible to train a fuzzy sound recognition model that corrects textual errors produced by speech recognition and, based on that model, to understand the meaning of speech more accurately.
Description of the drawings
To describe the technical solutions in the embodiments of this specification more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below evidently show only some embodiments of this specification; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a flow chart of a training method for a fuzzy sound recognition model in one embodiment of this specification.
Figure 2 is a flow chart of a method for understanding speech meaning in one embodiment of this specification.
Figure 3 is a schematic structural diagram of a training device for a fuzzy sound recognition model in one embodiment of this specification.
Figure 4 is a schematic structural diagram of a device for understanding speech meaning in one embodiment of this specification.
Detailed description
As mentioned above, recognition errors often occur when speech is converted into text, and the meaning the speech is meant to express cannot be understood accurately from erroneous text. For example, an intelligent customer service system implemented by a machine asks the user: "Did you buy a physical item or a virtual item?" The user answers by voice and intends to say 实物 (shi wu, "physical item"), but because the user speaks a dialect, speech recognition goes wrong and the recognized text is 四五 (si wu, "four five"). The intelligent customer service system then cannot understand what the user meant from the erroneous text, which leads to a service error.
The solutions provided in this specification are described below with reference to the drawings.
An analysis of the speech recognition process shows that one important cause of recognition errors is non-standard pronunciation: people in different regions may speak different dialects and confuse one pronunciation with another. For example, flat-tongue and retroflex sounds are confused: people in some regions pronounce "z, c, s" in pinyin as "zh, ch, sh", reading 自行车 (zi xing che, "bicycle") as "zhi xing che". As another example, people in some regions pronounce "j, q, x" as "z, c, s", reading 进修 (jin xiu, "further study") as "zin siu". As yet another example, people in some regions confuse "f" with "h", reading 反对 (fan dui, "to oppose") as "huǎn dui".
Therefore, if the pronunciation of each character in the speech-recognized text can be corrected by exploiting the way pronunciations are confused, speech recognition errors caused by users' non-standard pronunciation can be resolved effectively.
To correct speech recognition errors by exploiting such confusions, in one embodiment of this specification a fuzzy sound recognition model can be trained in advance; in actual service applications, the model can then be used to correct the text produced by speech recognition, so that the meaning of the speech is understood more accurately.
The methods of the embodiments of this specification are applicable to various scenarios that use speech recognition, including, for example, Scenarios 1 to 3 below.
Scenario 1: intelligent customer service system
After a user inputs a piece of speech by telephone or over a network, the intelligent customer service system (for example, the machine customer service of the Alipay platform) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the speech meaning understanding method provided by the embodiments of this specification, errors in the recognized text can be corrected to obtain text that better matches the user's intention, so that the machine understands the meaning of the user's speech more accurately, for example whether the user purchased a physical or a virtual item, or whether a return or an exchange is needed.
Scenario 2: artificial intelligence system
After a user utters a piece of speech in a face-to-face conversation, by telephone, or over a network, the artificial intelligence system (for example, a robot) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the speech meaning understanding method provided by the embodiments of this specification, the system can correct errors in the recognized text to obtain text that better matches the user's intention, so that the machine understands the meaning of the user's speech more accurately, for example a command for the robot to change its walking route.
Scenario 3: smart home system based on the Internet of Things
After a user utters a piece of speech in a face-to-face conversation, by telephone, or over a network, the smart home system (for example, a smart TV) performs speech recognition and obtains a piece of text. By applying the fuzzy sound recognition model and the speech meaning understanding method provided by the embodiments of this specification, the system can correct errors in the recognized text to obtain text that better matches the user's intention, so that the machine understands the meaning of the user's speech more accurately, for example a command for the smart TV to switch on and record a television program at a certain time.
The implementation of the embodiments of this specification is described below in two parts: the first part describes the training method of the fuzzy sound recognition model, and the second part describes the method for understanding speech meaning.
First, the training method of the fuzzy sound recognition model is described.
Figure 1 is a flow chart of a training method for a fuzzy sound recognition model in one embodiment of this specification. The method is executed by a training device for the fuzzy sound recognition model. It can be understood that the method can also be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to Figure 1, the method includes: Step 101: obtain a sample text that has semantic meaning and includes multiple characters; Step 103: for each character in the sample text, obtain the pinyin of the character; Step 105: according to the pinyin of each character in the sample text, obtain the fuzzy sound corresponding to each character; Step 107: train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
As can be seen, in the process shown in Figure 1, the training of the fuzzy sound recognition model takes into account that a user's pronunciation may be non-standard, so that the pronunciation of one character, such as a Chinese character, is confused with that of another. The concept of a fuzzy sound is therefore introduced: whatever accent or way of pronouncing the user has, two characters that are confused because of differing pronunciations can be unified into the pronunciation of the same fuzzy sound. The model thereby learns how confused characters are pronounced and, combined with the context of the sample text, obtains the correct characters, so that it can subsequently be used to correct errors in speech-recognized text.
Each step shown in Figure 1 is described below with specific examples.
First, Step 101: obtain a sample text that has semantic meaning and includes multiple characters.
Training the fuzzy sound recognition model requires sample texts. A sample text can be any type of text that has semantic meaning, such as an article, a user complaint, or a product description. So that the model can learn the various pronunciation errors caused by user accents and the corresponding character errors, the sample text should include at least one character produced by a non-standard pronunciation (a user accent or dialect). For example, the sample text includes "……至行车…,…发路": because of the user's accent, 自 (zi) in 自行车 ("bicycle") is pronounced "zhi", and "zhi" yields 至, a character produced by the non-standard pronunciation; likewise, 怒 (nu) in 发怒 ("to get angry") is pronounced "lu", and "lu" yields 路, another character produced by a non-standard pronunciation.
In the embodiments of this specification, a character may be at least one of a Chinese character, an English letter, and a punctuation mark.
In the embodiments of this specification, the sample text has a label, which may be given in at least one of an emotion dimension, a domain dimension, a subject-matter dimension, and a text-meaning dimension. For example, a label may indicate that the emotion expressed by the sample text is anger, that the sample text belongs to the user-complaint domain, or that the meaning of the sample text is that the user purchased a physical item. The label allows the fuzzy sound recognition model to learn whether each character in the sample text and its fuzzy sound are correct.
Next, step 103: for each character in the sample text, obtain the pinyin of that character.
Here, the standard pinyin of each character can be looked up in a dictionary.
For example, if the sample text includes "……至行车…,…发路", step 103 obtains the pinyin "zhi" for the character 至, "xing" for 行, "che" for 车, and "lu" for 路.
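The dictionary lookup of step 103 can be sketched as follows. This is a minimal illustration, not the implementation used in the specification: `PINYIN_DICT` and `get_pinyin` are hypothetical names, and the toy dictionary holds only the characters from the worked example, where a real system would use a full character-to-pinyin lexicon.

```python
# Toy character-to-pinyin dictionary (assumption: only the example characters).
PINYIN_DICT = {"至": "zhi", "行": "xing", "车": "che", "发": "fa", "路": "lu"}


def get_pinyin(text):
    """Return (character, pinyin) pairs in text order.

    Characters without a dictionary entry (punctuation, letters) map to None.
    """
    return [(ch, PINYIN_DICT.get(ch)) for ch in text]


pairs = get_pinyin("至行车")
print(pairs)
```

Returning `None` for characters without an entry is one possible way to handle punctuation and English letters, which the specification also counts as characters.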
Next, step 105: from the pinyin of each character in the sample text, obtain the fuzzy sound corresponding to each character.
The embodiments of this specification introduce the concept of a fuzzy sound. The fuzzy sound corresponding to a character's pinyin satisfies the following: when the pinyin does not contain a first pronunciation (a pronunciation that is easily confused with another), the fuzzy sound is identical to the pinyin; when the pinyin does contain a first pronunciation, the fuzzy sound is the pronunciation with which that pinyin is confused. In this way, whatever accent or manner of pronunciation the user has, two characters that are confused because of differing pronunciation are unified under the same fuzzy sound, which lets the fuzzy sound recognition model learn how confused characters are pronounced.
As noted earlier, users' accents or dialects often lead them to confuse one pronunciation with another: for example, "z", "c", and "s" are confused with "zh", "ch", and "sh" respectively, "ing" is confused with "in", and "f" is confused with "h". The correspondence between first pronunciations and second pronunciations that are confused with each other can therefore be configured in advance, making explicit which pairs are confusable. For example, Table 1 below records such a correspondence.
Table 1
In Table 1, the first pronunciation is usually the sound produced by a user who has an accent or speaks a dialect, and the second pronunciation is the character's original pronunciation. Table 1 is only illustrative: in an actual service, different correspondences can be configured for different deployment locations, that is, for different user accent characteristics. Once a correspondence such as the one in Table 1 has been set up, it can be used to replace the pinyin of the characters in the sample text with fuzzy sounds.
Step 105 can be implemented in either of two ways. Method 1: perform the fuzzy sound replacement with the whole pinyin as the unit. Method 2: perform the fuzzy sound replacement with the initial and the final each as a separate unit.
Method 1 is described first. In one embodiment of this specification, step 105 based on Method 1 proceeds as follows. Step 1051A: for the pinyin of each character in the sample text, determine whether the pinyin contains a first pronunciation, where a first pronunciation is one that a second pronunciation is mistaken for. If not, perform step 1053A; if so, perform step 1055A.
Step 1053A: use the pinyin directly as the fuzzy sound corresponding to the character.
Step 1055A: replace the first pronunciation in the pinyin with the second pronunciation, and use the resulting pinyin as the fuzzy sound corresponding to the character.
The process of steps 1051A to 1055A is illustrated with an example. In step 103 above, the pinyin "zhi", "xing", "che", and "lu" were obtained for the characters of the sample text "……至行车…,…发路". In steps 1051A to 1055A, the pinyin "zhi" of the character 至 contains the first pronunciation "zh" from Table 1, so "zh" is replaced with its corresponding second pronunciation "z", and the resulting pinyin "zi" is the fuzzy sound corresponding to 至. The pinyin "xing" of the character 行 contains no first pronunciation from Table 1, so "xing" is used directly as the fuzzy sound corresponding to 行. Similarly, the pinyin "lu" of the character 路 contains the first pronunciation "l" from Table 1, so "l" is replaced with its corresponding second pronunciation "n", and the resulting pinyin "nu" is the fuzzy sound corresponding to 路.
After steps 1051A to 1055A, the fuzzy sounds corresponding to the characters are, in order: "……zi xing che,…fa nu".
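Steps 1051A to 1055A can be sketched as below. The sketch rests on two assumptions that are not taken from the specification: the mapping `FIRST_TO_SECOND` contains only the two pairs confirmed by the worked example ("zh" to "z" and "l" to "n"), since the contents of Table 1 are not reproduced here, and a first pronunciation is matched only at the start of the syllable, whereas the text says only that the pinyin "contains" it. The names `FIRST_TO_SECOND` and `fuzzy_sound` are hypothetical.

```python
# Assumed excerpt of a Table 1-style mapping: first pronunciation -> second pronunciation.
FIRST_TO_SECOND = {"zh": "z", "l": "n"}


def fuzzy_sound(pinyin, table=FIRST_TO_SECOND):
    """Whole-pinyin replacement (Method 1).

    Step 1051A: check whether the pinyin starts with a first pronunciation.
    Step 1055A: if so, replace it with the corresponding second pronunciation.
    Step 1053A: otherwise use the pinyin unchanged.
    """
    # Try longer first pronunciations first so "zh" is checked before any "z".
    for first in sorted(table, key=len, reverse=True):
        if pinyin.startswith(first):
            return table[first] + pinyin[len(first):]
    return pinyin


fuzzy = [fuzzy_sound(p) for p in ["zhi", "xing", "che", "fa", "lu"]]
print(fuzzy)  # ['zi', 'xing', 'che', 'fa', 'nu']
```

With a different deployment location, only the mapping would change; the replacement logic stays the same.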
Method 2 is described next. To improve training efficiency and reduce training difficulty, one embodiment of this specification uses Method 2: the pinyin is split into an initial and a final, the initial and the final are each checked for a first pronunciation, and each is replaced with its fuzzy sound separately. In this case step 105 comprises the following. Step 1051B: split the pinyin of each character in the sample text into an initial and a final. Step 1053B: for the initial, determine whether it contains a first pronunciation; if not, use the initial directly as the fuzzy sound corresponding to the character's initial; if so, replace the first pronunciation in the initial with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's initial. Step 1055B: for the final, determine whether it contains a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the character's final; if so, replace the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's final.
In one embodiment of this specification, to further improve training efficiency and reduce training difficulty, the final can be simplified further by deleting its medial (韵头). Take the pinyin "guang", whose final is "uang": the medial "u" contributes relatively little to the pronunciation of "guang", whereas "ang" contributes a great deal and is the key part of the final. To improve efficiency, the effect of the medial on pronunciation can therefore be ignored during training. Accordingly, step 1055B further includes, before determining whether the final contains a first pronunciation: determining whether the final includes both a medial and a coda (韵尾) and, if so, deleting the medial from the final. Step 1055B then determines whether the final with its medial deleted contains a first pronunciation; if not, the final with its medial deleted is used directly as the fuzzy sound corresponding to the character's final; if so, the first pronunciation in that final is replaced with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's final.
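Method 2, including the optional medial deletion, can be sketched as follows. Several details are assumptions made for illustration only: the 23-entry initial list (which treats "y" and "w" as initials), the small `MEDIAL_STRIPPED` table listing finals that have both a medial and a coda, the mappings `INITIAL_MAP` and `FINAL_MAP` (the latter derived from the "ing"/"in" confusion mentioned above), and whole-unit matching of initials and finals. All names are hypothetical.

```python
# Two-letter initials come first so prefix matching picks "zh" before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

# Assumed table of finals with both a medial and a coda -> final without the medial.
MEDIAL_STRIPPED = {"iao": "ao", "ian": "an", "iang": "ang", "iong": "ong",
                   "uai": "ai", "uan": "an", "uang": "ang"}

# Assumed Table 1-style mappings (first pronunciation -> second pronunciation).
INITIAL_MAP = {"zh": "z", "l": "n"}
FINAL_MAP = {"in": "ing"}


def split_pinyin(pinyin):
    """Step 1051B: split a syllable into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    return "", pinyin  # zero-initial syllable such as "an"


def fuzzy_parts(pinyin):
    """Steps 1053B and 1055B: fuzzy sounds of the initial and of the final."""
    initial, final = split_pinyin(pinyin)
    final = MEDIAL_STRIPPED.get(final, final)      # delete the medial if present
    fuzzy_initial = INITIAL_MAP.get(initial, initial)
    fuzzy_final = FINAL_MAP.get(final, final)
    return fuzzy_initial, fuzzy_final


for syllable in ["guang", "zhi", "lu", "xin"]:
    print(syllable, fuzzy_parts(syllable))
```

Keeping the medial table explicit makes it easy to restrict deletion to exactly the finals a deployment considers safe to simplify.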
Next, step 107: train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
Continuing the example above, if step 105 uses Method 1, the information input to the fuzzy sound recognition model in step 107 includes "……至(zi)行(xing)车(che)…,…发(fa)路(nu)" together with the label of the sample text, such as "traffic accident dispute".
If step 105 uses Method 2, step 107 is implemented as follows. Step 1071: for each character in the sample text, generate a triplet corresponding to that character, consisting of the character, the fuzzy sound corresponding to the character's initial, and the fuzzy sound corresponding to the character's final. Step 1073: in the order in which the characters appear in the sample text, input the triplet of each character, together with the label, into the fuzzy sound recognition model to be trained.
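Steps 1071 and 1073 can be sketched as the construction of an ordered list of triplets paired with the label. The function `build_triplets`, its two callable parameters, the toy lookup tables, and the use of empty strings for characters that have no pinyin are all assumptions for illustration; the specification does not prescribe a data layout.

```python
def build_triplets(text, pinyin_of, fuzzy_parts):
    """Step 1071: one (character, fuzzy initial, fuzzy final) triplet per character.

    pinyin_of(ch) returns the pinyin of a character or None.
    fuzzy_parts(pinyin) returns (fuzzy initial, fuzzy final).
    """
    triplets = []
    for ch in text:
        pinyin = pinyin_of(ch)
        if pinyin is None:
            # Assumed handling for punctuation or letters: no fuzzy sound.
            triplets.append((ch, "", ""))
        else:
            fuzzy_initial, fuzzy_final = fuzzy_parts(pinyin)
            triplets.append((ch, fuzzy_initial, fuzzy_final))
    return triplets


# Toy stand-ins for the dictionary lookup and the Method 2 replacement.
toy_pinyin = {"发": "fa", "路": "lu"}
toy_fuzzy = {"fa": ("f", "a"), "lu": ("n", "u")}

triplets = build_triplets("发路", toy_pinyin.get, toy_fuzzy.__getitem__)
# Step 1073: the ordered triplets and the label form one training input.
training_sample = (triplets, "traffic accident dispute")
print(training_sample)
```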
Compare Method 1 and Method 2. There are normally 23 initials and 24 finals. With Method 1, the fuzzy sound recognition model has 23*24 = 552 unknowns in total to learn in order to learn the fuzzy sounds. With Method 2, it has only 23+24 = 47 unknowns to learn. Method 2 can therefore greatly improve the training efficiency of the fuzzy sound recognition model and reduce the training difficulty.
Whether Method 1 or Method 2 is used, the fuzzy sound recognition model has already learned from other characters seen during training that 自 is pronounced "zi" and 怒 is pronounced "nu". Combining this with the context and label of the sample text, the model can correct the sample text on the basis of the input information, for example to "……自行车…,…发怒", and thereby arrive at the correct meaning.
It can be understood that the fuzzy sound recognition model is trained over multiple rounds using multiple sample texts. Each round follows the embodiments above, and training continues until the model converges.
Once the fuzzy sound recognition model has been trained, it can be used to understand the meaning of speech.
The second aspect, a method for understanding speech meaning, is described below.
Figure 2 is a flowchart of a method for understanding speech meaning in one embodiment of this specification. The method is performed by an apparatus for understanding speech meaning. It can also be performed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to Figure 2, the method includes steps 201 to 209.
Step 201: obtain a first text, which is generated by performing speech recognition on speech.
Step 203: for each character in the first text, obtain the pinyin of that character.
Step 205: from the pinyin of each character in the first text, obtain the fuzzy sound corresponding to each character in the first text.
Step 207: input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model, and obtain a second text output by the model.
Step 209: interpret the second text to obtain the meaning of the speech.
In the process shown in Figure 2, whatever accent or manner of pronunciation the user has, two characters that are confused because of differing pronunciation are unified under the same fuzzy sound. Combining this with the context of the first text, the fuzzy sound recognition model can recover the characters the user's speech actually intended, that is, a second text that reflects the true semantics. The errors in the speech-recognized first text are thereby corrected, and the second text lets a machine understand the meaning of the user's speech more accurately.
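The flow of steps 203 to 209 can be sketched as a small pipeline that receives each stage as a callable. Everything concrete here is a stand-in chosen for illustration: `stub_model` and `stub_understand` are hard-coded placeholders, not a trained fuzzy sound recognition model or a real understanding component, and the toy dictionary and the "l" to "n" mapping cover only the worked example. All names are hypothetical.

```python
def understand_speech(first_text, get_pinyin, to_fuzzy, model, understand):
    """Run steps 203-209 on a first text produced by speech recognition (step 201)."""
    pinyins = [get_pinyin(ch) for ch in first_text]                      # step 203
    fuzzy = [to_fuzzy(p) if p is not None else None for p in pinyins]    # step 205
    second_text = model(first_text, fuzzy)                               # step 207
    return second_text, understand(second_text)                          # step 209


# Toy stand-ins (assumptions, for illustration only).
TOY_DICT = {"发": "fa", "路": "lu"}
TOY_TABLE = {"l": "n"}  # first pronunciation -> second pronunciation


def toy_fuzzy(pinyin):
    for first, second in TOY_TABLE.items():
        if pinyin.startswith(first):
            return second + pinyin[len(first):]
    return pinyin


def stub_model(text, fuzzy):
    # Placeholder for the trained model: corrects one known fuzzy-sound sequence.
    return "发怒" if fuzzy == ["fa", "nu"] else text


def stub_understand(text):
    # Placeholder for downstream understanding of the second text.
    return "anger" if "发怒" in text else "unknown"


second_text, meaning = understand_speech(
    "发路", TOY_DICT.get, toy_fuzzy, stub_model, stub_understand)
print(second_text, meaning)
```

Passing the stages in as parameters keeps the sketch independent of how the dictionary, the replacement method (Method 1 or Method 2), and the model are actually realized.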
Each step in Figure 2 is described below.
First, step 201: obtain a first text, which is generated by performing speech recognition on speech.
Here, the first text is generated in an actual application scenario by performing speech recognition on a user's speech. For example, a user inputs a segment of speech to an intelligent customer service system, and a speech recognition system recognizes the speech to produce the first text.
Next, step 203: for each character in the first text, obtain the pinyin of that character.
Here, the pinyin of each character can be looked up in a dictionary.
Next, step 205: from the pinyin of each character in the first text, obtain the fuzzy sound corresponding to each character in the first text.
Step 205 can likewise be implemented using either Method 1 or Method 2 described above.
When Method 1 is used, in one embodiment of this specification, step 205 is implemented as follows: for the pinyin of each character in the first text, determine whether the pinyin contains a first pronunciation, where a first pronunciation is one that a second pronunciation is mistaken for; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the resulting pinyin as the fuzzy sound corresponding to the character.
When Method 2 is used, in one embodiment of this specification, step 205 is implemented as follows: split the pinyin of each character in the first text into an initial and a final; for the initial, determine whether it contains a first pronunciation; if not, use the initial directly as the fuzzy sound corresponding to the character's initial; if so, replace the first pronunciation in the initial with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's initial; for the final, determine whether it contains a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the character's final; if so, replace the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's final.
When Method 2 is used, before determining whether the final contains a first pronunciation, the method further includes: determining whether the final includes both a medial and a coda and, if so, deleting the medial from the final. Determining whether the final contains a first pronunciation then means determining whether the final with its medial deleted contains a first pronunciation.
For the specific implementation of step 205, refer to the description of step 105 above; the processing follows the same approach.
Next, step 207: input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model, and obtain the second text output by the model. When step 205 is implemented using Method 2, step 207 includes: for each character in the first text, generating a triplet corresponding to that character, consisting of the character, the fuzzy sound corresponding to the character's initial, and the fuzzy sound corresponding to the character's final; and, in the order in which the characters appear in the first text, inputting the triplet of each character into the trained fuzzy sound recognition model.
For further explanation of step 207, refer to the description of step 107 above; the processing follows the same approach.
One embodiment of this specification provides a training apparatus for a fuzzy sound recognition model. Referring to Figure 3, the apparatus includes: a sample text acquisition module 301, configured to obtain a sample text that carries meaning and includes multiple characters; a pinyin acquisition module 302, configured to obtain, for each character in the sample text, the pinyin of that character; a fuzzy sound generation module 303, configured to obtain the fuzzy sound corresponding to each character from the pinyin of each character in the sample text; and a training execution module 304, configured to train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and the label of the sample text.
In one embodiment of the apparatus of this specification, the fuzzy sound generation module 303 is configured to: for the pinyin of each character in the sample text, determine whether the pinyin contains a first pronunciation, where a first pronunciation is one that a second pronunciation is mistaken for; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the resulting pinyin as the fuzzy sound corresponding to the character.
In another embodiment of the apparatus of this specification, the fuzzy sound generation module 303 is configured to: split the pinyin of each character in the sample text into an initial and a final; for the initial, determine whether it contains a first pronunciation; if not, use the initial directly as the fuzzy sound corresponding to the character's initial; if so, replace the first pronunciation in the initial with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's initial; for the final, determine whether it contains a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the character's final; if so, replace the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's final.
In one embodiment of the apparatus of this specification, the fuzzy sound generation module 303 is configured to: for the final, before determining whether it contains a first pronunciation, determine whether the final includes both a medial and a coda and, if so, delete the medial from the final; and then determine whether the final with its medial deleted contains a first pronunciation.
In one embodiment of the apparatus of this specification, the training execution module 304 is configured to: for each character in the sample text, generate a triplet corresponding to that character, consisting of the character, the fuzzy sound corresponding to the character's initial, and the fuzzy sound corresponding to the character's final; and, in the order in which the characters appear in the sample text, input the triplet of each character, together with the label, into the fuzzy sound recognition model to be trained.
In one embodiment of the apparatus of this specification, the label of the sample text is given along at least one of the following dimensions: emotion, domain, subject matter, and text meaning.
One embodiment of this specification provides an apparatus for understanding speech meaning. Referring to Figure 4, the apparatus includes: a speech recognition result receiving module 401, configured to obtain a first text generated by performing speech recognition on speech; a character pinyin generation module 402, configured to obtain, for each character in the first text, the pinyin of that character; a character fuzzy sound generation module 403, configured to obtain the fuzzy sound corresponding to each character from the pinyin of each character; an input module 404, configured to input the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model and obtain the second text output by the model; and a speech meaning understanding module 405, configured to interpret the second text to obtain the meaning of the speech.
In the apparatus for understanding speech meaning of one embodiment of this specification, the character fuzzy sound generation module 403 is configured to: for the pinyin of each character in the first text, determine whether the pinyin contains a first pronunciation, where a first pronunciation is one that a second pronunciation is mistaken for; if not, use the pinyin directly as the fuzzy sound corresponding to the character; if so, replace the first pronunciation in the pinyin with the second pronunciation, and use the resulting pinyin as the fuzzy sound corresponding to the character.
In the apparatus for understanding speech meaning of another embodiment of this specification, the character fuzzy sound generation module 403 is configured to: split the pinyin of each character in the first text into an initial and a final; for the initial, determine whether it contains a first pronunciation; if not, use the initial directly as the fuzzy sound corresponding to the character's initial; if so, replace the first pronunciation in the initial with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's initial; for the final, determine whether it contains a first pronunciation; if not, use the final directly as the fuzzy sound corresponding to the character's final; if so, replace the first pronunciation in the final with the corresponding second pronunciation to obtain the fuzzy sound corresponding to the character's final.
In the apparatus for understanding speech meaning of an embodiment of this specification, the character fuzzy sound generation module 403 is configured to: for the final, before determining whether it contains a first pronunciation, determine whether the final includes both a medial and a coda and, if so, delete the medial from the final; and then determine whether the final with its medial deleted contains a first pronunciation.
In the apparatus for understanding speech meaning of an embodiment of this specification, the input module 404 is configured to: for each character in the first text, generate a triplet corresponding to that character, consisting of the character, the fuzzy sound corresponding to the character's initial, and the fuzzy sound corresponding to the character's final; and, in the order in which the characters appear in the first text, input the triplet of each character into the fuzzy sound recognition model.
本说明书一个实施例提供了一种计算机可读存储介质,其上存储有计算机程序,当所述计算机程序在计算机中执行时,令计算机执行说明书中任一个实施例中的方法。One embodiment of the present specification provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed in a computer, the computer is caused to execute the method in any embodiment of the specification.
本说明书一个实施例提供了一种计算设备,包括存储器和处理器,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现执行说明书中任一个实施例中的方法。One embodiment of this specification provides a computing device, including a memory and a processor. The memory stores executable code. When the processor executes the executable code, it implements any of the embodiments in the specification. method.
It can be understood that the structures illustrated in the embodiments of this specification do not specifically limit the apparatuses of those embodiments. In other embodiments of this specification, the above apparatuses may include more or fewer components than illustrated, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the information exchange between the modules of the above apparatuses and systems, and their execution processes, are based on the same concept as the method embodiments of this specification, the details can be found in the description of the method embodiments and are not repeated here.
The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding parts of the method embodiments.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of this disclosure in detail. It should be understood that the above are merely specific embodiments of this disclosure and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of this disclosure shall fall within the scope of protection of this disclosure.

Claims (15)

  1. A method for training a fuzzy sound recognition model, comprising:
    obtaining a semantically meaningful sample text comprising a plurality of characters;
    for each character in the sample text, obtaining the pinyin of the character;
    obtaining, according to the pinyin of each character in the sample text, the fuzzy sound corresponding to each character;
    training the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and a label of the sample text.
  2. The method according to claim 1, wherein obtaining, according to the pinyin of each character in the sample text, the fuzzy sound corresponding to each character comprises:
    for the pinyin of each character in the sample text, determining whether the pinyin includes a first pronunciation, the first pronunciation satisfying the condition that the sound of a second pronunciation tends to be confused with the sound of the first pronunciation;
    if not, using the pinyin directly as the fuzzy sound corresponding to the character;
    if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.
  3. The method according to claim 1, wherein obtaining, according to the pinyin of each character in the sample text, the fuzzy sound corresponding to each character comprises:
    splitting the pinyin of each character in the sample text into an initial and a final;
    for the split-out initial, determining whether the initial includes a first pronunciation; if not, using the initial directly as the fuzzy sound corresponding to the initial of the character; if so, replacing the first pronunciation in the initial with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the initial of the character;
    for the split-out final, determining whether the final includes a first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.
  4. The method according to claim 3, further comprising, for the split-out final and before determining whether the final includes a first pronunciation: determining whether the final includes both a medial (韵头) and a coda (韵尾), and if so, deleting the medial from the final;
    wherein determining whether the final includes a first pronunciation comprises: determining whether the final, after its medial has been deleted, includes a first pronunciation.
  5. The method according to claim 3, wherein training the fuzzy sound recognition model comprises:
    for each character in the sample text, generating a triplet corresponding to the character, the triplet comprising: the character, the fuzzy sound corresponding to the initial of the character, and the fuzzy sound corresponding to the final of the character;
    following the order of the characters in the sample text, inputting the triplets corresponding to the characters, together with the label, into the fuzzy sound recognition model to be trained one after another.
  6. The method according to claim 1, wherein the label of the sample text comprises a label given along at least one of the following dimensions: an emotion dimension, a domain dimension, a subject-matter dimension, and a text-meaning dimension.
  7. A method for understanding speech meaning, comprising:
    obtaining a first text, the first text being generated by performing speech recognition on speech;
    for each character in the first text, obtaining the pinyin of the character;
    obtaining, according to the pinyin of each character, the fuzzy sound corresponding to each character;
    inputting the first text and the fuzzy sound corresponding to each character in the first text into a fuzzy sound recognition model, to obtain a second text output by the fuzzy sound recognition model;
    understanding the second text to obtain the meaning of the speech.
  8. The method according to claim 7, wherein obtaining, according to the pinyin of each character in the first text, the fuzzy sound corresponding to each character comprises:
    for the pinyin of each character in the first text, determining whether the pinyin includes a first pronunciation, the first pronunciation satisfying the condition that the sound of a second pronunciation tends to be confused with the sound of the first pronunciation;
    if not, using the pinyin directly as the fuzzy sound corresponding to the character;
    if so, replacing the first pronunciation in the pinyin with the second pronunciation, and using the pinyin obtained after the replacement as the fuzzy sound corresponding to the character.
  9. The method according to claim 7, wherein obtaining, according to the pinyin of each character, the fuzzy sound corresponding to each character comprises:
    splitting the pinyin of each character in the first text into an initial and a final;
    for the split-out initial, determining whether the initial includes a first pronunciation; if not, using the initial directly as the fuzzy sound corresponding to the initial of the character; if so, replacing the first pronunciation in the initial with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the initial of the character;
    for the split-out final, determining whether the final includes a first pronunciation; if not, using the final directly as the fuzzy sound corresponding to the final of the character; if so, replacing the first pronunciation in the final with the second pronunciation corresponding to the first pronunciation, to obtain the fuzzy sound corresponding to the final of the character.
  10. The method according to claim 9, further comprising, for the split-out final and before determining whether the final includes a first pronunciation: determining whether the final includes both a medial (韵头) and a coda (韵尾), and if so, deleting the medial from the final;
    wherein determining whether the final includes a first pronunciation comprises: determining whether the final, after its medial has been deleted, includes a first pronunciation.
  11. The method according to claim 9, wherein inputting the first text and the fuzzy sound corresponding to each character in the first text into the fuzzy sound recognition model comprises:
    for each character in the first text, generating a triplet corresponding to the character, the triplet comprising: the character, the fuzzy sound corresponding to the initial of the character, and the fuzzy sound corresponding to the final of the character;
    following the order of the characters in the first text, inputting the triplets corresponding to the characters into the fuzzy sound recognition model one after another.
  12. An apparatus for training a fuzzy sound recognition model, comprising:
    a sample text acquisition module, configured to obtain a semantically meaningful sample text comprising a plurality of characters;
    a pinyin acquisition module, configured to obtain, for each character in the sample text, the pinyin of the character;
    a fuzzy sound generation module, configured to obtain, according to the pinyin of each character in the sample text, the fuzzy sound corresponding to each character;
    a training execution module, configured to train the fuzzy sound recognition model using the sample text, the fuzzy sound corresponding to each character in the sample text, and a label of the sample text.
  13. An apparatus for understanding speech meaning, comprising:
    a speech recognition result receiving module, configured to obtain a first text, the first text being generated by performing speech recognition on speech;
    a character pinyin generation module, configured to obtain, for each character in the sample text, the pinyin of the character;
    a character fuzzy sound generation module, configured to obtain, according to the pinyin of each character, the fuzzy sound corresponding to each character;
    an input module, configured to input the first text and the fuzzy sound corresponding to each character in the first text into a fuzzy sound recognition model, to obtain a second text output by the fuzzy sound recognition model;
    a speech meaning understanding module, configured to understand the second text to obtain the meaning of the speech.
  14. A computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method according to any one of claims 1-11.
  15. A computing device, comprising a memory and a processor, the memory storing executable code, wherein the processor, when executing the executable code, implements the method according to any one of claims 1-11.
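
The medial-deletion step recited in claims 4 and 10 can be illustrated with a short sketch. Everything here is an assumption made for the example: the sets of medials, nucleus vowels, and codas, the helper names, and the sample confusion table `FUZZY_FINALS` are not specified in the claims, and the structural test is a simplification of real pinyin phonology.

```python
# Assumed inventories for toneless pinyin finals ("v" is a common ASCII
# stand-in for "ü"). These sets are illustrative, not taken from the claims.
MEDIALS = {"i", "u", "ü", "v"}
NUCLEI = {"a", "e", "o"}
CODAS = {"i", "o", "u", "n", "ng"}

# Hypothetical confusion table mapping a first pronunciation to the second
# pronunciation that replaces it.
FUZZY_FINALS = {"ang": "an", "eng": "en", "ing": "in"}


def strip_medial(final):
    """Delete the medial only when the final has both a medial and a coda."""
    has_medial = len(final) >= 3 and final[0] in MEDIALS and final[1] in NUCLEI
    if has_medial and final[2:] in CODAS:
        return final[1:]
    return final


def fuzzy_final(final):
    """Strip the medial first, then apply the confusion table."""
    stripped = strip_medial(final)
    return FUZZY_FINALS.get(stripped, stripped)


result = fuzzy_final("iang")  # medial "i" removed, then "ang" mapped to "an"
```

Under these assumptions, a final such as `ia` (medial but no coda) or `ing` (no medial, since `i` is the main vowel) is left unchanged by `strip_medial`.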
PCT/CN2023/093289 2022-05-23 2023-05-10 Model training method and apparatus, and speech meaning understanding method and apparatus WO2023226767A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210561117.1A CN115019786A (en) 2022-05-23 2022-05-23 Model training method and device and speech meaning understanding method and device
CN202210561117.1 2022-05-23

Publications (1)

Publication Number Publication Date
WO2023226767A1 true WO2023226767A1 (en) 2023-11-30

Family

ID=83069173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/093289 WO2023226767A1 (en) 2022-05-23 2023-05-10 Model training method and apparatus, and speech meaning understanding method and apparatus

Country Status (2)

Country Link
CN (1) CN115019786A (en)
WO (1) WO2023226767A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019786A (en) * 2022-05-23 2022-09-06 支付宝(杭州)信息技术有限公司 Model training method and device and speech meaning understanding method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302795A (en) * 2015-11-11 2016-02-03 河海大学 Chinese text verification system and method based on Chinese vague pronunciation and voice recognition
WO2019096068A1 (en) * 2017-11-14 2019-05-23 蔚来汽车有限公司 Voice recognition and error correction method and voice recognition and error correction system
CN109710929A (en) * 2018-12-18 2019-05-03 金蝶软件(中国)有限公司 A kind of bearing calibration, device, computer equipment and the storage medium of speech recognition text
CN113807080A (en) * 2020-06-15 2021-12-17 科沃斯商用机器人有限公司 Text correction method, text correction device and storage medium
CN113378553A (en) * 2021-04-21 2021-09-10 广州博冠信息科技有限公司 Text processing method and device, electronic equipment and storage medium
CN115019786A (en) * 2022-05-23 2022-09-06 支付宝(杭州)信息技术有限公司 Model training method and device and speech meaning understanding method and device

Also Published As

Publication number Publication date
CN115019786A (en) 2022-09-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23810840

Country of ref document: EP

Kind code of ref document: A1