WO2023077878A1 - Speech control method and apparatus, electronic device, and readable storage medium - Google Patents

Info

Publication number
WO2023077878A1
Authority
WO
WIPO (PCT)
Prior art keywords
pinyin
content
pinyin content
description information
phoneme
Prior art date
Application number
PCT/CN2022/107788
Other languages
French (fr)
Chinese (zh)
Inventor
曾理
张晓帆
Original Assignee
杭州逗酷软件科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州逗酷软件科技有限公司
Publication of WO2023077878A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/26 Speech to text systems

Definitions

  • the present application relates to the field of computer technology, and more specifically, to a voice control method, device, electronic equipment and readable storage medium.
  • the electronic device can receive voice control instructions issued by the user through an auditory mode to realize voice control of the electronic device.
  • however, in related voice control processes, the probability of accurately executing voice control still needs to be improved.
  • the present application proposes a voice control method, device, electronic equipment and readable storage medium to address the above problem.
  • the present application provides a voice control method, the method comprising: acquiring first pinyin content and a plurality of second pinyin contents, the first pinyin content being the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents including the pinyin content of the descriptive information to be selected, and the descriptive information being information used to describe the corresponding operation; when none of the second pinyin contents successfully matches the first pinyin content, acquiring third pinyin content, the third pinyin content being pinyin content similar to the first pinyin content; matching the third pinyin content against the plurality of second pinyin contents, and using the description information whose second pinyin content successfully matches the third pinyin content as the target description information; and executing the control operation corresponding to the target description information.
  • the present application provides a voice control device, which includes: a first and second pinyin content acquiring unit, configured to acquire first pinyin content and a plurality of second pinyin contents, the first pinyin content being the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents including the pinyin content of the descriptive information to be selected, and the descriptive information being information used to describe the corresponding operation; a third pinyin content acquiring unit, configured to acquire third pinyin content when none of the second pinyin contents successfully matches the first pinyin content, the third pinyin content being pinyin content similar to the first pinyin content; a pinyin content matching unit, configured to match the third pinyin content against the plurality of second pinyin contents and to use the description information whose second pinyin content successfully matches the third pinyin content as the target description information; and a control operation execution unit, configured to execute the control operation corresponding to the target description information.
  • the present application provides an electronic device, including one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, The one or more programs are configured to perform the methods described above.
  • the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, wherein the above method is executed when the program code is running.
  • the present application provides a computer program product, including a computer program/instruction, which implements the steps of the above method when the computer program/instruction is executed by a processor.
  • FIG. 1 shows a schematic diagram of an application scenario of a voice control method proposed in an embodiment of the present application
  • FIG. 2 shows a schematic diagram of an application scenario of another voice control method proposed in the embodiment of the present application
  • FIG. 3 shows a flow chart of a voice control method proposed in an embodiment of the present application
  • FIG. 4 shows a flowchart of an embodiment of S120 in FIG. 3 of the present application
  • FIG. 5 shows a flow chart of a voice control method proposed in another embodiment of the present application.
  • FIG. 6 shows a flowchart of an embodiment of S230 in FIG. 5 of the present application
  • Fig. 7 shows a schematic diagram of obtaining the first alternative pinyin content corresponding to each phoneme pair proposed by the present application
  • Fig. 8 shows a schematic diagram of obtaining the second alternative pinyin content corresponding to each specified phoneme proposed by the present application
  • FIG. 9 shows a flow chart of a voice control method proposed in another embodiment of the present application.
  • FIG. 10 shows a flowchart of an embodiment of S340 in FIG. 9 of the present application.
  • FIG. 11 shows a flowchart of an embodiment of S350 in FIG. 9 of the present application.
  • FIG. 12 shows a schematic diagram of the implementation process of a voice control method proposed in this application.
  • FIG. 13 shows a structural block diagram of a voice control device proposed in the embodiment of the present application.
  • Fig. 14 shows a structural block diagram of an electronic device proposed by the present application
  • Fig. 15 shows a storage unit for storing or carrying program code that implements the voice control method according to the embodiment of the present application.
  • ASR Automatic Speech Recognition
  • the user's voice control command may be recognized as a similar-sounding string; for example, "shang hua" (swipe up) may be recognized as "shao hua", or as "xiao hua" (joke), and so on.
  • the inventor proposes a voice control method, device, electronic equipment, and computer program product in the present application.
  • the method obtains the pinyin content corresponding to the voice control instruction as the first pinyin content and the pinyin content of the descriptive information to be selected as a plurality of second pinyin contents. If it is determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content; the third pinyin content is then matched against the plurality of second pinyin contents, the description information whose second pinyin content successfully matches the third pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed.
  • in this way, after the pinyin content directly converted from the voice control instruction is obtained, if it cannot be successfully matched with the pinyin content of the descriptive information to be selected, similar pinyin content derived from the directly converted content can be matched instead. This raises the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps improve the probability of accurately executing voice control.
  • the provided voice control method may be executed by an electronic device.
  • all the steps in the voice control method provided in the embodiment of the present application may be executed by the electronic device.
  • the voice collection device of the electronic device 100 can collect the voice control instruction and transmit the collected voice control instruction and the descriptive information to be selected to the processor, so that the processor can obtain the first pinyin content and the plurality of second pinyin contents. The processor then uses the first pinyin content, the plurality of second pinyin contents, and pinyin content similar to the first pinyin content (the third pinyin content) to perform the subsequent steps of the voice control method.
  • the voice control method provided in the embodiment of the present application may also be executed by a server.
  • the electronic device can collect the voice control instruction and send it to the server synchronously; the server then executes the voice control method provided by the embodiment of the present application to determine the target description information, and generates an operation instruction according to the target description information.
  • it can also be executed cooperatively by the electronic device and the server. In the way that the electronic device and the server cooperate to execute, some steps in the voice control method provided by the embodiment of the present application are executed by the electronic device, while other parts of the steps are executed by the server.
  • for example, the electronic device 100 may execute the step of acquiring the first pinyin content and the plurality of second pinyin contents, and the server 200 may then perform the subsequent steps.
  • the division of steps between the electronic device and the server is not limited to the examples above. In practical applications, the steps performed by the electronic device and by the server can be adjusted dynamically according to the actual situation.
  • a voice control method provided by the present application, the method includes:
  • S110 Obtain first pinyin content and a plurality of second pinyin contents, where the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of the descriptive information to be selected, and the descriptive information is information used to describe the corresponding operation.
  • the user can express his control intention through voice.
  • the electronic device may use the voice uttered by the user as a voice control instruction.
  • the command here refers to the command for the user to manipulate the interactive interface or the elements on the interactive interface.
  • the voice control command may include: swipe left, open Toutiao, bilibili, play XXX, the third one in the second row, return, swipe up, install Douyin, next song, desktop, etc.
  • the first pinyin content can be obtained through Automatic Speech Recognition (ASR) technology and Natural Language Processing (NLP) technology.
  • the electronic device can transmit the user's voice control command to the ASR module to obtain the command text corresponding to the voice control command, and then use the pinyin content corresponding to the command text as the first pinyin content.
  • the NLP module can also be used to extract the user intent, the control object, and the object's auxiliary information from the instruction text and integrate them into a triple of the form {action, object, information}, where action represents the user intent, object represents the control object, and information represents the object's auxiliary information.
  • user intent refers to the action the user wants to perform, such as: click, swipe, long press, etc.
  • auxiliary information refers to information that may accompany the control object. For example, when text is to be entered into a text box, the text box is the control object and the text to be filled in is the auxiliary information.
  • the control object and auxiliary information are not necessarily mandatory.
  • the pinyin corresponding to the control object in the triple can be used as the first pinyin content; if the control object of the triple is empty, the pinyin corresponding to the user intent can be used as the first pinyin content instead.
  • for example, the user's voice control instruction can be "Open Toutiao" (Today's Headlines). The triple obtained through the ASR module and the NLP module is then {click, Today's Headlines, }, where the user intent is "click", the control object is "Today's Headlines", and the object's auxiliary information is empty, so the first pinyin content is "jin ri tou tiao".
  • as another example, the user instruction can be "swipe up". The triple obtained through the ASR module and the NLP module is then {swipe up, , }, where the user intent is "swipe up" and both the control object and the object's auxiliary information are empty, so the first pinyin content is "shang hua".
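  • the selection rule above (control object first, user intent as the fallback) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Triple` type, the function name, and the toy converter table are all hypothetical, and the Chinese source strings are assumed originals of the translated examples.

```python
from typing import Callable, NamedTuple, Optional


class Triple(NamedTuple):
    """{action, object, information} triple; object and information may be empty."""
    action: str
    obj: Optional[str] = None
    info: Optional[str] = None


def first_pinyin_content(triple: Triple, to_pinyin: Callable[[str], str]) -> str:
    """Use the control object's pinyin if present, otherwise the user intent's."""
    source = triple.obj if triple.obj else triple.action
    return to_pinyin(source)


# Toy converter for illustration only (hypothetical table);
# a real system would call a text-to-pinyin library instead.
TOY_PINYIN = {"今日头条": "jin ri tou tiao", "上滑": "shang hua"}


def toy_to_pinyin(text: str) -> str:
    return TOY_PINYIN[text]


# Control object present: its pinyin becomes the first pinyin content.
print(first_pinyin_content(Triple("点击", "今日头条"), toy_to_pinyin))
# Control object empty: fall back to the user intent.
print(first_pinyin_content(Triple("上滑"), toy_to_pinyin))
```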
  • the descriptive information to be selected may be a collection of descriptive information of operations that the electronic device can perform when the voice control instruction is acquired.
  • the operations that can be performed by the electronic device may be operations performed on the entire electronic device, for example, shutting down, switching operation modes, or taking pictures.
  • the operations that can be performed by the electronic device may include operations performed on the target interface.
  • the target interface may be the interface currently displayed by the electronic device.
  • the descriptive information to be selected may include the respective description information of multiple controls in the target interface, for example: "Fenghuo Kangda", "Olympic Highlights", "Season 8 of The Lonely Gourmet", etc.
  • the descriptive information to be selected can also include the description information corresponding to all interface-wide operation instructions, such as: swipe left, swipe right, swipe up, swipe down, return, desktop, double-click, long press, etc.
  • the second pinyin content may be acquired by acquiring the pinyin content corresponding to all the description information to be selected.
  • description information of multiple controls included in the target interface may be acquired as candidate description information, and then the candidate description information may be converted into corresponding pinyin content to obtain multiple second pinyin content.
  • description information corresponding to all interface overall operation instructions may also be obtained as candidate description information, and then the candidate description information is converted into corresponding pinyin content to obtain multiple second pinyin content.
  • the plurality of second pinyin contents may also include pinyin contents corresponding to the description information corresponding to the overall operation instruction of the interface, and pinyin contents corresponding to the respective description information of the multiple controls included in the target interface.
  • the description information of multiple controls included in the target interface can be obtained through the system program.
  • the electronic device can use the system program to analyze the code corresponding to the target interface and obtain information such as the type, position, and size of each control as the description information of that control.
  • existing libraries can be used to obtain the pinyin corresponding to a text, for example pypinyin and xpinyin for Python or pinyin4j for Java. Which one to use for converting text to pinyin can be chosen according to the actual development environment.
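  • a minimal sketch of assembling the second pinyin contents from control descriptions and interface-wide commands is shown below. The function name and the toy converter table are hypothetical; the converter is passed in as a parameter so that any of the libraries mentioned above could be plugged in.

```python
from typing import Callable, Dict, Iterable


def build_second_pinyin(
    control_descriptions: Iterable[str],
    global_commands: Iterable[str],
    to_pinyin: Callable[[str], str],
) -> Dict[str, str]:
    """Map each candidate's pinyin content to its description information."""
    candidates: Dict[str, str] = {}
    for text in list(control_descriptions) + list(global_commands):
        # Keep the first description seen for a given pinyin string.
        candidates.setdefault(to_pinyin(text), text)
    return candidates


# Toy converter for illustration only (hypothetical table). With pypinyin
# installed, something like `lambda t: " ".join(lazy_pinyin(t))` could be
# passed instead.
TOY_PINYIN = {
    "搜索框": "sou suo kuang",
    "上滑": "shang hua",
    "返回": "fan hui",
    "桌面": "zhuo mian",
}

candidates = build_second_pinyin(
    ["搜索框"], ["上滑", "返回", "桌面"], TOY_PINYIN.__getitem__
)
print(sorted(candidates))
```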
  • S120 Obtain third pinyin content when none of the second pinyin contents successfully matches the first pinyin content, where the third pinyin content is pinyin content similar to the first pinyin content.
  • after the first pinyin content and the second pinyin contents are obtained, it may be detected whether any of the plurality of second pinyin contents successfully matches the first pinyin content.
  • if a second pinyin content is exactly the same as the first pinyin content, it is determined that this second pinyin content successfully matches the first pinyin content.
  • for example, if the first pinyin content is "shao hua" and the second pinyin content currently being compared with it is "shang hua", then because "ao" in the first pinyin content differs from "ang" in the second pinyin content, it is determined that the first pinyin content "shao hua" does not match the second pinyin content "shang hua".
  • a plurality of phoneme correspondences can be obtained, each of which represents a pair of similar phonemes. A phoneme in the first pinyin content that has a phoneme correspondence is used as a specified phoneme, and the similar phonemes corresponding to the specified phoneme are determined based on the phoneme correspondences.
  • a phoneme (phone) is the smallest unit of speech divided according to the natural properties of speech, and a pronunciation action forms a phoneme.
  • phonemes can be divided into initials and finals.
  • according to the notation rules of Chinese pinyin, when the final i or a compound final beginning with i (such as i, ia, ie, iao, iou, ian, in, iang, iong) has no initial, it is written with a leading y; for example, i is written as yi.
  • finals beginning with ü are written with u in forms such as yue, yuan, yun, ju, qu, and xu, whereas when the initial is n or l the ü is kept, as in nü and lü. Therefore, in some cases, u can be used instead of ü.
  • a phoneme expansion table as shown in Table 1 can be formed by combining the notation rules of Chinese Pinyin and common mistakes in Chinese pronunciation.
  • for example, suppose the phonemes included in the first pinyin content are sh, ao, h, and ua, and the phoneme expansion table contains the correspondences [sh, s], [sh, c], [sh, xi], [sh, zh], [ao, ou], [ao, iao], [ao, ang], and [h, f]. Then sh, ao, and h can be used as the specified phonemes, and based on these phoneme correspondences the similar phonemes corresponding to the specified phonemes are determined as: s, c, xi, zh, ou, iao, ang, f.
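  • the lookup of specified phonemes and their similar phonemes can be sketched as follows. The table holds only the correspondences quoted in the example above, not the full Table 1, and the syllable-splitting helper with its list of initials is an assumption made for illustration.

```python
from typing import Dict, List

# Excerpt of the phoneme expansion table: only the pairs quoted in the text.
SIMILAR: Dict[str, List[str]] = {
    "sh": ["s", "c", "xi", "zh"],
    "ao": ["ou", "iao", "ang"],
    "h": ["f"],
}

# Pinyin initials, two-letter ones first so that "sh" wins over "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable: str) -> List[str]:
    """Split one pinyin syllable into [initial, final], or [final] if no initial."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]


def phonemes_of(pinyin: str) -> List[str]:
    """Flatten a space-separated pinyin string into its phonemes."""
    return [p for syllable in pinyin.split() for p in split_syllable(syllable)]


def specified_phonemes(pinyin: str) -> List[str]:
    """Phonemes of the input that have an entry in the expansion table."""
    return [p for p in phonemes_of(pinyin) if p in SIMILAR]


def similar_phonemes(pinyin: str) -> List[str]:
    """All similar phonemes of the specified phonemes, in table order."""
    return [s for p in specified_phonemes(pinyin) for s in SIMILAR[p]]


print(phonemes_of("shao hua"))
print(specified_phonemes("shao hua"))
print(similar_phonemes("shao hua"))
```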
  • the pinyin content similar to the first pinyin content can also be obtained as the third pinyin content as a whole.
  • the features of the pinyin content corresponding to multiple words can be obtained directly in advance as reference features. After the first pinyin content is obtained, its features can be obtained in the same way and compared with the pre-acquired reference features, and the pinyin content corresponding to a successfully compared reference feature is used as the third pinyin content.
  • a successfully compared reference feature is one that is the same as the feature of the first pinyin content.
  • the related methods of acquiring data features can be applied to acquire features of Pinyin content, and the specific way of acquiring features of Pinyin content is not specifically limited in this embodiment of the present application.
  • the features of pinyin content can be obtained by means of text vectors.
  • S130 Match the third pinyin content against the plurality of second pinyin contents, and use the description information whose second pinyin content successfully matches the third pinyin content as the target description information.
  • for example, the third pinyin content can be: {"sao hua", "cao hua", "xiao hua", "zhao hua", "shou hua", "shiao hua", "shang hua", "shao fua"}, and the second pinyin contents can be: {"feng huo kang da" (Fenghuo Kangda), "ao yun ji jin" (Olympic Highlights), "gu du de mei shi jia di ba ji" (Season 8 of The Lonely Gourmet), ..., "zuo hua", "you hua", "shang hua", "xia hua", "fan hui", "zhuo mian", "shuang ji", "chang an"}. Matching the third pinyin content against the second pinyin contents then yields the target description information "shang hua".
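  • the two-stage matching (exact match first, then the similar pinyin contents) can be sketched as below. The function name and the shape of the candidate map are assumptions; the data are the values from the example above.

```python
from typing import Dict, Iterable, Optional


def find_target(first: str, third: Iterable[str],
                candidates: Dict[str, str]) -> Optional[str]:
    """Return the description whose pinyin matches, or None if nothing matches.

    `candidates` maps each second pinyin content to its description information.
    """
    # Stage 1: exact match of the first pinyin content.
    if first in candidates:
        return candidates[first]
    # Stage 2: try each similar (third) pinyin content in turn.
    for pinyin in third:
        if pinyin in candidates:
            return candidates[pinyin]
    return None


candidates = {
    "feng huo kang da": "Fenghuo Kangda",
    "zuo hua": "swipe left",
    "you hua": "swipe right",
    "shang hua": "swipe up",
    "xia hua": "swipe down",
    "fan hui": "return",
}
third = ["sao hua", "cao hua", "xiao hua", "zhao hua",
         "shou hua", "shiao hua", "shang hua", "shao fua"]

print(find_target("shao hua", third, candidates))
```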
  • S140 Execute a control operation corresponding to the target description information.
  • the target description information can be the description information corresponding to a control in the target interface. In that case, the control operation corresponding to the target description information can be executed on the electronic device by means of event injection or simulated click, in combination with the user intent and the object's auxiliary information in the triple corresponding to the control to which the target description information belongs. For example, the target description information may be "sou suo kuang" (search box).
  • the target description information may be description information corresponding to an overall interface operation instruction. For example: if the target description information is "shang hua", the operation of swiping up can be directly performed on the electronic device.
  • the method obtains the pinyin content corresponding to the voice control instruction as the first pinyin content and the pinyin content of the descriptive information to be selected as a plurality of second pinyin contents. If it is determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content; the third pinyin content is then matched against the plurality of second pinyin contents, the description information whose second pinyin content successfully matches the third pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed.
  • in this way, after the pinyin content directly converted from the voice control instruction is obtained, if it cannot be successfully matched with the pinyin content of the descriptive information to be selected, similar pinyin content derived from the directly converted content can be matched instead. This raises the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps improve the probability of accurately executing voice control.
  • in addition, a confusion expansion table of initials and finals is established, and pinyin that cannot be matched exactly is fuzzily expanded and then matched. This addresses homophone errors in the speech recognition process and can also effectively handle speech recognition errors caused by the user's non-standard pronunciation.
  • a voice control method provided by the present application, the method includes:
  • S210 Obtain first pinyin content and a plurality of second pinyin contents, where the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of the descriptive information to be selected, and the descriptive information is information used to describe the corresponding operation.
  • S220 Obtain a similar phoneme corresponding to a specified phoneme in the first pinyin content.
  • the specified phonemes in the first pinyin content can be replaced with each of the multiple similar phonemes, and the first pinyin content after phoneme replacement, one for each similar phoneme, is used as the third pinyin content.
  • for example, if the first pinyin content is "shao hua", then according to Table 1 its specified phonemes can be sh, ao, and h, where the similar phonemes corresponding to sh are s, c, xi, and zh, the similar phonemes corresponding to ao are ou, iao, and ang, and the similar phoneme corresponding to h is f.
  • S231 Combine similar phonemes corresponding to at least two specified phonemes with each other to obtain multiple phoneme pairs, where each phoneme pair includes a similar phoneme corresponding to each of the at least two specified phonemes.
  • similar phonemes corresponding to at least two specified phonemes may be combined with each other according to the combination manner shown in FIG. 7 .
  • for example, if specified phoneme A corresponds to similar phonemes O, P, Q, specified phoneme B corresponds to similar phonemes R, S, T, and the first pinyin content is ABC, then each similar phoneme of specified phoneme A can be combined one by one with all similar phonemes of specified phoneme B to obtain the following phoneme pairs: OR, OS, OT, PR, PS, PT, QR, QS, QT.
  • as another example, if the first pinyin content is "shao hua", the similar phonemes corresponding to the specified phonemes sh and ao can be combined in the manner shown in FIG. 7 to obtain the following phoneme pairs: sou, siao, sang, cou, ciao, cang, xiou, ..., zhang.
  • S232 Replace the corresponding specified phonemes in the first pinyin content with each of the plurality of phoneme pairs to obtain the first replaced pinyin content corresponding to each phoneme pair.
  • for example, replacing the corresponding specified phonemes in the first pinyin content ABC with the phoneme pairs above gives the first replaced pinyin contents: ORC, OSC, OTC, PRC, PSC, PTC, QRC, QSC, QTC.
  • likewise, if the phoneme pair is sou, the corresponding first replaced pinyin content is "sou hua", and if the phoneme pair is cang, the corresponding first replaced pinyin content is "cang hua".
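  • steps S231 and S232 amount to a Cartesian product over the similar phonemes of the chosen specified phonemes. The sketch below assumes the first pinyin content is held as a list of phoneme tokens (with a space token between syllables); that representation and the function name are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence


def pair_replacements(tokens: Sequence[str], table: Dict[str, List[str]],
                      positions: Sequence[int]) -> List[str]:
    """Jointly replace the specified phonemes at `positions`.

    Each result substitutes one similar phoneme for every chosen position,
    covering all combinations (the phoneme pairs of S231).
    """
    options = [table[tokens[i]] for i in positions]
    results = []
    for combo in product(*options):
        replaced = list(tokens)
        for index, phoneme in zip(positions, combo):
            replaced[index] = phoneme
        results.append("".join(replaced))
    return results


# Abstract example from the text: A -> O, P, Q and B -> R, S, T.
abc = pair_replacements(["A", "B", "C"],
                        {"A": ["O", "P", "Q"], "B": ["R", "S", "T"]}, [0, 1])
print(abc)

# "shao hua" with sh and ao replaced jointly.
table = {"sh": ["s", "c", "xi", "zh"], "ao": ["ou", "iao", "ang"], "h": ["f"]}
shao_hua = pair_replacements(["sh", "ao", " ", "h", "ua"], table, [0, 1])
print(shao_hua)
```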
  • S233 Replace the corresponding designated phonemes in the first pinyin content with similar phonemes corresponding to the plurality of designated phonemes to obtain a second replaced pinyin content corresponding to each designated phoneme.
  • the specified phonemes in the first pinyin content may be replaced in the manner shown in FIG. 8 to obtain the second replaced pinyin content corresponding to each specified phoneme.
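  • for example, if specified phoneme A corresponds to similar phonemes O, P, Q, specified phoneme B corresponds to similar phonemes R, S, T, and the first pinyin content is ABC, then specified phoneme A can be replaced with similar phonemes O, P, Q to obtain the second replaced pinyin contents OBC, PBC, QBC corresponding to specified phoneme A, and specified phoneme B can be replaced with similar phonemes R, S, T to obtain the second replaced pinyin contents ARC, ASC, ATC corresponding to specified phoneme B.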
  • as another example, if the first pinyin content is "shao hua" and its specified phonemes are sh, ao, and h, then the second replaced pinyin content corresponding to sh is {"sao hua", "cao hua", "xiao hua", "zhao hua"}, the second replaced pinyin content corresponding to ao is {"shou hua", "shiao hua", "shang hua"}, and the second replaced pinyin content corresponding to h is {"shao fua"}.
  • S234 Use the first replaced pinyin content and the second replaced pinyin content as the third pinyin content.
  • by combining both kinds of replacement, the set of pinyin contents similar to the first pinyin content is further expanded, so that the range available for matching against the second pinyin contents is enlarged, thereby increasing the probability of successful matching.
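  • the sketch below combines the single-phoneme replacements of S233 with joint replacements of S231 and S232 into one third pinyin content list. It assumes that joint replacement is applied to every pair of specified phonemes, which the text does not state explicitly; the token representation and the function name are likewise hypothetical.

```python
from itertools import combinations, product
from typing import Dict, List, Sequence


def third_pinyin_content(tokens: Sequence[str],
                         table: Dict[str, List[str]]) -> List[str]:
    """Single replacements first, then joint replacements of every pair."""
    positions = [i for i, phoneme in enumerate(tokens) if phoneme in table]
    # One-position groups give the second replaced pinyin content;
    # two-position groups give the first replaced pinyin content.
    groups = [(i,) for i in positions] + list(combinations(positions, 2))
    results: List[str] = []
    for group in groups:
        for combo in product(*(table[tokens[i]] for i in group)):
            replaced = list(tokens)
            for index, phoneme in zip(group, combo):
                replaced[index] = phoneme
            text = "".join(replaced)
            if text not in results:  # drop duplicates, keep order
                results.append(text)
    return results


table = {"sh": ["s", "c", "xi", "zh"], "ao": ["ou", "iao", "ang"], "h": ["f"]}
expanded = third_pinyin_content(["sh", "ao", " ", "h", "ua"], table)
print(expanded[:8])
print(len(expanded))
```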
  • S240 Match the third pinyin content against the plurality of second pinyin contents, and use the description information whose second pinyin content successfully matches the third pinyin content as the target description information.
  • with the above method, if the pinyin content directly converted from the voice control instruction cannot be successfully matched with the pinyin content of the descriptive information to be selected, similar pinyin content can be obtained from the directly converted content and matched against the pinyin content of the descriptive information to be selected. This raises the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps improve the probability of accurately executing voice control.
  • moreover, the similar phonemes of the specified phonemes can be obtained by querying the phoneme expansion table, and the third pinyin content can be obtained by replacing multiple specified phonemes with similar phonemes in various ways. Because the third pinyin content is a similarity-based expansion of the first pinyin content, the matching range is enlarged, the probability of successful matching is improved, and the probability of accurately executing voice control is further increased.
  • a voice control method provided by the present application is applied to electronic equipment, and the method includes:
  • S310 Obtain first pinyin content and multiple second pinyin content, the first pinyin content is the pinyin content corresponding to the acquired voice control command, and the multiple second pinyin content includes the pinyin of the descriptive information to be selected content, the description information is information used to describe the corresponding operation.
  • S320 Obtain a third pinyin content when the second pinyin content fails to match the first pinyin content, where the third pinyin content is pinyin content similar to the first pinyin content.
  • S330 Match the third pinyin content with the multiple second pinyin contents, and when a second pinyin content is successfully matched with the third pinyin content, use the description information of that second pinyin content as the target description information.
  • obtaining the similarities between a plurality of second pinyin content and the first pinyin content respectively, so as to obtain the corresponding similarity of each second pinyin content may include:
  • S341 Acquire first reference similarities between multiple second pinyin contents and the first pinyin contents based on the longest common subsequence, so as to obtain a first reference similarity corresponding to each second pinyin content.
  • the first reference similarity between a plurality of second pinyin content and the first pinyin content can be measured by the longest common subsequence (Longest Common Subsequence, LCS), and the calculation formula of LCS can be:
  • A_i can represent the string composed of the first i characters of string A, where i ranges from 0 to the length of string A.
  • B_j can represent the string composed of the first j characters of string B, where j ranges from 0 to the length of string B; a_i and b_j can represent the i-th character of A and the j-th character of B, respectively.
  • for example, string A can represent the first pinyin content and string B can represent one second pinyin content; if the length of the first pinyin content is 10 and the length of the second pinyin content is 9, then i ranges from 0 to 10 and j ranges from 0 to 9.
  • LCS similarity can be defined as:
  • where |A| and |B| can represent the lengths of strings A and B respectively, that is, the number of characters in A and in B.
  • for example, if string A is "APPLE13", then |A| = 7.
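As a hedged illustration of S341, the sketch below computes the longest common subsequence with the standard dynamic-programming recurrence over the prefixes A_i and B_j. The formulas of the application did not survive extraction, so the normalization used in `lcs_similarity` (LCS length divided by the longer string's length) is an assumption, as are the function names.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # table[i][j] is the LCS length of the first i characters of a
    # and the first j characters of b; row 0 and column 0 stay 0.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0
```

With this normalization the score lies between 0 and 1, and identical strings score 1.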
  • S342 Obtain second reference similarities between the plurality of second pinyin contents and the first pinyin contents based on edit distance, so as to obtain a second reference similarity corresponding to each second pinyin content.
  • the degree of difference between the multiple second pinyin contents and the first pinyin content can be measured by the edit distance (Levenshtein Distance, LEV).
  • the calculation formula of LEV can be:
  • A_i can represent the string composed of the first i characters of string A, where i ranges from 0 to the length of string A.
  • B_j can represent the string composed of the first j characters of string B, where j ranges from 0 to the length of string B.
  • for example, string A can represent the first pinyin content and string B can represent one second pinyin content; if the length of the first pinyin content is 10 and the length of the second pinyin content is 9, then i ranges from 0 to 10 and j ranges from 0 to 9.
  • if the 10th character of A is the same as the 9th character of B, LEV(A_10, B_9) = min{LEV(A_9, B_9) + 1, LEV(A_10, B_8) + 1, LEV(A_9, B_8)}
  • otherwise, LEV(A_10, B_9) = min{LEV(A_9, B_9) + 1, LEV(A_10, B_8) + 1, LEV(A_9, B_8) + 1}.
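The edit distance of S342 can be sketched with the standard Levenshtein recurrence, where each step takes the minimum over a deletion, an insertion, and a substitution (free when the two characters are equal). The function name `edit_distance` is an assumption; only two rows of the table are kept, which is a common space optimization rather than anything the application specifies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b."""
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete a[i-1]
                curr[j - 1] + 1,     # insert b[j-1]
                prev[j - 1] + cost,  # substitute, or keep when equal
            )
        prev = curr
    return prev[len(b)]
```

A smaller distance means the two pinyin strings are closer, so a distance has to be turned into a similarity before it can be added to the LCS-based score.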
  • the similarity corresponding to each second pinyin content can be obtained by directly adding the first reference similarity and the second reference similarity, and the calculation formula is as follows:
  • alternatively, weights can be assigned to the first reference similarity and the second reference similarity respectively, and the weighted first and second reference similarities can be added to obtain the similarity corresponding to each second pinyin content; the calculation formula is as follows:
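The two ways of combining the reference similarities can be sketched as one function, since direct addition is the weighted sum with both weights equal to 1. The function names, the default weights, and the tie tolerance are assumptions; the original formulas did not survive extraction.

```python
def combined_similarity(first_ref: float, second_ref: float,
                        w1: float = 1.0, w2: float = 1.0) -> float:
    """Weighted sum of the two reference similarities.
    With the default weights this is the direct addition."""
    return w1 * first_ref + w2 * second_ref

def best_candidates(scores: dict, tol: float = 1e-9) -> list:
    """Return every candidate whose similarity equals the maximum (within tol).
    More than one entry means the result is not unique."""
    if not scores:
        return []
    top = max(scores.values())
    return [name for name, score in scores.items() if abs(score - top) <= tol]
```

When `best_candidates` returns more than one entry, the string-based similarity alone cannot decide between the candidates, and a further criterion is needed to pick one.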
  • the descriptive information corresponding to the second pinyin content with the highest similarity is used as the target descriptive information, including:
  • abbreviations or shortened names may appear in the user's voice control instructions, in which case the longest common subsequence and the edit distance may yield several equally similar results.
  • for example, if the user's voice control command is "Fulian" and the second pinyin content set includes {"Avengers 4", "Copy a few couplets"}, the longest common subsequence of "Fulian" with each of the two candidates is "Fulian" itself and the edit distance to each is 4, so the calculated similarities are the same and a unique result cannot be determined.
  • as another example, if the user's voice control command is "B station" and the second pinyin content set includes { , Q Music, A Cloud Music, B Music}, no matching result can be obtained.
  • the similarities between the multiple second pinyin contents and the first pinyin contents can be measured based on the semantic similarity, so as to obtain the most similar second pinyin contents.
  • the text vector can be obtained through the pre-training model BERT.
  • BERT is a deep neural network; the text to be processed can be input into the encoder part of BERT to obtain the corresponding text vector.
  • the text input corresponding to the first text vector may be the text content corresponding to the voice control instruction obtained through the ASR module, the text content of the triplet obtained from the voice control instruction through the ASR module and the NLP module, or the text content corresponding to the third pinyin content.
  • S353 Obtain multiple text vectors corresponding to the description information corresponding to the second pinyin content with the highest similarity, so as to obtain multiple second text vectors.
  • the text input corresponding to the second text vector may be the respective text description information of multiple controls in the target interface obtained through the system program, or may be the text description information of the overall operation instruction of the interface, For example: swipe left, swipe right, swipe up, swipe down, back, desktop, double click, long press, etc.
  • the text input corresponding to the text vector can be a Chinese character string or a pinyin string.
  • text vectors can also be obtained through tools such as Doc2Vec (document-to-vector), or through open-source pre-trained models such as RoBERTa, UniLM, ELECTRA, and XLNet.
  • S354 Calculate respectively the vector distances between the multiple second text vectors and the first text vector.
  • the vector distance between each second text vector and the first text vector is calculated by cosine similarity, and the calculation formula is as follows:
  • the multiple vector distances can be sorted by magnitude, and the description information corresponding to the second text vector with the smallest vector distance is used as the target description information.
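S354 and the selection step can be sketched as below. The vectors are plain lists of floats standing in for text vectors produced by an encoder such as BERT (the encoder itself is not shown), and taking the distance as one minus the cosine similarity is an assumption, because the original formula did not survive extraction. The function names are hypothetical.

```python
import math

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def closest_description(first_vector, candidates: dict) -> str:
    """candidates maps description information to its second text vector.
    Returns the description whose vector has the smallest distance to
    first_vector, with distance assumed to be 1 - cosine similarity."""
    return min(
        candidates,
        key=lambda name: 1.0 - cosine_similarity(first_vector, candidates[name]),
    )
```

For example, `closest_description([1.0, 0.0], {"back": [0.0, 1.0], "desktop": [0.9, 0.1]})` selects "desktop", because its vector points in almost the same direction as the first text vector.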
  • the description information corresponding to the unique second text vector can be determined as the target description information.
  • the vector distances between the multiple second text vectors and the first text vector can be calculated to obtain a unique matching result and the corresponding target description information, so that the control operation corresponding to the target description information can be executed, which further improves the success rate of semantic recognition.
  • the text vector corresponding to the first pinyin content may also be obtained as the first text vector.
  • the text vector corresponding to the third pinyin content may also be acquired as the first text vector.
  • for example, the multiple first text vectors obtained based on the third pinyin content include the first text vector L1, the first text vector L2, and the first text vector L3, and the multiple second text vectors include the second text vector L4 and the second text vector L5.
  • in this case, the distances between the first text vector L1 and each of the second text vectors L4 and L5, between the first text vector L2 and each of the second text vectors L4 and L5, and between the first text vector L3 and each of the second text vectors L4 and L5 are calculated.
  • With the above method, when the pinyin content directly converted from the voice control instruction cannot be successfully matched with the pinyin content of the description information to be selected, similar pinyin content can be obtained from the directly converted content and matched against the pinyin content of the description information to be selected. This raises the probability that a voice control instruction triggered by the user is matched to description information, which in turn helps improve the probability of accurately executing voice control.
  • the similarities between the multiple second pinyin contents and the first pinyin content can be obtained respectively, to obtain the similarity corresponding to each second pinyin content, and the description information corresponding to the second pinyin content with the highest similarity is used as the target description information. This addresses the matching difficulty caused by the user's description of an interface control omitting or modifying words, as well as the matching difficulty caused by the user referring to a control through an abbreviation or alias, so that the control operation corresponding to the target description information can be performed and the probability of accurately performing voice control is improved.
  • this scheme uses semantic similarity to match voice control instructions with description information: the instruction text to be matched (the text converted from the voice control instruction) is vectorized through a large-scale pre-trained model, and matching is performed on the similarity of the vectors, which addresses cases where the voice control instruction and the description information differ considerably in wording but have the same meaning.
  • the first pinyin content can be matched with the multiple second pinyin contents. When a second pinyin content is successfully matched with the first pinyin content, the description information of that second pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed; when no second pinyin content is successfully matched with the first pinyin content, the operation of obtaining the third pinyin content can be performed.
  • the third pinyin content can be matched with the multiple second pinyin contents. If a second pinyin content is successfully matched with the third pinyin content, the description information of that second pinyin content is used as the target description information; when no second pinyin content is successfully matched with the third pinyin content, step S4090 can be executed to obtain the similarities between the multiple second pinyin contents and the first pinyin content respectively, so as to obtain the similarity corresponding to each second pinyin content, and the description information corresponding to the second pinyin content with the highest similarity is then used as the target description information, and the control operation corresponding to the target description information is executed.
  • the reference similarities between the multiple second pinyin contents and the first pinyin content can be obtained based on the longest common subsequence and the edit distance, so as to obtain the similarity corresponding to each second pinyin content. If exactly one second pinyin content has the highest similarity, the description information corresponding to that second pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed. If multiple second pinyin contents share the highest similarity, the text vector of the text content corresponding to the voice control instruction can be obtained as the first text vector, and the text vectors of the description information corresponding to the second pinyin contents with the highest similarity can be obtained as multiple second text vectors; the vector distances between the multiple second text vectors and the first text vector are then calculated respectively, the description information corresponding to the second text vector with the smallest vector distance is used as the target description information, and the control operation corresponding to the target description information is executed.
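The overall flow just described can be sketched as a cascade. This is a simplified illustration under stated assumptions: "successful matching" is taken to mean exact string equality, the phoneme expansion and the similarity measure are passed in as callables (`expand` and `similarity` are hypothetical parameter names), and the final semantic tie-break is left to the caller.

```python
def resolve_target(first, seconds, expand, similarity):
    """Return the candidate target description information.

    first:      first pinyin content (a string).
    seconds:    dict mapping each second pinyin content to its description information.
    expand:     callable returning the third pinyin contents derived from `first`.
    similarity: callable(first, second) returning a similarity score.

    A result with more than one entry means the similarities tied and a
    further step (for example text-vector distance) must pick one.
    """
    if not seconds:
        return []
    # Step 1: match the first pinyin content directly (assumed: exact equality).
    if first in seconds:
        return [seconds[first]]
    # Step 2: match each third pinyin content against the second pinyin contents.
    for third in expand(first):
        if third in seconds:
            return [seconds[third]]
    # Step 3: fall back to similarity and keep every candidate with the top score.
    scores = {second: similarity(first, second) for second in seconds}
    top = max(scores.values())
    return [seconds[second] for second, score in scores.items() if score == top]
```

Ordering the steps from cheapest to most expensive means the similarity computation, and any text-vector comparison after it, only runs when the direct and expanded matches have both failed.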
  • a voice control device 600 provided by the present application, the device 600 includes:
  • the first pinyin content and second pinyin content acquisition unit 610 is configured to acquire the first pinyin content and multiple second pinyin contents, where the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the multiple second pinyin contents include the pinyin content of the description information to be selected, and the description information is information used to describe a corresponding operation.
  • the third pinyin content acquiring unit 620 is configured to acquire a third pinyin content when the second pinyin content fails to match the first pinyin content, and the third pinyin content is a pinyin content similar to the first pinyin content .
  • a pinyin content matching unit 630 configured to match the third pinyin content with the multiple second pinyin contents, and to use the description information of the second pinyin content that is successfully matched with the third pinyin content as the target description information.
  • the control operation executing unit 640 is configured to execute the control operation corresponding to the target description information.
  • the first pinyin content and the second pinyin content acquisition unit 610 is specifically configured to acquire the description information of multiple controls included in the target interface as description information to be selected; convert the description information to be selected into corresponding Pinyin content to get multiple second pinyin content.
  • the third pinyin content acquisition unit 620 is specifically configured to acquire the similar phoneme corresponding to a specified phoneme in the first pinyin content, and to replace the specified phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content. There may be multiple similar phonemes.
  • the third pinyin content acquisition unit 620 is specifically configured to replace the specified phoneme in the first pinyin content with each of the multiple similar phonemes respectively, to obtain, for each similar phoneme, the first pinyin content after phoneme replacement, as the third pinyin content.
  • the third pinyin content acquisition unit 620 is specifically configured to combine the similar phonemes corresponding to at least two specified phonemes to obtain a plurality of phoneme pairs, wherein each phoneme pair includes one similar phoneme for each of the at least two specified phonemes; replace the corresponding specified phonemes in the first pinyin content based on each of the plurality of phoneme pairs, to obtain first replacement pinyin content corresponding to each phoneme pair; replace the corresponding specified phoneme in the first pinyin content with each similar phoneme individually, to obtain second replacement pinyin content corresponding to each specified phoneme; and use the first replacement pinyin content and the second replacement pinyin content as the third pinyin content.
  • the third pinyin content acquisition unit 620 is specifically configured to query whether the phonemes included in the first pinyin content have corresponding phoneme correspondences in the phoneme expansion table, where each phoneme correspondence represents a pair of similar phonemes; determine a phoneme that has a phoneme correspondence as a specified phoneme; and determine the similar phoneme corresponding to the specified phoneme based on the phoneme correspondence.
  • the pinyin content matching unit 630 is specifically configured to match the first pinyin content with the multiple second pinyin contents, and, when no second pinyin content is successfully matched with the first pinyin content, to obtain the third pinyin content.
  • the pinyin content matching unit 630 is specifically configured to, when a second pinyin content is successfully matched with the first pinyin content, use the description information of that second pinyin content as the target description information, and execute the control operation corresponding to the target description information.
  • the pinyin content matching unit 630 is specifically configured to match the third pinyin content with the multiple second pinyin contents; when a second pinyin content is successfully matched with the third pinyin content, use the description information of that second pinyin content as the target description information; when no second pinyin content is successfully matched with the third pinyin content, obtain the similarities between the multiple second pinyin contents and the first pinyin content respectively, so as to obtain the similarity corresponding to each second pinyin content; and use the description information corresponding to the second pinyin content with the highest similarity as the target description information.
  • the pinyin content matching unit 630 is specifically configured to obtain first reference similarities between the multiple second pinyin contents and the first pinyin content based on the longest common subsequence, so as to obtain the first reference similarity corresponding to each second pinyin content; obtain second reference similarities between the multiple second pinyin contents and the first pinyin content based on the edit distance, so as to obtain the second reference similarity corresponding to each second pinyin content; and add the first reference similarity and the second reference similarity corresponding to each second pinyin content to obtain the similarity corresponding to each second pinyin content.
  • the pinyin content matching unit 630 is specifically configured to, if exactly one second pinyin content has the highest similarity, use the description information corresponding to that second pinyin content as the target description information; and, if multiple second pinyin contents share the highest similarity, obtain the text vector of the text content corresponding to the voice control instruction as the first text vector, obtain the text vectors of the description information corresponding to the second pinyin contents with the highest similarity so as to obtain multiple second text vectors, calculate the vector distances between the multiple second text vectors and the first text vector respectively, and use the description information corresponding to the second text vector with the smallest vector distance as the target description information.
  • an embodiment of the present application also provides an electronic device 1000 capable of executing the aforementioned voice control method.
  • the electronic device 1000 includes one or more (only one is shown in the figure) processors 102 , a memory 104 , a camera 106 and an audio collection device 108 coupled to each other.
  • the memory 104 stores programs capable of executing the contents of the foregoing embodiments, and the processor 102 can execute the programs stored in the memory 104 .
  • the processor 102 may include one or more processing cores.
  • the processor 102 uses various interfaces and circuits to connect the various parts of the entire electronic device 1000, and performs the various functions of the electronic device 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 104 and calling the data stored in the memory 104.
  • the processor 102 may be implemented in at least one hardware form among Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • the processor 102 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like.
  • the CPU mainly handles the operating system, user interface and application programs, etc.
  • the GPU is used to render and draw the displayed content
  • the modem is used to handle wireless communication.
  • the processor 102 may be a neural network chip.
  • it may be an embedded neural network chip (NPU).
  • the memory 104 may include random access memory (Random Access Memory, RAM), and may also include read-only memory (Read-Only Memory). Memory 104 may be used to store instructions, programs, codes, sets of codes, or sets of instructions. For example, a device may be stored in memory 104 . The device may be the aforementioned device 600 .
  • the memory 104 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.) , instructions for implementing the following method embodiments, and the like.
  • the electronic device 1000 may further include a network module 110 and a sensor module 112 in addition to the aforementioned devices.
  • the network module 110 is used to implement information interaction between the electronic device 1000 and other devices, for example, transmitting device control instructions, manipulation request instructions, and status information acquisition instructions. When the electronic device 1000 is embodied as a different type of device, its corresponding network module 110 may differ.
  • the sensor module 112 may include at least one sensor. Specifically, the sensor module 112 may include, but is not limited to: a level, a light sensor, a motion sensor, a pressure sensor, an infrared heat sensor, a distance sensor, an acceleration sensor, and other sensors.
  • the pressure sensor may be a sensor for detecting pressure generated by pressing on the electronic device 1000. That is, the pressure sensor detects pressure generated by contact or pressing between the user and the electronic device, e.g., contact or pressing between the user's ear and the mobile terminal. Therefore, the pressure sensor can be used to determine whether contact or pressing occurs between the user and the electronic device 1000, and the magnitude of the pressure.
  • the acceleration sensor can detect the magnitude of acceleration in various directions (generally three axes), and can detect the magnitude and direction of gravity when stationary. It can be used in applications that identify the attitude of the electronic device 1000 (such as switching between landscape and portrait screens, related games, and magnetometer attitude calibration), in vibration-recognition functions (such as a pedometer or tap detection), and so on.
  • the electronic device 1000 may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, and a thermometer, which will not be repeated here.
  • the audio collection device 108 is configured to collect audio signals.
  • the audio collection device 108 may include multiple audio collection components, and the audio collection components may be microphones.
  • the network module of the electronic device 1000 is a radio frequency module, and the radio frequency module is used to receive and send electromagnetic waves, realize mutual conversion between electromagnetic waves and electrical signals, and communicate with a communication network or other devices.
  • the radio frequency module may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so on.
  • the radio frequency module can interact with external devices by sending or receiving electromagnetic waves.
  • a radio frequency module can send instructions to a target device.
  • FIG. 15 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • Program codes are stored in the computer-readable storage medium 800, and the program codes can be invoked by a processor to execute the methods described in the foregoing method embodiments.
  • the computer readable storage medium 800 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium (non-transitory computer-readable storage medium).
  • the computer-readable storage medium 800 has a storage space for program code 810 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products.
  • Program code 810 may, for example, be compressed in a suitable form.
  • the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content, and the pinyin content of the description information to be selected is obtained as the second pinyin content.
  • in this way, after the pinyin content directly converted from the voice control instruction is obtained, if the directly converted content cannot be successfully matched with the pinyin content of the description information to be selected, similar pinyin content derived from the directly converted content can be matched with the pinyin content of the description information to be selected, thereby improving the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps improve the probability of accurately executing voice control.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

A speech control method and apparatus, an electronic device, and a readable storage medium. The method comprises: obtaining first Pinyin content, and obtaining a plurality of pieces of second Pinyin content, the first Pinyin content being Pinyin content corresponding to an obtained speech control instruction, the plurality of pieces of second Pinyin content comprising Pinyin content of description information to be selected, and the description information being information for describing a corresponding operation (S110); when the second Pinyin content and the first Pinyin content are not successfully matched, obtaining third Pinyin content, the third Pinyin content being Pinyin content similar to the first Pinyin content (S120); matching the third Pinyin content with the plurality of pieces of second Pinyin content, and using, as target description information, description information that a corresponding piece of second Pinyin content successfully matches the third Pinyin content (S130); and executing a control operation corresponding to the target description information (S140).

Description

Voice control method, device, electronic device and readable storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese application No. 202111296079.3, filed on November 03, 2021, the entire contents of which are hereby incorporated by reference for all purposes.
Technical Field
The present application relates to the field of computer technology, and more specifically, to a voice control method, device, electronic device and readable storage medium.
Background
By combining artificial intelligence technology with a virtual personal assistant (voice assistant), an electronic device can receive voice control instructions issued by the user through the auditory modality, so that the electronic device can be controlled by voice. However, in related voice control processes, the probability of accurately executing voice control still needs to be improved.
发明内容Contents of the invention
鉴于上述问题,本申请提出了一种语音控制方法、装置、电子设备以及可读存储介质,以实现改善上述问题。In view of the above problems, the present application proposes a voice control method, device, electronic equipment and readable storage medium, so as to improve the above problems.
第一方面,本申请提供了一种语音控制方法,所述方法包括:获取第一拼音内容以及获取多个第二拼音内容,所述第一拼音内容为所获取的语音控制指令对应的拼音内容,所述多个第二拼音内容包括待选的描述信息的拼音内容,所述描述信息为用于描述对应操作的信息;第二拼音内容与所述第一拼音内容未成功匹配时获取第三拼音内容,所述第三拼音内容为与所述第一拼音内容相似的拼音内容;将所述第三拼音内容与所述多个第二拼音内容进行匹配,并将对应的第二拼音内容与所述第三拼音内容成功匹配的描述信息作为目标描述信息;执行所述目标描述信息对应控制操作。In a first aspect, the present application provides a voice control method, the method comprising: acquiring a first pinyin content and acquiring a plurality of second pinyin content, the first pinyin content being the pinyin content corresponding to the acquired voice control instruction , the plurality of second pinyin contents include pinyin contents of descriptive information to be selected, and the descriptive information is information used to describe corresponding operations; when the second pinyin contents fail to match with the first pinyin contents, the third Pinyin content, the third pinyin content is pinyin content similar to the first pinyin content; the third pinyin content is matched with the plurality of second pinyin content, and the corresponding second pinyin content is matched with The description information successfully matched by the third pinyin content is used as the target description information; and the control operation corresponding to the target description information is executed.
第二方面,本申请提供了一种语音控制装置,所述装置包括:第一拼音内容以及第二拼音内容获取单元,用于获取第一拼音内容以及获取多个第二拼音内容,所述第一拼音内容为所获取的语音控制指令对应的拼音内容,所述多个第二拼音内容包括待选的描述信息的拼音内容,所述描述信息为用于描述对应操作的信息;第三拼音内容获取单元,用于第二拼音内容与所述第一拼音内容未成功匹配时获取第三拼音内容,所述第三拼音内容为与所述第一拼音内容相似的拼音内容;拼音内容匹配单元,用于将所述第三拼音内容与所述多个第二拼音内容进行匹配,并将对应的第二拼音内容与所述第三拼音内容成功匹配的描述信息作为目标描述信息;控制操作执行单元,用于执行所述目标描述信息对应控制操作。In a second aspect, the present application provides a voice control device, which includes: a first pinyin content and a second pinyin content acquiring unit, configured to acquire the first pinyin content and a plurality of second pinyin content, the first The first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the multiple second pinyin content includes the pinyin content of the descriptive information to be selected, and the descriptive information is information used to describe the corresponding operation; the third pinyin content The obtaining unit is used to obtain the third pinyin content when the second pinyin content is not successfully matched with the first pinyin content, and the third pinyin content is a pinyin content similar to the first pinyin content; the pinyin content matching unit, For matching the third pinyin content with the plurality of second pinyin content, and using the description information successfully matching the corresponding second pinyin content with the third pinyin content as the target description information; the control operation execution unit , for executing a control operation corresponding to the target description information.
In a third aspect, the present application provides an electronic device comprising one or more processors and a memory, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method described above.
In a fourth aspect, the present application provides a computer-readable storage medium storing program code, wherein the method described above is performed when the program code is run.
In a fifth aspect, the present application provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the method described above.
Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of a voice control method proposed in an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of another voice control method proposed in an embodiment of the present application;
FIG. 3 is a flowchart of a voice control method proposed in an embodiment of the present application;
FIG. 4 is a flowchart of one implementation of S120 in FIG. 3;
FIG. 5 is a flowchart of a voice control method proposed in another embodiment of the present application;
FIG. 6 is a flowchart of one implementation of S230 in FIG. 5;
FIG. 7 is a schematic diagram of obtaining the first replacement pinyin content corresponding to each phoneme pair, as proposed in the present application;
FIG. 8 is a schematic diagram of obtaining the second replacement pinyin content corresponding to each specified phoneme, as proposed in the present application;
FIG. 9 is a flowchart of a voice control method proposed in yet another embodiment of the present application;
FIG. 10 is a flowchart of one implementation of S340 in FIG. 9;
FIG. 11 is a flowchart of one implementation of S350 in FIG. 9;
FIG. 12 is a schematic diagram of the implementation flow of a voice control method proposed in the present application;
FIG. 13 is a structural block diagram of a voice control apparatus proposed in an embodiment of the present application;
FIG. 14 is a structural block diagram of an electronic device proposed in the present application;
FIG. 15 shows a storage unit of an embodiment of the present application for storing or carrying program code that implements the voice control method according to the embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application, without creative effort, fall within the scope of protection of the present application.
By combining artificial intelligence technology with a virtual personal assistant (voice assistant), an electronic device can receive a user's voice control instruction through the auditory modality, convert the instruction into text using automatic speech recognition (ASR), and then perform subsequent understanding and mapping, thereby enabling voice control of the electronic device.
However, the inventors found that in related voice control processes, owing to the combined influence of factors such as regional accents, language habits, and noise interference while an instruction is being given, the probability of executing voice control accurately still needs to be improved. For example, a user's voice control instruction may be recognized as a similar-sounding character string: "上划" (shang hua, swipe up) may be recognized as "韶华" (shao hua) or as "笑话" (xiao hua).
Therefore, the inventors propose the voice control method, apparatus, electronic device, and computer program product of the present application. In the method, the pinyin content corresponding to the voice control instruction is acquired as the first pinyin content, and the pinyin content of the candidate description information is acquired as the plurality of second pinyin contents. If it is determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is acquired as the third pinyin content. The third pinyin content is then matched against the plurality of second pinyin contents, the description information whose corresponding second pinyin content successfully matches the third pinyin content is taken as the target description information, and the control operation corresponding to the target description information is executed.
In this way, after the pinyin content converted directly from the voice control instruction is obtained, if that content cannot be successfully matched with the pinyin content of the candidate description information, similar pinyin content can be derived from it and matched against the pinyin content of the candidate description information. This raises the probability that a user-triggered voice control instruction is successfully matched to description information, which in turn helps raise the probability of executing voice control accurately.
The application scenarios involved in the embodiments of the present application are introduced first.
In the embodiments of the present application, the provided voice control method may be executed by an electronic device. In this case, all steps of the method may be executed by the electronic device. For example, as shown in FIG. 1, a voice collection apparatus of the electronic device 100 may collect a voice control instruction and transmit the collected voice control instruction, together with the candidate description information, to a processor, so that the processor can obtain the first pinyin content and the plurality of second pinyin contents. The processor then uses the first pinyin content, the plurality of second pinyin contents, and the pinyin content similar to the first pinyin content (the third pinyin content) to execute the steps of the voice control method provided in the present application.
Alternatively, the voice control method provided in the embodiments of the present application may be executed by a server. In this case, the electronic device collects the voice control instruction and sends it synchronously to the server; the server executes the method to determine the target description information and then generates an operation instruction according to the target description information. The method may also be executed cooperatively by the electronic device and the server, in which case some steps of the method are executed by the electronic device and the remaining steps by the server.
For example, as shown in FIG. 2, the electronic device 100 may execute the step of acquiring the first pinyin content and the plurality of second pinyin contents, and the server 200 may execute the subsequent steps. Note that in cooperative execution, the division of steps between the electronic device and the server is not limited to this example; in practice, the steps executed by each may be adjusted dynamically according to actual conditions.
The embodiments of the present application are described below with reference to the drawings.
Referring to FIG. 3, the present application provides a voice control method, the method comprising:
S110: Acquire first pinyin content and a plurality of second pinyin contents, the first pinyin content being the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents including the pinyin content of candidate description information, and the description information being information used to describe a corresponding operation.
In the embodiments of the present application, a user can express a control intention by voice, and the electronic device can accordingly treat the speech uttered by the user as a voice control instruction. Optionally, an instruction here is one by which the user manipulates the interactive interface being viewed or an element on that interface. Voice control instructions may include: swipe left, open Toutiao (今日头条), Bilibili (哔哩哔哩), play XXX, second row third item, back, swipe up, install Douyin (抖音), next track, desktop, and so on.
In one approach, the first pinyin content can be obtained using automatic speech recognition (ASR) and natural language processing (NLP).
Optionally, after acquiring the voice control instruction, the electronic device may pass it to an ASR module to obtain the corresponding instruction text, and then take the pinyin content of the instruction text as the first pinyin content. Optionally, after the instruction text is obtained, an NLP module may further extract the user intent, the control object, and the object's auxiliary information from the instruction text and combine them into a triple of the form {action, object, information}, where action denotes the user intent, object denotes the control object, and information denotes the object's auxiliary information.
In the triple, the user intent is the operation the user wishes to perform, such as tap, swipe, or long press. Auxiliary information is information that may accompany the control object; for example, in a text-entry operation the text box is the control object and the text to be entered is the auxiliary information. Note that the control object and the auxiliary information are not always present. When the instruction text is converted into a triple, the pinyin of the control object in the triple may be taken as the first pinyin content; if the control object is empty, the pinyin of the user intent may be taken as the first pinyin content instead. For example, for the voice control instruction "打开今日头条" (open Toutiao), the ASR and NLP modules may yield the triple {点击 (tap), 今日头条 (Toutiao), Φ}: the user intent is "tap", the control object is "今日头条", and the auxiliary information is empty, so the first pinyin content is "jin ri tou tiao". As another example, for the instruction "上划" (swipe up), the modules may yield the triple {上划, Φ, Φ}: the user intent is "swipe up" and both the control object and the auxiliary information are empty, so the first pinyin content is "shang hua".
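The selection rule above (use the control object's pinyin, falling back to the user intent when the object is empty) can be sketched as follows. This is only an illustration: the `Command` type, the `first_pinyin_content` function, and the small `PINYIN` lookup table are hypothetical names standing in for the NLP module output and a real text-to-pinyin converter.

```python
from typing import NamedTuple, Optional

# Hypothetical lookup table standing in for a real text-to-pinyin converter.
PINYIN = {
    "今日头条": "jin ri tou tiao",
    "上划": "shang hua",
    "点击": "dian ji",
}

class Command(NamedTuple):
    """The {action, object, information} triple; empty slots are None."""
    action: str
    obj: Optional[str] = None
    information: Optional[str] = None

def first_pinyin_content(cmd: Command) -> str:
    # Prefer the control object; fall back to the user intent when it is empty.
    source = cmd.obj if cmd.obj else cmd.action
    return PINYIN[source]

print(first_pinyin_content(Command("点击", "今日头条")))
print(first_pinyin_content(Command("上划")))
```

The two calls mirror the two examples in the text: a triple with a control object and a triple with only a user intent.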
Furthermore, in the embodiments of the present application, the candidate description information may be the set of description information of the operations the electronic device can perform at the time the voice control instruction is acquired. These operations may be operations on the electronic device as a whole, such as powering off, switching the operating mode, or taking a photo. They may also include operations on a target interface, which may be the interface currently displayed by the electronic device. For operations on the target interface, the candidate description information may include the description information of each of the controls in the target interface, for example "烽火抗大" (Fenghuo Kangda), "奥运集锦" (Olympic Highlights), and "孤独的美食家第八季" (Solitary Gourmet, Season 8). The candidate description information may also include the description information corresponding to the operation instructions that apply to any interface as a whole, for example swipe left, swipe right, swipe up, swipe down, back, desktop, double tap, and long press.
In one approach, the second pinyin contents are obtained by acquiring the pinyin content of all candidate description information. Optionally, the description information of each control in the target interface may be acquired as candidate description information and converted into the corresponding pinyin content to obtain a plurality of second pinyin contents. Optionally, the description information corresponding to the interface-wide operation instructions may also be acquired as candidate description information and converted into the corresponding pinyin content to obtain a plurality of second pinyin contents. The plurality of second pinyin contents may also include both the pinyin content of the description information for the interface-wide operation instructions and the pinyin content of the description information for each control in the target interface.
In the embodiments of the present application, the description information of each control in the target interface may be acquired by a system program. In this approach, the electronic device uses the system program to parse the code corresponding to the target interface and obtains information such as the type, position, and size of each control as that control's description information.
Note that the pinyin corresponding to a piece of text can be obtained in several ways, for example with the Python libraries pypinyin and xpinyin or the Java library pinyin4J; the method of converting text to pinyin can be chosen according to the actual development environment.
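As a minimal sketch of the text-to-pinyin step, the snippet below uses `lazy_pinyin` from pypinyin, which returns toneless pinyin syllables. The `to_pinyin` helper, the `second_pinyin` mapping, and the small fallback table (used only when pypinyin is not installed) are assumptions added for illustration and are not part of the application.

```python
# Hypothetical fallback table, used only when pypinyin is not installed.
_FALLBACK = {"返回": "fan hui", "桌面": "zhuo mian"}

try:
    from pypinyin import lazy_pinyin  # third-party: pip install pypinyin
except ImportError:
    lazy_pinyin = None

def to_pinyin(text: str) -> str:
    """Return space-separated, toneless pinyin for a Chinese string."""
    if lazy_pinyin is not None:
        return " ".join(lazy_pinyin(text))
    return _FALLBACK[text]

# Build the second pinyin contents from candidate description information,
# keeping a mapping back to the original description.
candidates = ["返回", "桌面"]
second_pinyin = {to_pinyin(c): c for c in candidates}
print(second_pinyin)
```

Keeping the mapping from pinyin back to the description makes it straightforward to recover the description information once a pinyin match is found.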
S120: Acquire third pinyin content when no second pinyin content successfully matches the first pinyin content, the third pinyin content being pinyin content similar to the first pinyin content.
After the first pinyin content and the second pinyin contents are obtained, it can be checked whether any of the second pinyin contents successfully matches the first pinyin content. Optionally, a second pinyin content is determined to match the first pinyin content successfully if the two are identical. For example, if the first pinyin content is "shao hua" and the second pinyin content currently being compared is "shang hua", the two do not match, because "ao" in the first pinyin content differs from "ang" in the second.
In the embodiments of the present application, as shown in FIG. 4, in one approach acquiring the third pinyin content includes:
S121: Acquire the similar phonemes corresponding to the specified phonemes in the first pinyin content.
In one approach, each phoneme in the first pinyin content is looked up in a phoneme expansion table to determine whether it has a phoneme correspondence, where each phoneme correspondence represents a pair of similar phonemes. A phoneme that has such a correspondence is taken as a specified phoneme, and its similar phonemes are determined from the correspondence.
A phoneme (phone) is the smallest unit of speech divided according to the natural attributes of speech; one articulatory action forms one phoneme. In Chinese, phonemes can be divided into initials and finals. Under the Hanyu Pinyin spelling rules, when y is added before the final i or a compound final beginning with i (such as i, ia, ie, iao, iou, ian, in, iang, iong), the result is written as yi, ya, ye, yao, you, yan, yin, yang, yong, and so on. When the final ü follows the initial j, q, or x, or has no initial, the two dots above ü may be omitted and it is written as u, as in yu, yue, yuan, yun, ju, qu, xu; when the final ü follows the initial n or l, it is written as nü, lü. Therefore, in some cases u can stand in for ü.
Moreover, because of regional accents, language habits, and similar influences, users may confuse certain similar phonemes, which leads to inaccurate recognition of their voice control instructions. A phoneme expansion table such as the one shown in Table 1 can therefore be formed by combining the Hanyu Pinyin spelling rules with common errors in Chinese pronunciation.
Table 1
Figure PCTCN2022107788-appb-000001
For example, when the user's voice control instruction "上划" (swipe up) is recognized as the near-homophone "韶华" and "shao hua" is taken as the first pinyin content, the phonemes in the first pinyin content are sh, ao, h, and ua. The phoneme expansion table then yields the following phoneme correspondences: [sh, s], [sh, c], [sh, xi], [sh, zh], [ao, ou], [ao, iao], [ao, ang], [h, f]. Thus sh, ao, and h are taken as the specified phonemes, and their similar phonemes are determined from the correspondences to be s, c, xi, zh, ou, iao, ang, and f.
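The lookup in S121 can be sketched as below. The `SIMILAR` dictionary is only a partial excerpt of the phoneme expansion table, containing just the correspondences quoted in the example above; the full Table 1 is not reproduced here. The `INITIALS` list and the helper names are assumptions chosen for illustration, and the split into initial and final is a simple prefix match rather than a complete pinyin parser.

```python
from typing import Dict, List

# Partial excerpt of the phoneme expansion table: only the pairs quoted in
# the "shao hua" example. The full table is an assumption left to the reader.
SIMILAR: Dict[str, List[str]] = {
    "sh": ["s", "c", "xi", "zh"],
    "ao": ["ou", "iao", "ang"],
    "h": ["f"],
}

# Two-letter initials come first so that "sh" is not split as "s" + "h...".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable: str) -> List[str]:
    """Split one pinyin syllable into [initial, final], or [final] alone."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]

def phonemes(pinyin: str) -> List[str]:
    return [p for s in pinyin.split() for p in split_syllable(s) if p]

def specified_phonemes(pinyin: str) -> Dict[str, List[str]]:
    """Map each phoneme that has a table entry to its similar phonemes."""
    return {p: SIMILAR[p] for p in phonemes(pinyin) if p in SIMILAR}

print(phonemes("shao hua"))
print(specified_phonemes("shao hua"))
```

Here `ua` has no entry in the excerpt, so it is not a specified phoneme, which agrees with the example in the text.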
S122: Replace the specified phonemes in the first pinyin content with the similar phonemes to obtain the third pinyin content.
In another approach, pinyin content that is similar to the first pinyin content as a whole may be acquired as the third pinyin content. In this approach, features of the pinyin content of a number of words are acquired in advance as reference features. After the first pinyin content is obtained, its feature is acquired in the same way and compared with each of the pre-acquired reference features, and the pinyin content corresponding to a reference feature that compares successfully is taken as the third pinyin content. A reference feature compares successfully when it is the same as the feature of the first pinyin content. Any related way of acquiring data features is applicable to acquiring the features of pinyin content, and the specific way is not limited in the embodiments of the present application. For example, the features of pinyin content may be acquired in the form of text vectors.
S130: Match the third pinyin content against the plurality of second pinyin contents, and take, as target description information, the description information whose corresponding second pinyin content successfully matches the third pinyin content.
For example, the third pinyin content may be {"sao hua", "cao hua", "xiao hua", "zhao hua", "shou hua", "shiao hua", "shang hua", "shao fua"}, and the second pinyin contents may be {"feng huo kang da" (烽火抗大), "ao yun ji jin" (奥运集锦), "gu du de mei shi jia di ba ji" (孤独的美食家第八季), ..., "zuo hua", "you hua", "shang hua", "xia hua", "fan hui", "zhuo mian", "shuang ji", "chang an"}. Matching the third pinyin content against the second pinyin contents yields the target description information "shang hua".
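The matching in S130 can be illustrated with a short sketch. It assumes the second pinyin contents are held in a dictionary that maps each pinyin string back to its description information; the `find_target` function and that data layout are hypothetical choices made for this example, and only part of the candidate set from the text is listed.

```python
from typing import Dict, Iterable, Optional

def find_target(third: Iterable[str], second: Dict[str, str]) -> Optional[str]:
    """Return the description whose pinyin equals one of the third pinyin
    contents, or None when nothing matches."""
    for candidate in third:
        if candidate in second:
            return second[candidate]
    return None

third = ["sao hua", "cao hua", "xiao hua", "zhao hua",
         "shou hua", "shiao hua", "shang hua", "shao fua"]
second = {
    "feng huo kang da": "烽火抗大",
    "ao yun ji jin": "奥运集锦",
    "zuo hua": "左划",
    "you hua": "右划",
    "shang hua": "上划",
    "xia hua": "下划",
    "fan hui": "返回",
}
print(find_target(third, second))
```

A dictionary lookup keeps each comparison constant time, which may matter when an interface exposes many controls.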
S140: Execute the control operation corresponding to the target description information.
In one approach, the target description information is the description information of a control in the target interface. The control operation corresponding to the target description information can then be executed on the electronic device by event injection or simulated tapping, using the user intent and the auxiliary information in the triple to which the corresponding control belongs. For example, if the target description information is "sou suo kuang" (search box), the user intent and auxiliary information in the triple {查找 (search), 搜索框 (search box), 快乐的大脚 (Happy Feet)} can be used to inject the event "enter 快乐的大脚 in the search box", thereby executing on the electronic device the control operation corresponding to "sou suo kuang". As another example, if the target description information is "ao yun ji jin", the user intent in the triple {点击 (tap), 奥运集锦 (Olympic Highlights), Φ} can be used to tap the Olympic Highlights control, thereby executing on the electronic device the control operation corresponding to "ao yun ji jin".
In another approach, the target description information is the description information corresponding to an interface-wide operation instruction. For example, if the target description information is "shang hua", the swipe-up operation can be executed directly on the electronic device.
In the voice control method provided by this embodiment, the pinyin content corresponding to the voice control instruction is acquired as the first pinyin content, and the pinyin content of the candidate description information is acquired as the plurality of second pinyin contents. If it is determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is acquired as the third pinyin content. The third pinyin content is then matched against the plurality of second pinyin contents, the description information whose corresponding second pinyin content successfully matches the third pinyin content is taken as the target description information, and the control operation corresponding to the target description information is executed.
In this way, after the pinyin content converted directly from the voice control instruction is obtained, if that content cannot be successfully matched with the pinyin content of the candidate description information, similar pinyin content can be derived from it and matched against the pinyin content of the candidate description information. This raises the probability that a user-triggered voice control instruction is successfully matched to description information, which in turn helps raise the probability of executing voice control accurately. In addition, this embodiment draws on the concept of the phoneme from linguistics and acoustics: a confusion expansion table of initials and finals is built from common errors in Mandarin Chinese, and pinyin that cannot be matched exactly is fuzzily expanded and then matched again. This addresses homophone errors that arise during speech recognition and can also effectively address recognition errors caused by non-standard pronunciation.
Referring to FIG. 5, the present application provides a voice control method, the method comprising:
S210: Acquire first pinyin content and a plurality of second pinyin contents, the first pinyin content being the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents including the pinyin content of candidate description information, and the description information being information used to describe a corresponding operation.
S220: Acquire the similar phonemes corresponding to the specified phonemes in the first pinyin content.
S230: When no second pinyin content successfully matches the first pinyin content, replace the specified phonemes in the first pinyin content with the similar phonemes to obtain the third pinyin content.
In this embodiment, there may be multiple specified phonemes. In one approach, each similar phoneme is used in turn to replace its specified phoneme in the first pinyin content, and the first pinyin content obtained after each replacement is taken as third pinyin content.
For example, if the first pinyin content is "shao hua", Table 1 shows that its specified phonemes may be sh, ao and h, where the similar phonemes of sh are s, c, xi and zh, the similar phonemes of ao are ou, iao and ang, and the similar phoneme of h is f. Replacing the specified phonemes one at a time with these similar phonemes gives the third pinyin content {"sao hua", "cao hua", "xiao hua", "zhao hua", "shou hua", "shiao hua", "shang hua", "shao fua"}.
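The single-phoneme replacement described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `SIMILAR` dictionary holds only the Table 1 entries quoted in the example, and the representation of each syllable as an (initial, final) tuple and the function name `expand_single` are assumptions made for the sketch.

```python
# Hypothetical excerpt of the confusion table: only the entries quoted in
# the "shao hua" example are included; Table 1 itself is not reproduced.
SIMILAR = {
    "sh": ["s", "c", "xi", "zh"],
    "ao": ["ou", "iao", "ang"],
    "h": ["f"],
}


def expand_single(syllables):
    """Replace one specified phoneme at a time.

    `syllables` is a list of (initial, final) tuples, so "shao hua" is
    [("sh", "ao"), ("h", "ua")]. Returns the list of expanded pinyin strings.
    """
    variants = []
    for i, syllable in enumerate(syllables):
        for j, phoneme in enumerate(syllable):
            # Phonemes without a table entry are not specified phonemes.
            for similar in SIMILAR.get(phoneme, []):
                replaced = list(syllable)
                replaced[j] = similar
                new = syllables[:i] + [tuple(replaced)] + syllables[i + 1:]
                variants.append(" ".join(ini + fin for ini, fin in new))
    return variants


print(expand_single([("sh", "ao"), ("h", "ua")]))
```

For the input above, the sketch yields the same eight variants as the example, in the same order.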
In another approach, as shown in FIG. 6, replacing the specified phoneme in the first pinyin content with the similar phoneme to obtain third pinyin content includes:
S231: Combine the similar phonemes corresponding to at least two specified phonemes with one another to obtain multiple phoneme pairs, where each phoneme pair contains one similar phoneme for each of the at least two specified phonemes.
In this embodiment, the similar phonemes corresponding to at least two specified phonemes may be combined with one another in the manner shown in FIG. 7. Referring to FIG. 7, suppose specified phoneme A has similar phonemes O, P and Q, specified phoneme B has similar phonemes R, S and T, and the first pinyin content is ABC. Each similar phoneme of A is then combined in turn with every similar phoneme of B, giving the phoneme pairs OR, OS, OT, PR, PS, PT, QR, QS and QT. For example, if the first pinyin content is "shao hua", the similar phonemes of the specified phonemes sh and ao may be combined in the same manner, giving the phoneme pairs sou, siao, sang, cou, ciao, cang, xiou, ..., zhang.
S232: Replace the corresponding specified phonemes in the first pinyin content with each of the phoneme pairs, to obtain the first replaced pinyin content corresponding to each phoneme pair.
In this embodiment, as shown in FIG. 7, after the phoneme pairs (OR, OS, OT, PR, PS, PT, QR, QS, QT) are obtained, replacing the corresponding specified phonemes in the first pinyin content ABC with each pair gives the first replaced pinyin content ORC, OSC, OTC, PRC, PSC, PTC, QRC, QSC and QTC. For example, the phoneme pair sou gives the first replaced pinyin content "sou hua", and the phoneme pair cang gives "cang hua".
S233: Replace each specified phoneme in the first pinyin content with its own similar phonemes, to obtain the second replaced pinyin content corresponding to each specified phoneme.
In this embodiment, the specified phonemes in the first pinyin content may be replaced in the manner shown in FIG. 8 to obtain the second replaced pinyin content corresponding to each specified phoneme. Referring to FIG. 8, suppose specified phoneme A has similar phonemes O, P and Q, specified phoneme B has similar phonemes R, S and T, and the first pinyin content is ABC. Replacing A with O, P and Q in turn gives the second replaced pinyin content OBC, PBC and QBC for A; replacing B with R, S and T gives the second replaced pinyin content ARC, ASC and ATC for B. For example, Table 1 shows that the specified phonemes of the first pinyin content "shao hua" are sh, ao and h; the second replaced pinyin content for sh is {"sao hua", "cao hua", "xiao hua", "zhao hua"}, for ao it is {"shou hua", "shiao hua", "shang hua"}, and for h it is {"shao fua"}.
S234: Take the first replaced pinyin content and the second replaced pinyin content together as the third pinyin content.
Compared with the first way of obtaining third pinyin content, taking both the first and the second replaced pinyin content as third pinyin content expands the first pinyin content further, so the range matched against the second pinyin content is wider and the probability of a successful match is higher.
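Steps S231 to S234 can be sketched on the abstract ABC example as below. This is an illustrative sketch only: the function name `third_content`, the list-of-phonemes input and the order-preserving de-duplication are assumptions, and `itertools.product` stands in for the pairwise combination of FIG. 7.

```python
from itertools import product

# Similar phonemes from the ABC example: A -> O, P, Q and B -> R, S, T.
SIMILAR = {"A": ["O", "P", "Q"], "B": ["R", "S", "T"]}


def third_content(content, similar):
    """Build third pinyin content from a list of phonemes (sketch)."""
    # Positions of the specified phonemes, i.e. those with table entries.
    positions = [i for i, p in enumerate(content) if p in similar]

    # S231-S232: combine one similar phoneme per specified phoneme and
    # replace all specified phonemes at once.
    first = []
    for combo in product(*(similar[content[i]] for i in positions)):
        replaced = list(content)
        for pos, phoneme in zip(positions, combo):
            replaced[pos] = phoneme
        first.append("".join(replaced))

    # S233: replace a single specified phoneme at a time.
    second = []
    for pos in positions:
        for phoneme in similar[content[pos]]:
            replaced = list(content)
            replaced[pos] = phoneme
            second.append("".join(replaced))

    # S234: union of both sets, keeping first-seen order.
    return list(dict.fromkeys(first + second))


print(third_content(["A", "B", "C"], SIMILAR))
```

With the ABC input, the nine combined forms (ORC through QTC) are followed by the six single-replacement forms (OBC, PBC, QBC, ARC, ASC, ATC), 15 candidates in total.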
S240: Match the third pinyin content against the second pinyin content, and take the description information whose second pinyin content is successfully matched with the third pinyin content as the target description information.
S250: Execute the control operation corresponding to the target description information.
In the voice control method of this embodiment, after the audio content directly converted from the voice control instruction is obtained, if that content cannot be matched against the pinyin content of the candidate description information, similar pinyin content can be derived from the directly converted content and matched against the pinyin content of the candidate description information. This raises the probability that a voice control instruction triggered by the user is matched to description information, and in turn the probability that voice control is executed accurately. In addition, in this embodiment the similar phonemes of a specified phoneme can be found by querying the phoneme expansion table, and third pinyin content can be obtained by replacing multiple specified phonemes with similar phonemes in several ways. Because the third pinyin content is a similarity-based expansion of the first pinyin content, the matching range grows, the probability of a successful match rises, and so does the probability of executing voice control accurately.
Referring to FIG. 9, the present application provides a voice control method applied to an electronic device, the method comprising:
S310: Acquire first pinyin content and multiple pieces of second pinyin content, where the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the second pinyin content includes the pinyin content of candidate description information, and the description information is information describing a corresponding operation.
S320: When no second pinyin content is successfully matched with the first pinyin content, acquire third pinyin content, the third pinyin content being pinyin content similar to the first pinyin content.
S330: Match the third pinyin content against the second pinyin content; when a second pinyin content is successfully matched with the third pinyin content, take the description information of that second pinyin content as the target description information.
S340: When no second pinyin content is successfully matched with the third pinyin content, acquire the similarity between each second pinyin content and the first pinyin content, to obtain the similarity corresponding to each second pinyin content.
As shown in FIG. 10, acquiring the similarity between each second pinyin content and the first pinyin content may include:
S341: Acquire, based on the longest common subsequence, a first reference similarity between each second pinyin content and the first pinyin content, to obtain the first reference similarity corresponding to each second pinyin content.
In this embodiment, the first reference similarity between each second pinyin content and the first pinyin content may be measured by the longest common subsequence (LCS), which may be computed as:
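$$
\mathrm{LCS}(A_i, B_j) =
\begin{cases}
\varnothing, & i = 0 \text{ or } j = 0 \\
\mathrm{LCS}(A_{i-1}, B_{j-1}) + a_i, & i, j > 0 \text{ and } a_i = b_j \\
\max\{\mathrm{LCS}(A_i, B_{j-1}),\ \mathrm{LCS}(A_{i-1}, B_j)\}, & i, j > 0 \text{ and } a_i \neq b_j
\end{cases}
$$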
Here $A_i$ denotes the string formed by the first $i$ characters of string $A$, with $i$ ranging from 0 to the length of $A$; likewise $B_j$ denotes the string formed by the first $j$ characters of string $B$, with $j$ ranging from 0 to the length of $B$; $a_i$ and $b_j$ denote the $i$-th character of $A$ and the $j$-th character of $B$. For example, let string $A$ be the first pinyin content and string $B$ one second pinyin content, with lengths 10 and 9 respectively. Then $i$ ranges from 0 to 10 and $j$ from 0 to 9. If $a_{10} = b_9$, then $\mathrm{LCS}(A_{10}, B_9) = \mathrm{LCS}(A_9, B_8) + a_{10}$; otherwise $\mathrm{LCS}(A_{10}, B_9) = \max\{\mathrm{LCS}(A_{10}, B_8),\ \mathrm{LCS}(A_9, B_9)\}$.
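The recurrence described above can be sketched as a standard dynamic-programming table. The function name `lcs` and the choice to store the subsequence itself (rather than only its length) are assumptions made for illustration; when two candidates have equal length, the sketch simply keeps the first one.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    # table[i][j] holds LCS(A_i, B_j) for the prefixes of length i and j;
    # row 0 and column 0 stay empty (the base case).
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Matching characters extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + a[i - 1]
            else:
                # Otherwise keep the longer of the two neighbouring results.
                table[i][j] = max(table[i][j - 1], table[i - 1][j], key=len)
    return table[len(a)][len(b)]


print(lcs("shaohua", "xiaohua"))
```

Storing strings keeps the sketch close to the recurrence; a production version would more likely store lengths only.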
The LCS similarity may be defined as:
Figure PCTCN2022107788-appb-000003
where |A| and |B| denote the lengths of strings A and B, that is, the number of characters in each. For example, if string A is "APPLE13", then |A| = 7.
S342: Acquire, based on edit distance, a second reference similarity between each second pinyin content and the first pinyin content, to obtain the second reference similarity corresponding to each second pinyin content.
In this embodiment, the degree of difference between each second pinyin content and the first pinyin content may be measured by the edit distance (Levenshtein distance, LEV). Because similarity is inversely related to the degree of difference, the second reference similarity between each second pinyin content and the first pinyin content may be measured by the following formula:
Figure PCTCN2022107788-appb-000004
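where LEV may be computed as:
$$
\mathrm{LEV}(A_i, B_j) =
\begin{cases}
\max(i, j), & \min(i, j) = 0 \\
\min\{\mathrm{LEV}(A_{i-1}, B_j) + 1,\ \mathrm{LEV}(A_i, B_{j-1}) + 1,\ \mathrm{LEV}(A_{i-1}, B_{j-1})\}, & i, j > 0 \text{ and } a_i = b_j \\
\min\{\mathrm{LEV}(A_{i-1}, B_j) + 1,\ \mathrm{LEV}(A_i, B_{j-1}) + 1,\ \mathrm{LEV}(A_{i-1}, B_{j-1}) + 1\}, & i, j > 0 \text{ and } a_i \neq b_j
\end{cases}
$$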
Here $A_i$ denotes the string formed by the first $i$ characters of string $A$, with $i$ ranging from 0 to the length of $A$; likewise $B_j$ denotes the string formed by the first $j$ characters of string $B$, with $j$ ranging from 0 to the length of $B$. For example, let string $A$ be the first pinyin content and string $B$ one second pinyin content, with lengths 10 and 9 respectively. Then $i$ ranges from 0 to 10 and $j$ from 0 to 9. If $a_{10} = b_9$, then $\mathrm{LEV}(A_{10}, B_9) = \min\{\mathrm{LEV}(A_9, B_9) + 1,\ \mathrm{LEV}(A_{10}, B_8) + 1,\ \mathrm{LEV}(A_9, B_8)\}$; otherwise $\mathrm{LEV}(A_{10}, B_9) = \min\{\mathrm{LEV}(A_9, B_9) + 1,\ \mathrm{LEV}(A_{10}, B_8) + 1,\ \mathrm{LEV}(A_9, B_8) + 1\}$.
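The edit distance described above is the standard Levenshtein distance, which can be sketched with a full dynamic-programming table as follows. The function name `lev` is an assumption chosen to mirror the notation in the text.

```python
def lev(a: str, b: str) -> int:
    """Levenshtein edit distance between strings a and b."""
    # dist[i][j] holds LEV(A_i, B_j) for the prefixes of length i and j.
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters of the prefix of a
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters of the prefix of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(a)][len(b)]


print(lev("shaohua", "xiaohua"))
```

The two strings differ only in their first two characters, so the sketch reports a distance of 2.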
S343: Add the first reference similarity and the second reference similarity corresponding to each second pinyin content, to obtain the similarity corresponding to each second pinyin content.
In one approach, the first and second reference similarities are added directly to obtain the similarity corresponding to each second pinyin content:
$$S(A, B) = S_{LCS}(A, B) + S_{LEV}(A, B)$$
In another approach, the first and second reference similarities are each assigned a weight and are added after weighting to obtain the similarity corresponding to each second pinyin content:
$$S(A, B) = X \times S_{LCS}(A, B) + Y \times S_{LEV}(A, B)$$
where $X + Y = 1$.
S350: Take the description information corresponding to the second pinyin content with the highest similarity as the target description information.
As shown in FIG. 11, taking the description information corresponding to the second pinyin content with the highest similarity as the target description information includes:
S351: If exactly one second pinyin content has the highest similarity, take the description information corresponding to that second pinyin content as the target description information.
S352: If multiple second pinyin contents share the highest similarity, acquire the text vector of the text content corresponding to the voice control instruction as a first text vector.
In this embodiment, the user's voice control instruction may contain abbreviations or short names, in which case the longest common subsequence and the edit distance may yield several equally similar results. For example, suppose the voice control instruction is "复联" and the set of second pinyin content includes {"复仇者联盟4" (Avengers 4), "复印几张对联" (copy a few couplets)}. The longest common subsequence of "复联" with each of the two candidates is "复联", and the edit distance to each is 4, so the computed similarities are equal and no unique result can be determined. As another example, if the voice control instruction is "B站" and the set of second pinyin content includes {哔哩哔哩 (Bilibili), Q Music, A Cloud Music, B Music}, no matching result can be obtained. In such cases, semantic similarity can be used to measure how similar each second pinyin content is to the first pinyin content, so that a single most similar second pinyin content is obtained.
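The weighted similarity of S343 and the tie in the "复联" example can be sketched together as below. The formulas for the two reference similarities are not recoverable from the text, so the normalizations used here (LCS length divided by the longer string's length, and one minus edit distance divided by the longer string's length) are assumptions for illustration only, as are the function names and the default weights X = Y = 0.5.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lev(a, b):
    """Levenshtein edit distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ch in enumerate(a, 1):
        cur = [i]
        for j, other in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ch != other)))
        prev = cur
    return prev[-1]


def similarity(a, b, x=0.5, y=0.5):
    """S(A, B) = X * S_LCS + Y * S_LEV with X + Y = 1 (normalization assumed)."""
    longest = max(len(a), len(b)) or 1
    s_lcs = lcs_len(a, b) / longest   # assumption: LCS length / longer length
    s_lev = 1 - lev(a, b) / longest   # assumption: 1 - distance / longer length
    return x * s_lcs + y * s_lev


def most_similar(first, candidates):
    """Return every candidate that shares the maximum similarity."""
    scores = {c: similarity(first, c) for c in candidates}
    best = max(scores.values())
    return [c for c, s in scores.items() if s == best]


tied = most_similar("复联", ["复仇者联盟4", "复印几张对联"])
print(tied)  # more than one entry means a semantic tie-break is needed
```

Both candidates have an LCS of length 2 and an edit distance of 4 against "复联", so under these assumed normalizations they receive the same score and both are returned.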
In one approach, the text vector can be obtained with the pre-trained model BERT. BERT is a deep neural network; the text to be processed is fed into the encoder part of BERT to obtain the corresponding text vector.
In this embodiment, the text input for the first text vector may be the text content of the voice control instruction obtained through the ASR module, the text content of the triplet corresponding to the voice control instruction obtained through the ASR module and the NLP module, or the text content corresponding to the third pinyin content.
S353: Acquire the text vectors of the description information corresponding to each of the second pinyin contents that share the highest similarity, to obtain multiple second text vectors.
In this embodiment, the text input for a second text vector may be the text description information of each of the controls in the target interface, obtained through a system program, or the text description information of an operation instruction for the interface as a whole, for example: swipe left, swipe right, swipe up, swipe down, back, home screen, double tap, long press.
It should be noted that the text input for a text vector may be a string of Chinese characters or a pinyin string.
It should also be noted that the embodiments of this application may obtain text vectors with tools such as Doc2Vec (document to vector), or with open-source pre-trained models such as RoBERTa, UniLM, ELECTRA and XLNet.
S354: Compute the vector distance between each second text vector and the first text vector.
In one approach, the vector distance between each second text vector and the first text vector is computed from the cosine similarity, using the following formula:
Figure PCTCN2022107788-appb-000006
S355: Take the description information corresponding to the second text vector with the smallest vector distance as the target description information.
In one approach, after the vector distances between the second text vectors and the first text vector are obtained, the distances are sorted by magnitude, and the description information corresponding to the second text vector with the smallest vector distance is taken as the target description information.
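Steps S354 and S355 can be sketched with plain Python lists standing in for text vectors. The exact distance formula is not recoverable from the text, so this sketch assumes the distance is one minus the cosine similarity; the function names and the toy two-dimensional vectors are likewise illustrative and are not outputs of BERT or any other model.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors (assumed distance)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def pick_target(first_vector, second_vectors):
    """Return the description whose vector is closest to first_vector.

    `second_vectors` maps each description to its (toy) text vector.
    """
    return min(
        second_vectors,
        key=lambda desc: cosine_distance(first_vector, second_vectors[desc]),
    )


candidates = {"description A": [0.9, 0.1], "description B": [0.0, 1.0]}
print(pick_target([1.0, 0.0], candidates))
```

Using `min` avoids sorting all distances when only the closest candidate is needed; zero vectors are not handled and would raise a division error.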
It should be noted that, because text vectors are distributed continuously in a high-dimensional space, the probability that two text vectors yield exactly the same similarity value is negligible. The description information corresponding to a unique second text vector can therefore be determined as the target description information.
In this way, when no unique match is obtained because the user's voice control instruction contains abbreviations or short names, the vector distances between the second text vectors and the first text vector can be computed to obtain the target description information of a unique match, so that the corresponding control operation can be executed. This further raises the success rate of semantic recognition.
S360: Execute the control operation corresponding to the target description information.
It should be noted that, in this embodiment, if it is determined while executing S350 that multiple second pinyin contents share the highest similarity, the text vector corresponding to the first pinyin content may also be acquired as the first text vector. The text vector corresponding to the third pinyin content may likewise be acquired as the first text vector. In that case there may be multiple first text vectors; the vector distance between each first text vector and each second text vector is computed, and the description information corresponding to the second text vector with the shortest vector distance is taken as the target description information. For example, suppose the first text vectors obtained from the third pinyin content are L1, L2 and L3, and the second text vectors are L4 and L5. When computing vector distances, the distances from L1 to L4 and to L5, from L2 to L4 and to L5, and from L3 to L4 and to L5 are all computed.
In the voice control method of this embodiment, after the audio content directly converted from the voice control instruction is obtained, if that content cannot be matched against the pinyin content of the candidate description information, similar pinyin content can be derived from the directly converted content and matched against the pinyin content of the candidate description information. This raises the probability that a voice control instruction triggered by the user is matched to description information, and in turn the probability that voice control is executed accurately. In addition, in this embodiment, when no second pinyin content is successfully matched with the third pinyin content, the similarity between each second pinyin content and the first pinyin content can be acquired, and the description information corresponding to the second pinyin content with the highest similarity is taken as the target description information. This addresses the difficulty of matching when the user's description of an interface control omits or alters words, as well as the difficulty of matching when the user refers to a control by an abbreviation or alias, so that the control operation corresponding to the target description information can be executed and the probability of executing voice control accurately is improved.
Furthermore, this solution matches voice control instructions to description information by semantic similarity: a large-scale pre-trained model vectorizes the instruction text to be matched (the text converted from the voice control instruction), and the match is made by vector similarity. This handles cases where the voice control instruction and the description information differ considerably in wording but have the same meaning.
To aid understanding of the solutions in all embodiments of this application, one implementation flow of the voice control method is described below.
Referring to FIG. 12, after step S4010 is executed to acquire the first pinyin content and the second pinyin content, the first pinyin content can be matched against the second pinyin content. When a second pinyin content is successfully matched with the first pinyin content, the description information of that second pinyin content is taken as the target description information and the corresponding control operation is executed. When no second pinyin content is successfully matched with the first pinyin content, the operation of acquiring third pinyin content is performed. Specifically, Table 1 is consulted to check whether each phoneme in the first pinyin content has a phoneme correspondence in the phoneme expansion table; phonemes that do are taken as specified phonemes, their similar phonemes are determined from the correspondence, and the specified phonemes in the first pinyin content are replaced with the similar phonemes to obtain the third pinyin content.
After step S4050 is executed to acquire the third pinyin content, the third pinyin content can be matched against the second pinyin content. If a second pinyin content is successfully matched with the third pinyin content, the description information of that second pinyin content is taken as the target description information. When no second pinyin content is successfully matched with the third pinyin content, step S4090 can be executed to acquire the similarity between each second pinyin content and the first pinyin content; the description information corresponding to the second pinyin content with the highest similarity is then taken as the target description information, and the corresponding control operation is executed.
Specifically, the reference similarities between each second pinyin content and the first pinyin content can be obtained based on the longest common subsequence and the edit distance. If exactly one second pinyin content has the highest similarity, its description information is taken as the target description information and the corresponding control operation is executed. If multiple second pinyin contents share the highest similarity, the text vector of the text content corresponding to the voice control instruction is acquired as the first text vector, and the text vectors of the description information corresponding to those second pinyin contents are acquired as the second text vectors. The vector distance between each second text vector and the first text vector is then computed, the description information corresponding to the second text vector with the smallest vector distance is taken as the target description information, and the corresponding control operation is executed.
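The overall flow of FIG. 12 can be sketched as a three-stage cascade. This is a structural illustration under stated assumptions: "successful matching" is treated as exact string equality, and `expand`, `similarity` and `semantic_pick` are hypothetical callables supplied by the caller for phoneme expansion, LCS and edit-distance similarity, and the text-vector tie-break. The candidate dictionary is assumed to be non-empty.

```python
def choose_description(first, candidates, expand, similarity, semantic_pick):
    """Pick the target description for the first pinyin content.

    `candidates` maps each second pinyin content to its description.
    """
    # Stage 1: direct match of the first pinyin content.
    if first in candidates:
        return candidates[first]

    # Stage 2: match the third pinyin content (fuzzy phoneme expansion).
    for third in expand(first):
        if third in candidates:
            return candidates[third]

    # Stage 3: fall back to the highest similarity score.
    scores = {second: similarity(first, second) for second in candidates}
    best = max(scores.values())
    tied = [candidates[s] for s, value in scores.items() if value == best]

    # A unique maximum is used directly; ties go to the semantic step.
    return tied[0] if len(tied) == 1 else semantic_pick(tied)


# Toy stand-ins, used only to exercise the control flow.
controls = {"xiao hua": "Jokes", "tian qi": "Weather"}
result = choose_description(
    "shao hua",
    controls,
    expand=lambda p: ["sao hua", "xiao hua"],
    similarity=lambda a, b: len(set(a) & set(b)),
    semantic_pick=lambda tied: tied[0],
)
print(result)
```

Passing the stages in as callables keeps the cascade independent of how expansion, scoring and vectorization are actually implemented.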
Referring to FIG. 13, the present application provides a voice control apparatus 600. The apparatus 600 includes:
A first pinyin content and second pinyin content acquisition unit 610, configured to obtain first pinyin content and a plurality of second pinyin contents, where the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of candidate description information, and the description information is information describing a corresponding operation.
A third pinyin content acquisition unit 620, configured to obtain third pinyin content when no second pinyin content successfully matches the first pinyin content, where the third pinyin content is pinyin content similar to the first pinyin content.
A pinyin content matching unit 630, configured to match the third pinyin content against the plurality of second pinyin contents, and to take the description information corresponding to the second pinyin content that successfully matches the third pinyin content as target description information.
A control operation execution unit 640, configured to execute the control operation corresponding to the target description information.
In one implementation, the first pinyin content and second pinyin content acquisition unit 610 is specifically configured to obtain the description information of each of a plurality of controls included in a target interface as candidate description information, and to convert the candidate description information into corresponding pinyin content to obtain the plurality of second pinyin contents.
In one implementation, the third pinyin content acquisition unit 620 is specifically configured to obtain a similar phoneme corresponding to a specified phoneme in the first pinyin content, and to replace the specified phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content. Where there are multiple similar phonemes, the third pinyin content acquisition unit 620 is optionally configured to replace the specified phoneme in the first pinyin content with each of the similar phonemes in turn, and to take the resulting phoneme-substituted versions of the first pinyin content, one per similar phoneme, as the third pinyin content. Optionally, the third pinyin content acquisition unit 620 is specifically configured to: combine the similar phonemes corresponding to at least two specified phonemes to obtain a plurality of phoneme pairs, each phoneme pair including one similar phoneme for each of the at least two specified phonemes; replace the corresponding specified phonemes in the first pinyin content based on each of the phoneme pairs to obtain first replacement pinyin content corresponding to each phoneme pair; replace each specified phoneme in the first pinyin content individually with its corresponding similar phoneme to obtain second replacement pinyin content corresponding to each specified phoneme; and take the first replacement pinyin content and the second replacement pinyin content as the third pinyin content.
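The substitution scheme above can be illustrated with a short sketch. It assumes that the first pinyin content has already been split into a list of phonemes, and it uses a hypothetical lookup table `SIMILAR` whose entries are examples only; the application does not list the actual similar-phoneme pairs. The function name `third_pinyin_candidates` is likewise invented for this example, and the combined step substitutes all specified phonemes at once, which yields phoneme pairs when exactly two phonemes are specified.

```python
from itertools import product

# Hypothetical similar-phoneme table; the entries are illustrative assumptions.
SIMILAR = {
    "sh": ["s"], "s": ["sh"],
    "in": ["ing"], "ing": ["in"],
    "l": ["n"], "n": ["l"],
}


def third_pinyin_candidates(phonemes, table=SIMILAR):
    """Generate candidate third pinyin contents from a list of phonemes."""
    positions = [i for i, p in enumerate(phonemes) if p in table]
    results = []

    # Second replacement pinyin content: substitute one specified phoneme at a time.
    for i in positions:
        for similar in table[phonemes[i]]:
            variant = list(phonemes)
            variant[i] = similar
            results.append(variant)

    # First replacement pinyin content: substitute every specified phoneme
    # together, one similar phoneme per specified phoneme.
    if len(positions) >= 2:
        for combo in product(*(table[phonemes[i]] for i in positions)):
            variant = list(phonemes)
            for i, similar in zip(positions, combo):
                variant[i] = similar
            results.append(variant)

    return ["".join(v) for v in results]
```

For the phoneme list `["sh", "i", "p", "in"]`, the sketch yields the two single substitutions followed by the combined substitution.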
In another implementation, the third pinyin content acquisition unit 620 is specifically configured to query whether each phoneme included in the first pinyin content has a corresponding phoneme correspondence in a phoneme extension table, where each phoneme correspondence represents a pair of similar phonemes; to take each phoneme for which a phoneme correspondence is found as a specified phoneme; and to determine the similar phoneme corresponding to the specified phoneme based on the phoneme correspondence.
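A minimal sketch of the table lookup described above follows. The contents of `PHONEME_EXTENSION_TABLE` and the names `build_lookup` and `find_specified_phonemes` are assumptions for illustration; the application states only that each table entry represents a pair of similar phonemes.

```python
# Hypothetical phoneme extension table: each entry is one pair of similar phonemes.
PHONEME_EXTENSION_TABLE = [("l", "n"), ("z", "zh"), ("in", "ing")]


def build_lookup(pairs):
    """Turn a list of similar-phoneme pairs into a symmetric lookup dict."""
    lookup = {}
    for a, b in pairs:
        lookup.setdefault(a, []).append(b)
        lookup.setdefault(b, []).append(a)
    return lookup


def find_specified_phonemes(phonemes, pairs=PHONEME_EXTENSION_TABLE):
    """Return {specified phoneme: [similar phonemes]} for phonemes found in the table."""
    lookup = build_lookup(pairs)
    return {p: lookup[p] for p in phonemes if p in lookup}
```

Phonemes without an entry in the table are simply not treated as specified phonemes and are left unchanged.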
In one implementation, the pinyin content matching unit 630 is specifically configured to match the first pinyin content against the plurality of second pinyin contents and, when no second pinyin content successfully matches the first pinyin content, to trigger the acquisition of the third pinyin content. Optionally, the pinyin content matching unit 630 is specifically configured to, when a second pinyin content successfully matches the first pinyin content, take the description information corresponding to that second pinyin content as the target description information, after which the control operation corresponding to the target description information is executed.
In another implementation, the pinyin content matching unit 630 is specifically configured to: match the third pinyin content against the plurality of second pinyin contents; when a second pinyin content successfully matches the third pinyin content, take the description information corresponding to that second pinyin content as the target description information; when no second pinyin content successfully matches the third pinyin content, obtain the similarity between each of the plurality of second pinyin contents and the first pinyin content; and take the description information corresponding to the second pinyin content with the highest similarity as the target description information. Optionally, the pinyin content matching unit 630 is specifically configured to: obtain, based on the longest common subsequence, a first reference similarity between each second pinyin content and the first pinyin content; obtain, based on the edit distance, a second reference similarity between each second pinyin content and the first pinyin content; and add the first reference similarity and the second reference similarity of each second pinyin content to obtain the similarity corresponding to that second pinyin content. Optionally, the pinyin content matching unit 630 is specifically configured to: if exactly one second pinyin content has the highest similarity, take the description information corresponding to that second pinyin content as the target description information; if multiple second pinyin contents share the highest similarity, obtain the text vector of the text content corresponding to the voice control instruction as a first text vector, obtain the text vectors of the description information corresponding to each of those second pinyin contents as a plurality of second text vectors, calculate the vector distance between each second text vector and the first text vector, and take the description information corresponding to the second text vector with the smallest vector distance as the target description information.
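The tie-breaking step by text-vector distance can be sketched as below. The application does not specify how text vectors are produced or which distance is used, so this example substitutes a toy character-count vector and Euclidean distance purely as stand-ins; `text_vector`, `vector_distance`, and `break_tie` are hypothetical names. A real system would presumably use a learned text embedding instead.

```python
import math
from collections import Counter


def text_vector(text):
    """Toy stand-in for a text embedding: a bag-of-characters count vector."""
    return Counter(text)


def vector_distance(u, v):
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def break_tie(command_text, tied_descriptions):
    """Pick the tied description whose vector is closest to the command text."""
    first = text_vector(command_text)  # first text vector
    return min(
        tied_descriptions,
        key=lambda desc: vector_distance(first, text_vector(desc)),  # second text vectors
    )
```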
An electronic device provided by the present application is described below with reference to FIG. 14.
Referring to FIG. 14, based on the voice control method and apparatus described above, an embodiment of the present application further provides an electronic device 1000 capable of performing the voice control method. The electronic device 1000 includes one or more processors 102 (only one is shown in the figure), a memory 104, a camera 106, and an audio collection device 108, which are coupled to one another. The memory 104 stores a program capable of carrying out the foregoing embodiments, and the processor 102 can execute the program stored in the memory 104.
The processor 102 may include one or more processing cores. The processor 102 connects the various parts of the electronic device 1000 through various interfaces and lines, and performs the functions of the electronic device 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 104 and by calling the data stored in the memory 104. Optionally, the processor 102 may be implemented in at least one of the following hardware forms: digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 102 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also be implemented as a separate communication chip rather than being integrated into the processor 102. In one implementation, the processor 102 may be a neural network chip, for example an embedded neural network chip (NPU).
The memory 104 may include random access memory (RAM) and may also include read-only memory (ROM). The memory 104 may be used to store instructions, programs, code, code sets, or instruction sets. For example, an apparatus may be stored in the memory 104, and the apparatus may be the aforementioned apparatus 600. The memory 104 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the following method embodiments, and the like.
Furthermore, in addition to the components described above, the electronic device 1000 may include a network module 110 and a sensor module 112.
The network module 110 is used to implement information exchange between the electronic device 1000 and other devices, for example, to transmit device control instructions, manipulation request instructions, and status information acquisition instructions. When the electronic device 1000 is embodied as different devices, the corresponding network module 110 may differ.
The sensor module 112 may include at least one sensor. Specifically, the sensor module 112 may include, but is not limited to, a level, a light sensor, a motion sensor, a pressure sensor, an infrared thermal sensor, a distance sensor, an acceleration sensor, and other sensors.
The pressure sensor may be a sensor that detects pressure generated by pressing on the electronic device 1000. That is, the pressure sensor detects pressure generated by contact or pressing between the user and the electronic device, for example between the user's ear and a mobile terminal. The pressure sensor can therefore be used to determine whether contact or pressing has occurred between the user and the electronic device 1000, and the magnitude of the pressure.
The acceleration sensor can detect the magnitude of acceleration in each direction (generally along three axes) and, when the device is stationary, the magnitude and direction of gravity. It can be used in applications that recognize the attitude of the electronic device 1000 (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer or tap detection). In addition, the electronic device 1000 may be equipped with other sensors such as a gyroscope, a barometer, a hygrometer, and a thermometer, which are not described further here.
The audio collection device 108 is used to collect audio signals. Optionally, the audio collection device 108 includes a plurality of audio collection components, each of which may be a microphone.
In one implementation, the network module of the electronic device 1000 is a radio frequency module, which is used to receive and transmit electromagnetic waves and to convert between electromagnetic waves and electrical signals, thereby communicating with a communication network or other devices. The radio frequency module may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a subscriber identity module (SIM) card, and a memory. For example, the radio frequency module may interact with external devices through the electromagnetic waves it transmits or receives, and may send instructions to a target device.
Referring to FIG. 15, a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application is shown. The computer-readable storage medium 800 stores program code that can be called by a processor to perform the methods described in the foregoing method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. Optionally, the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written into one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.
In summary, in the voice control method, apparatus, electronic device, and readable storage medium provided by the present application, the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content, and the pinyin content of the candidate description information is obtained as the plurality of second pinyin contents. If it is determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content. The third pinyin content is then matched against the plurality of second pinyin contents, the description information corresponding to the second pinyin content that successfully matches the third pinyin content is taken as the target description information, and the control operation corresponding to the target description information is executed.
In this way, when the content directly converted from the voice control instruction cannot be successfully matched with the pinyin content of the candidate description information, similar pinyin content can be derived from the directly converted content and matched against the pinyin content of the candidate description information. This increases the probability that a voice control instruction triggered by the user is successfully matched to description information, which in turn helps increase the probability that voice control is executed accurately.
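The overall flow summarized above can be sketched as a three-stage lookup. This is an illustrative outline under stated assumptions, not the implementation of the application: "successful matching" is assumed here to mean exact string equality, and the function name `select_target_description` and the callables `similar_variants` and `fallback` are hypothetical placeholders for the similar-pinyin generation and the similarity-based selection described earlier.

```python
def select_target_description(first_pinyin, candidates, similar_variants, fallback=None):
    """Choose the target description for a spoken command.

    candidates       -- dict mapping description text to its second pinyin content
    similar_variants -- callable returning third pinyin contents for a first pinyin
    fallback         -- optional callable(first_pinyin, candidates) used when
                        neither the first nor any third pinyin content matches
    """
    by_pinyin = {pinyin: desc for desc, pinyin in candidates.items()}

    # Stage 1: direct match of the first pinyin content.
    if first_pinyin in by_pinyin:
        return by_pinyin[first_pinyin]

    # Stage 2: match any similar (third) pinyin content.
    for third_pinyin in similar_variants(first_pinyin):
        if third_pinyin in by_pinyin:
            return by_pinyin[third_pinyin]

    # Stage 3: similarity-based selection, if a fallback is supplied.
    return fallback(first_pinyin, candidates) if fallback else None
```

The control operation associated with the returned description would then be executed by the caller; a return value of `None` indicates that no candidate was selected.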
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A voice control method, characterized in that the method comprises:
    obtaining first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to an acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of candidate description information, and the description information is information describing a corresponding operation;
    obtaining third pinyin content when no second pinyin content successfully matches the first pinyin content, wherein the third pinyin content is pinyin content similar to the first pinyin content;
    matching the third pinyin content against the plurality of second pinyin contents, and taking the description information corresponding to the second pinyin content that successfully matches the third pinyin content as target description information;
    executing the control operation corresponding to the target description information.
  2. The method according to claim 1, characterized in that the obtaining third pinyin content comprises:
    obtaining a similar phoneme corresponding to a specified phoneme in the first pinyin content;
    replacing the specified phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content.
  3. The method according to claim 2, characterized in that there are a plurality of similar phonemes, and the replacing the specified phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content comprises:
    replacing the specified phoneme in the first pinyin content with each of the plurality of similar phonemes respectively, to obtain phoneme-substituted first pinyin content corresponding to each of the plurality of similar phonemes as the third pinyin content.
  4. The method according to claim 2, characterized in that there are a plurality of specified phonemes, and the replacing the specified phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content comprises:
    combining the similar phonemes corresponding to at least two specified phonemes with one another to obtain a plurality of phoneme pairs, wherein each phoneme pair includes one similar phoneme corresponding to each of the at least two specified phonemes;
    replacing the corresponding specified phonemes in the first pinyin content based on each of the plurality of phoneme pairs respectively, to obtain first replacement pinyin content corresponding to each phoneme pair;
    replacing the corresponding specified phoneme in the first pinyin content with the similar phoneme corresponding to each of the plurality of specified phonemes, to obtain second replacement pinyin content corresponding to each specified phoneme;
    taking the first replacement pinyin content and the second replacement pinyin content as the third pinyin content.
  5. The method according to claim 2, characterized in that the obtaining a similar phoneme corresponding to a specified phoneme in the first pinyin content comprises:
    querying whether the phonemes included in the first pinyin content have corresponding phoneme correspondences in a phoneme extension table, wherein each phoneme correspondence represents a pair of similar phonemes;
    taking a phoneme determined to have a phoneme correspondence as the specified phoneme, and determining the similar phoneme corresponding to the specified phoneme based on the phoneme correspondence.
  6. The method according to claim 1, characterized in that the obtaining third pinyin content comprises:
    obtaining a feature of the first pinyin content;
    comparing the feature of the first pinyin content with each of a plurality of pre-acquired reference features, wherein the reference features are features of the pinyin content corresponding to a plurality of pre-acquired words;
    taking the pinyin content corresponding to the reference feature for which the comparison succeeds as the third pinyin content.
  7. The method according to claim 6, characterized in that the reference feature for which the comparison succeeds is identical to the feature of the first pinyin content.
  8. The method according to claim 6, characterized in that the obtaining a feature of the first pinyin content comprises:
    obtaining the feature of the first pinyin content by means of a text vector;
    wherein the reference features are obtained in the same manner as the feature of the first pinyin content.
  9. The method according to any one of claims 1-8, characterized in that the method further comprises:
    when a second pinyin content successfully matches the first pinyin content, taking the description information corresponding to the second pinyin content that successfully matches the first pinyin content as the target description information;
    executing the control operation corresponding to the target description information.
  10. The method according to any one of claims 1-9, characterized in that the matching the third pinyin content against the plurality of second pinyin contents, and taking the description information corresponding to the second pinyin content that successfully matches the third pinyin content as target description information, comprises:
    matching the third pinyin content against the plurality of second pinyin contents, and, when a second pinyin content successfully matches the third pinyin content, taking the description information corresponding to the second pinyin content that successfully matches the third pinyin content as the target description information;
    when no second pinyin content successfully matches the third pinyin content, obtaining the similarity between each of the plurality of second pinyin contents and the first pinyin content, to obtain the similarity corresponding to each second pinyin content;
    taking the description information corresponding to the second pinyin content with the highest similarity as the target description information.
  11. The method according to claim 10, characterized in that the obtaining the similarity between each of the plurality of second pinyin contents and the first pinyin content, to obtain the similarity corresponding to each second pinyin content, comprises:
    obtaining, based on the longest common subsequence, a first reference similarity between each of the plurality of second pinyin contents and the first pinyin content, to obtain the first reference similarity corresponding to each second pinyin content;
    obtaining, based on the edit distance, a second reference similarity between each of the plurality of second pinyin contents and the first pinyin content, to obtain the second reference similarity corresponding to each second pinyin content;
    adding the first reference similarity and the second reference similarity corresponding to each second pinyin content, to obtain the similarity corresponding to each second pinyin content.
  12. The method according to claim 10, characterized in that the taking the description information corresponding to the second pinyin content with the highest similarity as the target description information comprises:
    if there is one second pinyin content with the highest similarity, taking the description information corresponding to that second pinyin content as the target description information;
    if there are a plurality of second pinyin contents with the highest similarity, obtaining the text vector of the text content corresponding to the voice control instruction as a first text vector;
    obtaining the text vectors of the description information corresponding to each of the plurality of second pinyin contents with the highest similarity, to obtain a plurality of second text vectors;
    calculating the vector distance between each of the plurality of second text vectors and the first text vector;
    taking the description information corresponding to the second text vector with the smallest vector distance as the target description information.
  13. The method according to any one of claims 1-12, characterized in that the obtaining a plurality of second pinyin contents comprises:
    obtaining the description information of each of a plurality of controls included in a target interface as candidate description information;
    converting the candidate description information into corresponding pinyin content, to obtain the plurality of second pinyin contents.
  14. 根据权利要求1-13任一所述的方法,其特征在于,所述目标描述信息为目标界面中控件所对应描述信息,所述执行所述目标描述信息对应控制操作,包括:The method according to any one of claims 1-13, wherein the target description information is the description information corresponding to the controls in the target interface, and the execution of the control operation corresponding to the target description information includes:
    获取所述目标描述信息对应控件所属的三元组;Acquiring the triplet to which the control corresponding to the target description information belongs;
    获取所述三元组中的用户意图和对象附属信息;Obtaining user intent and object attachment information in the triplet;
    基于所述用户意图和对象附属信息,以事件注入的方式执行与目标描述信息对应控制操作。Based on the user intention and object attachment information, the control operation corresponding to the target description information is executed in the manner of event injection.
  15. The method according to any one of claims 1-13, wherein the target description information is description information corresponding to a control in a target interface, and executing the control operation corresponding to the target description information comprises:
    acquiring the triplet to which the control corresponding to the target description information belongs;
    acquiring the user intent and the object attachment information in the triplet; and
    executing, based on the user intent and the object attachment information, the control operation corresponding to the target description information by means of a simulated click.
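Claims 14 and 15 share the triplet lookup and differ only in how the operation is dispatched. The sketch below illustrates that split under stated assumptions: the field layout of `Triplet`, the contents of the attachment dictionary, and the two dispatch functions are hypothetical stand-ins, and they only record what a platform event-injection or tap API would be asked to do.

```python
from dataclasses import dataclass


@dataclass
class Triplet:
    # Assumed layout; the claims only say the triplet holds the user intent
    # and the object attachment information for a control.
    description: str   # description information of the control
    intent: str        # user intent, e.g. "click"
    attachment: dict   # object attachment information, e.g. id and coordinates


def inject_event(intent, attachment, log):
    # Stand-in for a platform event-injection API (hypothetical).
    log.append(("inject", intent, attachment["control_id"]))


def simulate_click(intent, attachment, log):
    # Stand-in for a simulated tap at the control's coordinates (hypothetical).
    log.append(("tap", attachment["x"], attachment["y"]))


def execute(target_description, triplets, dispatch, log):
    """Find the triplet for the target description and run its operation."""
    triplet = next(t for t in triplets if t.description == target_description)
    dispatch(triplet.intent, triplet.attachment, log)
    return triplet


triplets = [
    Triplet("play", "click", {"control_id": "btn_play", "x": 120, "y": 640}),
    Triplet("pause", "click", {"control_id": "btn_pause", "x": 240, "y": 640}),
]
log = []
execute("play", triplets, inject_event, log)     # claim 14 style
execute("pause", triplets, simulate_click, log)  # claim 15 style
```

Passing the dispatch function as a parameter keeps the lookup identical for both claims.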
  16. The method according to any one of claims 1-13, wherein the target description information is description information corresponding to an overall interface operation instruction.
  17. A voice control apparatus, wherein the apparatus comprises:
    a first and second pinyin content acquisition unit, configured to acquire first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to an acquired voice control instruction, the plurality of second pinyin contents comprise the pinyin content of candidate description information, and the description information is information describing a corresponding operation;
    a third pinyin content acquisition unit, configured to acquire third pinyin content when the second pinyin contents fail to match the first pinyin content, wherein the third pinyin content is pinyin content similar to the first pinyin content;
    a pinyin content matching unit, configured to match the third pinyin content against the plurality of second pinyin contents, and to take as target description information the description information whose corresponding second pinyin content successfully matches the third pinyin content; and
    a control operation execution unit, configured to execute the control operation corresponding to the target description information.
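The flow across the acquisition and matching units can be sketched as a direct match followed by a fallback through similar pinyin. Everything specific here is an assumption made for illustration: the `CONFUSABLE_INITIALS` table, generating similar pinyin by swapping one syllable's initial, and treating a "match" as exact string equality are not taken from the claims.

```python
# Hypothetical table of easily confused pinyin initials (e.g. flat vs. retroflex).
CONFUSABLE_INITIALS = {"z": "zh", "zh": "z", "c": "ch", "ch": "c",
                       "s": "sh", "sh": "s", "n": "l", "l": "n"}


def split_initial(syllable):
    """Split a syllable into (initial, final), checking two-letter initials first."""
    for initial in ("zh", "ch", "sh"):
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    if syllable[:1] in CONFUSABLE_INITIALS:
        return syllable[:1], syllable[1:]
    return "", syllable


def similar_pinyin(first_pinyin):
    """Yield third pinyin contents: the input with one syllable's initial swapped."""
    syllables = first_pinyin.split()
    for i, syllable in enumerate(syllables):
        initial, final = split_initial(syllable)
        swapped = CONFUSABLE_INITIALS.get(initial)
        if swapped:
            yield " ".join(syllables[:i] + [swapped + final] + syllables[i + 1:])


def resolve_target(first_pinyin, second_pinyins):
    """Return the target description information, or None if nothing matches.

    second_pinyins: {pinyin content: description information}.
    """
    if first_pinyin in second_pinyins:           # direct match succeeds
        return second_pinyins[first_pinyin]
    for third in similar_pinyin(first_pinyin):   # fallback via similar pinyin
        if third in second_pinyins:
            return second_pinyins[third]
    return None


second = {"shou cang": "收藏", "bo fang": "播放"}
target = resolve_target("sou cang", second)  # recognizer heard "sou" for "shou"
```

The returned description information would then be handed to the control operation execution unit.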
  18. An electronic device, comprising one or more processors and a memory, wherein
    one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method according to any one of claims 1-16.
  19. A computer-readable storage medium, wherein program code is stored in the computer-readable storage medium, and the method according to any one of claims 1-16 is performed when the program code is run.
  20. A computer program product, comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-16.
PCT/CN2022/107788 2021-11-03 2022-07-26 Speech control method and apparatus, electronic device, and readable storage medium WO2023077878A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111296079.3 2021-11-03
CN202111296079.3A CN114049890A (en) 2021-11-03 2021-11-03 Voice control method and device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023077878A1 (en) 2023-05-11

Family

ID=80207170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107788 WO2023077878A1 (en) 2021-11-03 2022-07-26 Speech control method and apparatus, electronic device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN114049890A (en)
WO (1) WO2023077878A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049890A (en) * 2021-11-03 2022-02-15 杭州逗酷软件科技有限公司 Voice control method and device and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7302389B2 (en) * 2003-05-14 2007-11-27 Lucent Technologies Inc. Automatic assessment of phonological processes
CN104238991A (en) * 2013-06-21 2014-12-24 腾讯科技(深圳)有限公司 Voice input matching method and voice input matching device
CN109360555A (en) * 2017-12-29 2019-02-19 广州Tcl智能家居科技有限公司 A kind of Internet of Things sound control method, device and storage medium
CN109741741A (en) * 2018-12-29 2019-05-10 深圳Tcl新技术有限公司 Control method, intelligent terminal and the computer readable storage medium of intelligent terminal
CN111554297A (en) * 2020-05-15 2020-08-18 北京百度网讯科技有限公司 Voice recognition method, device, equipment and readable storage medium
CN112114926A (en) * 2020-09-25 2020-12-22 北京百度网讯科技有限公司 Page operation method, device, equipment and medium based on voice recognition
CN112634903A (en) * 2020-12-15 2021-04-09 平安科技(深圳)有限公司 Quality inspection method, device, equipment and storage medium of service voice
CN114049890A (en) * 2021-11-03 2022-02-15 杭州逗酷软件科技有限公司 Voice control method and device and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model
CN117789680B (en) * 2024-02-23 2024-05-24 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Also Published As

Publication number Publication date
CN114049890A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
JP7179273B2 (en) Translation model training methods, phrase translation methods, devices, storage media and computer programs
WO2023077878A1 (en) Speech control method and apparatus, electronic device, and readable storage medium
JP2022547704A (en) Intention recognition technology with reduced training
US11657799B2 (en) Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition
KR102332729B1 (en) Speech recognition method and apparatus, speech recognition engine generation method and apparatus based on pronounce similarity
US9348417B2 (en) Multimodal input system
KR102084646B1 (en) Device for recognizing voice and method for recognizing voice
KR20200007022A (en) Method, terminal, and storage medium for recognizing an image
US20150161997A1 (en) Using context to interpret natural language speech recognition commands
US20150235641A1 (en) Non-audible voice input correction
CN110164421B (en) Voice decoding method, device and storage medium
US20210210112A1 (en) Model Evaluation Method and Device, and Electronic Device
JP2021197133A (en) Meaning matching method, device, electronic apparatus, storage medium, and computer program
US11972761B2 (en) Electronic device for sharing user-specific voice command and method for controlling same
WO2023082703A1 (en) Voice control method and apparatus, electronic device, and readable storage medium
US20240105159A1 (en) Speech processing method and related device
US9460081B1 (en) Transcription correction using multi-token structures
US20220020358A1 (en) Electronic device for processing user utterance and operation method therefor
KR101370539B1 (en) Method and apparatus for dialog processing based on referring expressions processing
US20210233522A1 (en) Voice context-aware content manipulation
US11151995B2 (en) Electronic device for mapping an invoke word to a sequence of inputs for generating a personalized command
WO2023103917A1 (en) Speech control method and apparatus, and electronic device and storage medium
US20210151046A1 (en) Function performance based on input intonation
WO2023093280A1 (en) Speech control method and apparatus, electronic device, and storage medium
TW201506685A (en) Apparatus and method for selecting a control object by voice recognition

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22888913

Country of ref document: EP

Kind code of ref document: A1