WO2023077878A1

WO2023077878A1 - Speech control method and apparatus, electronic device, and readable storage medium

Info

Publication number: WO2023077878A1
Application number: PCT/CN2022/107788
Authority: WO
Inventors: 曾理; 张晓帆
Original assignee: 杭州逗酷软件科技有限公司
Priority date: 2021-11-03
Filing date: 2022-07-26
Publication date: 2023-05-11
Also published as: CN114049890A

Abstract

A speech control method and apparatus, an electronic device, and a readable storage medium. The method comprises: obtaining first Pinyin content, and obtaining a plurality of pieces of second Pinyin content, the first Pinyin content being Pinyin content corresponding to an obtained speech control instruction, the plurality of pieces of second Pinyin content comprising Pinyin content of description information to be selected, and the description information being information for describing a corresponding operation (S110); when the second Pinyin content and the first Pinyin content are not successfully matched, obtaining third Pinyin content, the third Pinyin content being Pinyin content similar to the first Pinyin content (S120); matching the third Pinyin content with the plurality of pieces of second Pinyin content, and using, as target description information, description information that a corresponding piece of second Pinyin content successfully matches the third Pinyin content (S130); and executing a control operation corresponding to the target description information (S140).

Description

Voice control method and apparatus, electronic device, and readable storage medium

Cross References to Related Applications

This application claims priority to Chinese application No. 202111296079.3 filed on November 03, 2021, which is hereby incorporated by reference in its entirety for all purposes.

Technical field

The present application relates to the field of computer technology, and more specifically, to a voice control method and apparatus, an electronic device, and a readable storage medium.

Background

By combining artificial intelligence technology with a virtual personal assistant (voice assistant), an electronic device can receive voice control instructions that a user issues by speech, thereby realizing voice control of the electronic device. However, in related voice control processes, the probability of accurately executing voice control still needs to be improved.

Summary of the invention

In view of the above problem, the present application proposes a voice control method and apparatus, an electronic device, and a readable storage medium to address it.

In a first aspect, the present application provides a voice control method, the method comprising: acquiring first pinyin content and acquiring a plurality of second pinyin contents, the first pinyin content being the pinyin content corresponding to an acquired voice control instruction, the plurality of second pinyin contents including the pinyin content of description information to be selected, and the description information being information used to describe a corresponding operation; when no second pinyin content successfully matches the first pinyin content, acquiring third pinyin content, the third pinyin content being pinyin content similar to the first pinyin content; matching the third pinyin content with the plurality of second pinyin contents, and using the description information whose corresponding second pinyin content successfully matches the third pinyin content as target description information; and executing the control operation corresponding to the target description information.

In a second aspect, the present application provides a voice control apparatus, which includes: a first and second pinyin content acquiring unit, configured to acquire first pinyin content and a plurality of second pinyin contents, the first pinyin content being the pinyin content corresponding to an acquired voice control instruction, the plurality of second pinyin contents including the pinyin content of description information to be selected, and the description information being information used to describe a corresponding operation; a third pinyin content acquiring unit, configured to acquire third pinyin content when no second pinyin content successfully matches the first pinyin content, the third pinyin content being pinyin content similar to the first pinyin content; a pinyin content matching unit, configured to match the third pinyin content with the plurality of second pinyin contents and to use the description information whose corresponding second pinyin content successfully matches the third pinyin content as target description information; and a control operation execution unit, configured to execute the control operation corresponding to the target description information.

In a third aspect, the present application provides an electronic device, including one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to perform the method described above.

In a fourth aspect, the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, wherein the above method is executed when the program code is running.

In a fifth aspect, the present application provides a computer program product, including a computer program/instruction, which implements the steps of the above method when the computer program/instruction is executed by a processor.

Description of the drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that need to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application. For those skilled in the art, other drawings can also be obtained based on these drawings without any creative effort.

FIG. 1 shows a schematic diagram of an application scenario of a voice control method proposed in an embodiment of the present application;

FIG. 2 shows a schematic diagram of an application scenario of another voice control method proposed in the embodiment of the present application;

FIG. 3 shows a flow chart of a voice control method proposed in an embodiment of the present application;

FIG. 4 shows a flowchart of an embodiment of S120 in FIG. 3 of the present application;

FIG. 5 shows a flow chart of a voice control method proposed in another embodiment of the present application;

FIG. 6 shows a flowchart of an embodiment of S230 in FIG. 5 of the present application;

FIG. 7 shows a schematic diagram of obtaining the first replacement pinyin content corresponding to each phoneme pair proposed by the present application;

FIG. 8 shows a schematic diagram of obtaining the second replacement pinyin content corresponding to each specified phoneme proposed by the present application;

FIG. 9 shows a flow chart of a voice control method proposed in another embodiment of the present application;

FIG. 10 shows a flowchart of an embodiment of S340 in FIG. 9 of the present application;

FIG. 11 shows a flowchart of an embodiment of S350 in FIG. 9 of the present application;

FIG. 12 shows a schematic diagram of the implementation process of a voice control method proposed in this application;

FIG. 13 shows a structural block diagram of a voice control apparatus proposed in the embodiment of the present application;

FIG. 14 shows a structural block diagram of an electronic device proposed by the present application;

FIG. 15 shows a storage unit for storing or carrying program code that implements the voice control method according to the embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of this application.

By combining artificial intelligence technology with a virtual personal assistant (voice assistant), an electronic device can receive voice control instructions that a user issues by speech, convert them into text information through Automatic Speech Recognition (ASR) technology, and then perform subsequent understanding and mapping, thereby realizing voice control of the electronic device.

However, the inventors found that in related voice control processes, owing to the combined influence of factors such as regional accents, language habits, and noise interference while the instruction is being given, the probability of accurately executing voice control still needs to be improved. For example, the user's voice control instruction may be recognized as a string of similar-sounding characters: "swipe up" (pinyin "shang hua") may be recognized as "shao hua" or as the word for "joke" (pinyin "xiao hua").

Therefore, the inventors propose in the present application a voice control method and apparatus, an electronic device, and a computer program product. In the method, the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content, and the pinyin content of the description information to be selected is obtained as a plurality of second pinyin contents. If it is determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content. The third pinyin content is then matched with the plurality of second pinyin contents, the description information whose corresponding second pinyin content successfully matches the third pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed.

Thus, with the above method, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted pinyin content cannot be successfully matched with the pinyin content of the description information to be selected, matching can instead be performed using the similar pinyin content corresponding to the directly converted pinyin content. This raises the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps to improve the probability of accurately executing voice control.

The application scenarios involved in the embodiments of the present application are firstly introduced below.

In the embodiments of the present application, the provided voice control method may be executed by an electronic device. In this mode, all the steps of the voice control method provided in the embodiments of the present application may be executed by the electronic device. For example, as shown in FIG. 1, the voice collection apparatus of the electronic device 100 can collect a voice control instruction and transmit the collected voice control instruction, together with the description information to be selected, to the processor, so that the processor can obtain the first pinyin content and the plurality of second pinyin contents. The processor then uses the first pinyin content, the plurality of second pinyin contents, and the pinyin content similar to the first pinyin content (the third pinyin content) to execute the steps of the voice control method provided by the present application.

Furthermore, the voice control method provided in the embodiments of the present application may also be executed by a server. In this mode, the electronic device collects the voice control instruction and synchronously sends it to the server; the server then executes the voice control method provided by the embodiments of the present application to determine the target description information, and generates an operation instruction according to the target description information. In addition, the method can also be executed cooperatively by the electronic device and the server. In this cooperative mode, some steps of the voice control method provided by the embodiments of the present application are executed by the electronic device, while the remaining steps are executed by the server.

Exemplarily, as shown in FIG. 2, the electronic device 100 may execute the step of acquiring the first pinyin content and the plurality of second pinyin contents, and the server 200 then performs the subsequent steps. It should be noted that, in this cooperative mode, the steps performed by the electronic device and by the server are not limited to the division described in the above example. In practical applications, the steps performed by the electronic device and by the server can be dynamically adjusted according to the actual situation.

The embodiments involved in this application will be described below with reference to the accompanying drawings.

Referring to FIG. 3, the present application provides a voice control method, the method including:

S110: Obtain first pinyin content and a plurality of second pinyin contents, where the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of the description information to be selected, and the description information is information used to describe a corresponding operation.

In the embodiments of the present application, the user can express a control intention by voice. Correspondingly, the electronic device may use the voice uttered by the user as a voice control instruction. Optionally, the instruction here refers to an instruction by which the user manipulates the interactive interface or the elements on the interactive interface. The voice control instruction may include: "swipe left", "open Jinri Toutiao", "bilibili", "play XXX", "the third item in the second row", "return", "swipe up", "install Douyin", "next song", "desktop", and so on.

In one approach, the first pinyin content can be obtained through Automatic Speech Recognition (ASR) technology and Natural Language Processing (NLP) technology.

Optionally, after the electronic device acquires the voice control instruction, it can pass the instruction to the ASR module to obtain the instruction text corresponding to the voice control instruction, and then use the pinyin content corresponding to the instruction text as the first pinyin content. Optionally, after the instruction text is obtained, the NLP module can also be used to extract the user intent, the control object, and the object's auxiliary information from the instruction text and to integrate them into a triple of the form {action, object, information}, where action represents the user intent, object represents the control object, and information represents the object's auxiliary information.

In the triple, the user intent refers to the action the user wants to perform, such as click, swipe, or long press. The auxiliary information refers to information that may accompany the control object. For example, in a text input operation, the text box is the control object and the text to be filled in is the auxiliary information. It should be noted that the control object and the auxiliary information are not mandatory. When the instruction text is converted into a triple, after the triple corresponding to the voice control instruction is obtained, the pinyin corresponding to the control object in the triple can be used as the first pinyin content; if the control object of the triple is empty, the pinyin corresponding to the user intent can be used as the first pinyin content. Exemplarily, the user's voice control instruction can be "open Jinri Toutiao", and the triple obtained through the ASR module and the NLP module is {click, Jinri Toutiao, Φ}, where the user intent is "click", the control object is "Jinri Toutiao", and the object's auxiliary information is empty; the first pinyin content is then "jin ri tou tiao". As another example, the user instruction can be "swipe up", and the triple obtained through the ASR module and the NLP module is {swipe up, Φ, Φ}, where the user intent is "swipe up" and both the control object and the object's auxiliary information are empty; the first pinyin content is then "shang hua".
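The selection rule described above can be sketched in Python as follows. The `Triple` class, its field names, and the `first_pinyin_source` helper are illustrative assumptions rather than part of the application; the sketch only decides which text would subsequently be converted to pinyin.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Triple:
    """Hypothetical container for the {action, object, information} triple."""
    action: str                        # user intent, e.g. "click"
    obj: Optional[str] = None          # control object; None stands for the empty value
    information: Optional[str] = None  # auxiliary information; None stands for empty


def first_pinyin_source(triple: Triple) -> str:
    """Return the text whose pinyin serves as the first pinyin content."""
    # Prefer the control object; fall back to the user intent when it is empty.
    return triple.obj if triple.obj else triple.action
```

For the two examples in the text, `Triple("click", "今日头条")` selects the control object, while `Triple("上滑")` falls back to the user intent.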

Furthermore, in the embodiments of the present application, the description information to be selected may be the set of description information of the operations that the electronic device can perform at the time the voice control instruction is acquired. These operations may be operations performed on the electronic device as a whole, for example shutting down, switching the operation mode, or taking a picture. They may also include operations performed on a target interface, which may be the interface currently displayed by the electronic device. When the operation is performed on the target interface, the description information to be selected may include the respective description information of multiple controls in the target interface, for example "Fenghuo Kangda", "Olympic Highlights", and "The Solitary Gourmet, Season 8". The description information to be selected can also include the description information corresponding to all overall interface operation instructions, such as swipe left, swipe right, swipe up, swipe down, return, desktop, double-click, and long press.

In one approach, the second pinyin content may be acquired by acquiring the pinyin content corresponding to all the description information to be selected. Optionally, the description information of the multiple controls included in the target interface may be acquired as the description information to be selected and then converted into the corresponding pinyin content to obtain the plurality of second pinyin contents. Optionally, the description information corresponding to all overall interface operation instructions may also be acquired as the description information to be selected and then converted into the corresponding pinyin content to obtain the plurality of second pinyin contents. Furthermore, the plurality of second pinyin contents may include both the pinyin content of the description information corresponding to the overall interface operation instructions and the pinyin content of the respective description information of the multiple controls included in the target interface.

In the embodiments of the present application, the description information of the multiple controls included in the target interface can be obtained through a system program. In this approach, the electronic device can use the system program to parse the code corresponding to the target interface and obtain information such as the type, position, and size of each control as the description information of that control.

It should be noted that there are many ways to obtain the pinyin corresponding to text, for example the Python libraries pypinyin and xpinyin or the Java library pinyin4j. The method used to convert text to pinyin can be chosen according to the actual development environment.
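As one possible illustration of the text-to-pinyin conversion, the sketch below uses `lazy_pinyin` from pypinyin, which returns toneless syllables. The `to_pinyin` helper name and the small fallback table are assumptions added only so that the example also runs where pypinyin is not installed; the application does not prescribe a particular library call.

```python
def to_pinyin(text: str) -> str:
    """Convert Chinese text to space-separated, toneless pinyin."""
    try:
        # lazy_pinyin returns a list with one toneless syllable per character.
        from pypinyin import lazy_pinyin
    except ImportError:
        # Hypothetical fallback so the sketch runs without pypinyin; it only
        # covers the two characters used in the example call below.
        fallback = {"上": "shang", "滑": "hua"}
        return " ".join(fallback[ch] for ch in text)
    return " ".join(lazy_pinyin(text))


print(to_pinyin("上滑"))
```

The same helper could be applied to each piece of description information to build the plurality of second pinyin contents.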

S120: Obtain a third pinyin content when the second pinyin content fails to match the first pinyin content, where the third pinyin content is pinyin content similar to the first pinyin content.

After the first pinyin content and the second pinyin contents are obtained, it may be detected whether any of the plurality of second pinyin contents successfully matches the first pinyin content. Optionally, a second pinyin content is determined to match the first pinyin content successfully only if the two are completely identical. For example, if the first pinyin content is "shao hua" and the second pinyin content currently being compared with it is "shang hua", then because the "ao" in the first pinyin content differs from the "ang" in the second pinyin content, it is determined that the first pinyin content "shao hua" does not match the second pinyin content "shang hua".

In the embodiments of the present application, as shown in FIG. 4, one way of obtaining the third pinyin content includes:
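A minimal sketch of this exact-match check is given below. Storing the second pinyin contents in a dictionary keyed by pinyin, and the names `exact_match` and `candidates`, are assumptions made for illustration.

```python
from typing import Dict, Optional


def exact_match(first_pinyin: str, second_pinyin: Dict[str, str]) -> Optional[str]:
    """Return the description whose pinyin is identical to first_pinyin, else None."""
    # Collapse repeated whitespace so that only the syllables are compared.
    key = " ".join(first_pinyin.split())
    return second_pinyin.get(key)


# Hypothetical candidates: second pinyin content -> description information.
candidates = {"shang hua": "swipe up", "xia hua": "swipe down"}
```

With these candidates, "shao hua" yields no match, which is the situation in which the third pinyin content is obtained.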

S121: Obtain a similar phoneme corresponding to a specified phoneme in the first pinyin content.

In one approach, the phoneme expansion table is queried to determine whether a phoneme correspondence exists for each phoneme included in the first pinyin content, where each phoneme correspondence represents a pair of similar phonemes. A phoneme for which a phoneme correspondence exists is used as a specified phoneme, and the similar phoneme corresponding to the specified phoneme is determined based on that phoneme correspondence.

A phoneme (phone) is the smallest unit of speech divided according to the natural properties of speech; one articulatory action forms one phoneme. In Chinese, phonemes can be divided into initials and finals. Under the notation rules of Chinese pinyin, the final i and the compound finals beginning with i (such as i, ia, ie, iao, iou, ian, in, iang, iong) are written with y when no initial precedes them, for example yi. Finals beginning with ü are written yu, yue, yuan, yun when no initial precedes them, and ju, qu, xu when they follow the initials j, q, x, with the two dots omitted; only when the initial is n or l is ü written out, as in nü, lü. Therefore, in some cases, u can be used instead of ü.

Moreover, due to the influence of user accents and language habits in different regions, the user may confuse some similar phonemes, resulting in inaccurate recognition of the user's voice control commands. Therefore, a phoneme expansion table as shown in Table 1 can be formed by combining the notation rules of Chinese Pinyin and common mistakes in Chinese pronunciation.

Table 1

Exemplarily, when the user's voice control instruction "swipe up" is recognized as the similar-sounding word "邵华" (Shao Hua) and "shao hua" is used as the first pinyin content, the phonemes included in the first pinyin content are sh, ao, h, and ua. The following phoneme correspondences can therefore be obtained from the phoneme expansion table: [sh, s], [sh, c], [sh, xi], [sh, zh], [ao, ou], [ao, iao], [ao, ang], [h, f]. Accordingly, sh, ao, and h can be used as the specified phonemes, and based on the above phoneme correspondences, the similar phonemes corresponding to the specified phonemes can be determined as s, c, xi, zh, ou, iao, ang, and f.
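The lookup in S121 might be sketched as follows. `PHONEME_TABLE` contains only the correspondences quoted in the example above, not the full Table 1, and the splitting of a syllable into an initial and a final by prefix matching is a simplifying assumption; all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# The 21 initials of Hanyu Pinyin; two-letter initials are listed first so
# that "sh" is not mistaken for "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

# Partial phoneme expansion table (assumption: only the pairs from the example).
PHONEME_TABLE: Dict[str, List[str]] = {
    "sh": ["s", "c", "xi", "zh"],
    "ao": ["ou", "iao", "ang"],
    "h": ["f"],
}


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split one pinyin syllable into (initial, final); initial may be empty."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def similar_phonemes(pinyin: str) -> Dict[str, List[str]]:
    """Map each specified phoneme in the pinyin content to its similar phonemes."""
    result: Dict[str, List[str]] = {}
    for syllable in pinyin.split():
        for phoneme in split_syllable(syllable):
            if phoneme in PHONEME_TABLE:
                result[phoneme] = PHONEME_TABLE[phoneme]
    return result
```

For "shao hua", the phonemes sh, ao, and h are found in the table and become the specified phonemes, while ua has no correspondence and is left unchanged.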

S122: Replace the specified phoneme in the first pinyin content with the similar phoneme to obtain a third pinyin content.

In another approach, pinyin content similar to the first pinyin content as a whole can be obtained as the third pinyin content. In this approach, features of the pinyin content corresponding to multiple words are obtained in advance as reference features. After the first pinyin content is obtained, its features are obtained in the same way and compared with the pre-acquired reference features, and the pinyin content corresponding to a successfully compared reference feature is used as the third pinyin content. A reference feature is compared successfully when it is the same as the feature of the first pinyin content. Existing methods of acquiring data features can be applied to acquire the features of pinyin content, and the specific way of acquiring these features is not limited in the embodiments of the present application. For example, the features of pinyin content can be obtained by means of text vectors.

S130: Match the third pinyin content with the plurality of second pinyin contents, and use the description information whose corresponding second pinyin content successfully matches the third pinyin content as the target description information.

Exemplarily, the third pinyin content can be {"sao hua", "cao hua", "xiao hua", "zhao hua", "shou hua", "shiao hua", "shang hua", "shao fua"}, and the second pinyin content can be {"feng huo kang da" (Fenghuo Kangda), "ao yun ji jin" (Olympic Highlights), "gu du de mei shi jia di ba ji" (The Solitary Gourmet, Season 8), ..., "zuo hua", "you hua", "shang hua", "xia hua", "fan hui", "zhuo mian", "shuang ji", "chang an"}. Matching the above third pinyin content with the second pinyin content yields the target description information "shang hua".
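The matching in S130 can be sketched as a membership test of each third pinyin content against the second pinyin contents. The function name `match_third` and the English description strings are illustrative assumptions; returning the first hit is also an assumption, since the text does not say how multiple hits would be ranked.

```python
from typing import Dict, Iterable, Optional


def match_third(third: Iterable[str], second: Dict[str, str]) -> Optional[str]:
    """Return the description of the first second pinyin content hit by a candidate."""
    for candidate in third:
        if candidate in second:
            return second[candidate]
    return None


third_pinyin = ["sao hua", "cao hua", "xiao hua", "zhao hua",
                "shou hua", "shiao hua", "shang hua", "shao fua"]
# Second pinyin content -> description information (abridged from the example).
second_pinyin = {
    "ao yun ji jin": "Olympic Highlights",
    "zuo hua": "swipe left",
    "you hua": "swipe right",
    "shang hua": "swipe up",
    "xia hua": "swipe down",
    "fan hui": "return",
}
target = match_third(third_pinyin, second_pinyin)
```

Here only "shang hua" appears in both collections, so the target description information is the one for swiping up.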

S140: Execute a control operation corresponding to the target description information.

In one approach, the target description information can be the description information corresponding to a control in the target interface. In that case, the control operation corresponding to the target description information can be executed on the electronic device by event injection or simulated click, in combination with the user intent and the auxiliary information in the triple corresponding to the control to which the target description information belongs. For example, if the target description information is "sou suo kuang" (search box), the user intent and auxiliary information in the triple {search, search box, Happy Feet} can be combined to inject an event that enters "Happy Feet" in the search box, thereby executing on the electronic device the control operation corresponding to the target description information "sou suo kuang". As another example, if the target description information is "ao yun ji jin", the user intent in the triple {click, Olympic Highlights, Φ} can be combined, and the control operation corresponding to the target description information "ao yun ji jin" is executed on the electronic device by clicking the Olympic Highlights control.

As another manner, the target description information may be description information corresponding to an overall interface operation instruction. For example: if the target description information is "shang hua", the operation of swiping up can be directly performed on the electronic device.
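The two cases of S140 could be dispatched as in the sketch below. Real event injection or simulated clicks are platform specific, so the handlers here merely return a textual description of the operation; `INTERFACE_OPERATIONS` and `execute` are hypothetical names.

```python
from typing import Callable, Dict, Optional

# Hypothetical handlers for whole-interface operations. A real system would
# inject a gesture or key event here; the sketch only returns a description.
INTERFACE_OPERATIONS: Dict[str, Callable[[], str]] = {
    "shang hua": lambda: "swipe up",
    "fan hui": lambda: "return",
}


def execute(target: str, action: str, information: Optional[str] = None) -> str:
    """Run the operation for the matched target description (in pinyin form)."""
    if target in INTERFACE_OPERATIONS:
        # Whole-interface instruction: no control object is involved.
        return INTERFACE_OPERATIONS[target]()
    if information is not None:
        # Control with auxiliary information, e.g. text typed into a search box.
        return f"{action} '{target}' with '{information}'"
    # Control without auxiliary information, e.g. a simulated click.
    return f"{action} '{target}'"
```

The `action` and `information` arguments correspond to the user intent and auxiliary information of the triple, so `execute("sou suo kuang", "search", "Happy Feet")` covers the search-box example.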

In the voice control method provided by this embodiment, the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content, and the pinyin content of the description information to be selected is obtained as a plurality of second pinyin contents. If it is determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content. The third pinyin content is then matched with the plurality of second pinyin contents, the description information whose corresponding second pinyin content successfully matches the third pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed.

Thus, with the above method, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted pinyin content cannot be successfully matched with the pinyin content of the description information to be selected, matching can instead be performed using the similar pinyin content corresponding to the directly converted pinyin content. This raises the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps to improve the probability of accurately executing voice control. Moreover, this embodiment draws on the concept of phonemes from linguistics and acoustics: an expansion table of commonly confused initials and finals is built from common pronunciation errors in Mandarin Chinese, pinyin that cannot be matched exactly is fuzzily expanded, and matching is then performed again. This addresses homophone errors in the speech recognition process and can also effectively address speech recognition errors caused by a user's non-standard pronunciation.

Referring to FIG. 5, the present application provides a voice control method, the method including:

S210: Obtain first pinyin content and a plurality of second pinyin contents, where the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of the description information to be selected, and the description information is information used to describe a corresponding operation.

S220: Obtain a similar phoneme corresponding to a specified phoneme in the first pinyin content.

S230: When the second pinyin content fails to match the first pinyin content, replace the specified phoneme in the first pinyin content with the similar phoneme to obtain a third pinyin content.

In the embodiments of the present application, there may be multiple specified phonemes. In one approach, each specified phoneme in the first pinyin content can be replaced in turn with each of its similar phonemes, and the phoneme-replaced first pinyin contents corresponding to the multiple similar phonemes are used as the third pinyin content.

Exemplarily, the first pinyin content can be "shao hua". It can be seen from Table 1 that the specified phonemes of the first pinyin content "shao hua" are sh, ao, and h, where the similar phonemes corresponding to sh are s, c, xi, and zh, the similar phonemes corresponding to ao are ou, iao, and ang, and the similar phoneme corresponding to h is f. Replacing the specified phonemes in the first pinyin content with the multiple similar phonemes one at a time yields the third pinyin content {"sao hua", "cao hua", "xiao hua", "zhao hua", "shou hua", "shiao hua", "shang hua", "shao fua"}.

In another approach, as shown in FIG. 6, replacing the specified phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content includes:
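The one-at-a-time replacement can be sketched as follows. The first pinyin content is assumed to be already split into syllables of [initial, final] phonemes, and `PHONEME_TABLE` again holds only the correspondences quoted in the example rather than the full Table 1.

```python
from typing import Dict, List

# Partial phoneme expansion table (assumption: only the pairs from the example).
PHONEME_TABLE: Dict[str, List[str]] = {
    "sh": ["s", "c", "xi", "zh"],
    "ao": ["ou", "iao", "ang"],
    "h": ["f"],
}


def single_replacements(syllables: List[List[str]]) -> List[str]:
    """Replace one specified phoneme at a time with each of its similar phonemes."""
    results = []
    for i, phonemes in enumerate(syllables):
        for j, phoneme in enumerate(phonemes):
            for similar in PHONEME_TABLE.get(phoneme, []):
                # Work on a fresh copy so every candidate differs from the
                # original in exactly one phoneme.
                replaced = [list(s) for s in syllables]
                replaced[i][j] = similar
                results.append(" ".join("".join(s) for s in replaced))
    return results


# "shao hua" split into phonemes: sh + ao, h + ua.
third_pinyin = single_replacements([["sh", "ao"], ["h", "ua"]])
```

Operating on pre-split phonemes avoids accidentally rewriting the letter h inside the initial sh when h is replaced by f.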

S231: Combine similar phonemes corresponding to at least two specified phonemes with each other to obtain multiple phoneme pairs, where each phoneme pair includes a similar phoneme corresponding to each of the at least two specified phonemes.

Wherein, in the embodiment of the present application, similar phonemes corresponding to at least two specified phonemes may be combined with each other according to the combination manner shown in FIG. 7 . Referring to Fig. 5, designated phoneme A corresponds to similar phonemes O, P, Q, and designated phoneme B corresponds to similar phonemes R, S, T, and the first pinyin content is ABC, then each similar phoneme of designated phoneme A can be combined with All similar phonemes of the specified phoneme B are combined one by one to obtain the following phoneme pairs: OR, OS, OT, PR, PS, PT, QR, QS, QT. Exemplarily, the first pinyin content can be "shao hua", and the first pinyin content "shao hua" can be selected to be combined with similar phonemes corresponding to sh and ao in the specified phonemes in the combination manner shown in Figure 5, The following phoneme pairs are obtained: sou, siao, sang, cou, ciao, cang, xiou, ..., zhang.

S232: Replace the corresponding specified phonemes in the first pinyin content based on each of the plurality of phoneme pairs, to obtain the first replacement pinyin content corresponding to each phoneme pair.

In the embodiment of the present application, as shown in FIG. 7, after the plurality of phoneme pairs (OR, OS, OT, PR, PS, PT, QR, QS, QT) are obtained, the corresponding specified phonemes in the first pinyin content ABC are replaced based on each of these phoneme pairs, and the first replacement pinyin content obtained is: ORC, OSC, OTC, PRC, PSC, PTC, QRC, QSC, QTC. Exemplarily, if the phoneme pair is sou, the corresponding first replacement pinyin content is "sou hua", and if the phoneme pair is cang, the corresponding first replacement pinyin content is "cang hua".

S233: Replace the corresponding specified phonemes in the first pinyin content with the similar phonemes corresponding to the plurality of specified phonemes, to obtain the second replacement pinyin content corresponding to each specified phoneme.

In the embodiment of the present application, the specified phonemes in the first pinyin content may be replaced in the manner shown in FIG. 8 to obtain the second replacement pinyin content corresponding to each specified phoneme. Referring to FIG. 8, if the specified phoneme A corresponds to similar phonemes O, P and Q, the specified phoneme B corresponds to similar phonemes R, S and T, and the first pinyin content is ABC, then the similar phonemes O, P and Q can be used one by one to replace the specified phoneme A to obtain the second replacement pinyin content OBC, PBC and QBC corresponding to the specified phoneme A, and the similar phonemes R, S and T can then be used to replace the specified phoneme B to obtain the second replacement pinyin content ARC, ASC and ATC corresponding to the specified phoneme B. Exemplarily, as can be seen from Table 1, the specified phonemes of the first pinyin content "shao hua" are sh, ao and h, so the second replacement pinyin content corresponding to sh is {"sao hua", "cao hua", "xiao hua", "zhao hua"}, the second replacement pinyin content corresponding to ao is {"shou hua", "shiao hua", "shang hua"}, and the second replacement pinyin content corresponding to h is {"shao fua"}.

S234: Use the first replacement pinyin content and the second replacement pinyin content as the third pinyin content.

Compared with the first way of obtaining the third pinyin content, using both the first replacement pinyin content and the second replacement pinyin content as the third pinyin content further expands the set of pinyin content similar to the first pinyin content, so that the range of content matched against the second pinyin content is further expanded, thereby increasing the probability of successful matching.
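Steps S231 to S234 can be illustrated with the abstract ABC example. The sketch below is an assumption-laden illustration rather than the implementation of this application: the function names are hypothetical, the two specified phonemes to be paired are passed in by position, and the third pinyin content is returned as a simple concatenation of the two replacement lists.

```python
from itertools import product


def first_replacements(phonemes, similar, i, j):
    """S231/S232: pair the similar phonemes of positions i and j, then
    substitute each pair into the first pinyin content."""
    out = []
    for a, b in product(similar[phonemes[i]], similar[phonemes[j]]):
        candidate = list(phonemes)
        candidate[i], candidate[j] = a, b
        out.append("".join(candidate))
    return out


def second_replacements(phonemes, similar):
    """S233: substitute one specified phoneme at a time."""
    out = []
    for i, phoneme in enumerate(phonemes):
        for alt in similar.get(phoneme, []):
            candidate = list(phonemes)
            candidate[i] = alt
            out.append("".join(candidate))
    return out


def third_pinyin_content(phonemes, similar, i, j):
    """S234: concatenate both replacement lists."""
    return first_replacements(phonemes, similar, i, j) + second_replacements(phonemes, similar)


# Abstract example from the text: A -> O, P, Q and B -> R, S, T.
similar = {"A": ["O", "P", "Q"], "B": ["R", "S", "T"]}
candidates = third_pinyin_content(["A", "B", "C"], similar, 0, 1)
```

For ABC this gives nine pair-based candidates (ORC through QTC) followed by six single-replacement candidates (OBC, PBC, QBC, ARC, ASC, ATC).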

S240: Match the third pinyin content with the multiple second pinyin content, and use the description information of the corresponding second pinyin content successfully matched with the third pinyin content as target description information.

S250: Execute a control operation corresponding to the target description information.

In the voice control method provided by this embodiment, after the pinyin content directly converted from the voice control instruction is acquired, if the directly converted pinyin content cannot be successfully matched with the pinyin content of the description information to be selected, the corresponding similar pinyin content can be obtained based on the directly converted pinyin content and matched with the pinyin content of the description information to be selected. This improves the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps to improve the probability of accurately executing the voice control. Moreover, in this embodiment, the similar phonemes of the specified phonemes can be obtained by querying the phoneme extension table, and the third pinyin content can be obtained by replacing multiple specified phonemes with similar phonemes in various ways. Because the third pinyin content is a similarity-based expansion of the first pinyin content, the matching range is enlarged, the probability of successful matching is improved, and the probability of accurately executing voice control is further increased.

Please refer to FIG. 9, a voice control method provided by the present application is applied to electronic equipment, and the method includes:

S310: Obtain first pinyin content and multiple second pinyin content, the first pinyin content is the pinyin content corresponding to the acquired voice control command, and the multiple second pinyin content includes the pinyin of the descriptive information to be selected content, the description information is information used to describe the corresponding operation.

S320: Obtain a third pinyin content when the second pinyin content fails to match the first pinyin content, where the third pinyin content is pinyin content similar to the first pinyin content.

S330: Match the third pinyin content with the multiple second pinyin content, and when a second pinyin content is successfully matched with the third pinyin content, use the description information whose corresponding second pinyin content successfully matches the third pinyin content as the target description information.

S340: When the second pinyin content does not match the third pinyin content successfully, obtain the similarities between multiple second pinyin content and the first pinyin content respectively, so as to obtain the similarity corresponding to each second pinyin content.

Wherein, as shown in FIG. 10 , obtaining the similarities between a plurality of second pinyin content and the first pinyin content respectively, so as to obtain the corresponding similarity of each second pinyin content may include:

S341: Acquire first reference similarities between multiple second pinyin contents and the first pinyin contents based on the longest common subsequence, so as to obtain a first reference similarity corresponding to each second pinyin content.

In the embodiment of the present application, the first reference similarity between each of the multiple second pinyin content and the first pinyin content can be measured by the longest common subsequence (Longest Common Subsequence, LCS), and the calculation formula of LCS can be:

$$
LCS(A_i, B_j) =
\begin{cases}
\varnothing, & i = 0 \text{ or } j = 0 \\
LCS(A_{i-1}, B_{j-1}) + a_i, & a_i = b_j \\
\max\{LCS(A_i, B_{j-1}),\ LCS(A_{i-1}, B_j)\}, & a_i \neq b_j
\end{cases}
$$

Here, $A_i$ represents the string composed of the first i characters of string A, and i ranges from 0 to the length of string A. Similarly, $B_j$ represents the string composed of the first j characters of string B, and j ranges from 0 to the length of string B. $a_i$ and $b_j$ represent the i-th character of A and the j-th character of B, respectively. For example, string A can represent the first pinyin content and string B can represent one second pinyin content. If the length of the first pinyin content is 10 and the length of the second pinyin content is 9, then i ranges from 0 to 10 and j ranges from 0 to 9. If $a_{10} = b_9$, then $LCS(A_{10}, B_9) = LCS(A_9, B_8) + a_{10}$; otherwise $LCS(A_{10}, B_9) = \max\{LCS(A_{10}, B_8),\ LCS(A_9, B_9)\}$.
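The recurrence described above can be evaluated with a standard dynamic-programming table. The following is a minimal sketch, assuming both pinyin contents are plain Python strings; the function name `lcs` is hypothetical, and ties between equally long subsequences are broken arbitrarily.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    # dp[i][j] holds an LCS of the first i characters of a and the first j of b.
    dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Matching characters extend the LCS of both shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + a[i - 1]
            else:
                # Otherwise keep the longer of the two neighbouring results.
                dp[i][j] = max(dp[i][j - 1], dp[i - 1][j], key=len)
    return dp[len(a)][len(b)]
```

Storing strings in the table keeps the code close to the recurrence; an implementation that only needs the length could store integers instead.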

LCS similarity can be defined as:

Among them, |A| and |B| can represent the lengths of strings A and B respectively, that is, the number of all characters in A and B. Exemplarily, the character string A may be "APPLE13", then |A|=7.

S342: Obtain second reference similarities between the plurality of second pinyin contents and the first pinyin contents based on edit distance, so as to obtain a second reference similarity corresponding to each second pinyin content.

In the embodiment of the present application, the degree of difference between each of the multiple second pinyin content and the first pinyin content can be measured by the edit distance (Levenshtein Distance, LEV), and the second reference similarity between each of the multiple second pinyin content and the first pinyin content can then be measured based on this edit distance.

The calculation formula of LEV can be:

$$
LEV(A_i, B_j) =
\begin{cases}
\max(i, j), & \min(i, j) = 0 \\
\min\{LEV(A_{i-1}, B_j) + 1,\ LEV(A_i, B_{j-1}) + 1,\ LEV(A_{i-1}, B_{j-1})\}, & a_i = b_j \\
\min\{LEV(A_{i-1}, B_j) + 1,\ LEV(A_i, B_{j-1}) + 1,\ LEV(A_{i-1}, B_{j-1}) + 1\}, & a_i \neq b_j
\end{cases}
$$

Here, $A_i$ represents the string composed of the first i characters of string A, and i ranges from 0 to the length of string A. Similarly, $B_j$ represents the string composed of the first j characters of string B, and j ranges from 0 to the length of string B. For example, string A can represent the first pinyin content and string B can represent one second pinyin content. If the length of the first pinyin content is 10 and the length of the second pinyin content is 9, then i ranges from 0 to 10 and j ranges from 0 to 9. If $a_{10} = b_9$, then $LEV(A_{10}, B_9) = \min\{LEV(A_9, B_9) + 1,\ LEV(A_{10}, B_8) + 1,\ LEV(A_9, B_8)\}$; otherwise $LEV(A_{10}, B_9) = \min\{LEV(A_9, B_9) + 1,\ LEV(A_{10}, B_8) + 1,\ LEV(A_9, B_8) + 1\}$.
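The edit distance can be computed row by row with the usual dynamic-programming scheme. The sketch below assumes unit cost for insertion, deletion and substitution, which is the standard Levenshtein definition; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit edit costs."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]
```

For instance, "shaohua" and "xiaohua" differ in their first two characters only, so the function returns 2 for that pair.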

S343: Add the first reference similarity corresponding to each second pinyin content and the second reference similarity to obtain the similarity corresponding to each second pinyin content.

As one way, the similarity corresponding to each second pinyin content can be obtained by directly adding the first reference similarity and the second reference similarity, and the calculation formula is as follows:

$$S(A,B) = S_{LCS}(A,B) + S_{LEV}(A,B)$$

As another way, weights can be assigned to the first reference similarity and the second reference similarity respectively, and the weighted first reference similarity and second reference similarity can be added to obtain the similarity corresponding to each second pinyin content, and the calculation formula is as follows:

$$S(A,B) = X \times S_{LCS}(A,B) + Y \times S_{LEV}(A,B)$$

where X + Y = 1.
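The weighted combination can be sketched as below. The normalizations used for the two reference similarities are assumptions made only for this illustration, since their definitions are not reproduced in this text: `s_lcs` is taken as twice the LCS length divided by the total length of both strings, and `s_lev` as one minus the edit distance divided by the longer length. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def s_lcs(a, b):
    # Assumed normalization: share of all characters covered by the LCS.
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 1.0


def s_lev(a, b):
    # Assumed normalization: 1 minus edit distance over the longer length.
    longest = max(len(a), len(b))
    return 1 - edit_distance(a, b) / longest if longest else 1.0


def similarity(a, b, x=0.5, y=0.5):
    """S(A,B) = X * S_LCS(A,B) + Y * S_LEV(A,B), with X + Y = 1."""
    if abs(x + y - 1.0) > 1e-9:
        raise ValueError("weights must satisfy X + Y = 1")
    return x * s_lcs(a, b) + y * s_lev(a, b)
```

Under these assumed normalizations both terms lie between 0 and 1, so the weighted sum also stays in that range, which makes scores for different second pinyin content directly comparable.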

S350: Use the description information corresponding to the second pinyin content with the highest similarity as the target description information.

Wherein, as shown in FIG. 11 , the descriptive information corresponding to the second pinyin content with the highest similarity is used as the target descriptive information, including:

S351: If there is one second pinyin content with the highest similarity, use the description information corresponding to the second pinyin content with the highest similarity as the target description information.

S352: If there are multiple second pinyin contents with the highest similarity, acquire a text vector of the text content corresponding to the voice control instruction as a first text vector.

In the embodiment of the present application, abbreviations or short names may appear in the user's voice control instruction, in which case the longest common subsequence and the edit distance may yield multiple equally similar results. For example, the user's voice control instruction is "Fulian" (复联, a common abbreviation of 复仇者联盟, "Avengers"), and the second pinyin content set includes {"Avengers 4", "Copy a few couplets"}. The longest common subsequence between "Fulian" and each of the two objects to be matched is "Fulian", and the edit distance is 4 in both cases, so the calculated similarities are the same and a unique result cannot be determined. As another example, the user's voice control instruction is "B station", and the second pinyin content set includes {Bilibili (哔哩哔哩), Q Music, A Cloud Music, B Music}, and no matching result can be obtained. In such cases, the similarities between the multiple second pinyin content and the first pinyin content can be measured based on semantic similarity, so as to obtain the most similar second pinyin content.

As one way, the text vector can be obtained through the pre-trained model BERT. BERT is a deep neural network; the text to be processed can be input into the encoder part of BERT to obtain the corresponding text vector.

In the embodiment of the present application, the text input corresponding to the first text vector may be the text content corresponding to the voice control instruction obtained through the ASR module, may be the text content of the triplet corresponding to the voice control instruction obtained through the ASR module and the NLP module, or may be the text content corresponding to the third pinyin content.

S353: Obtain multiple text vectors corresponding to the description information corresponding to the second pinyin content with the highest similarity, so as to obtain multiple second text vectors.

Wherein, in the embodiment of the present application, the text input corresponding to the second text vector may be the respective text description information of multiple controls in the target interface obtained through the system program, or may be the text description information of the overall operation instruction of the interface, For example: swipe left, swipe right, swipe up, swipe down, back, desktop, double click, long press, etc.

It should be noted that the text input corresponding to the text vector can be a Chinese character string or a pinyin string.

Furthermore, it should be noted that in the embodiment of the present application, text vectors can also be obtained through tools such as Doc2Vec (document-to-vector), or open-source pre-trained models such as RoBERTa, UniLM, ELECTRA, and XLNet.

S354: Calculate respectively the vector distances between the multiple second text vectors and the first text vector.

Wherein, as a method, the vector distance between each second text vector and the first text vector is calculated by cosine similarity, and the calculation formula is as follows:

S355: Use the description information corresponding to a second text vector with the smallest corresponding vector distance as the target description information.

As one way, after the vector distances between the multiple second text vectors and the first text vector are obtained, the vector distances can be sorted by magnitude, and the description information corresponding to the second text vector with the smallest vector distance is used as the target description information.
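Steps S354 and S355 can be sketched as follows. This illustration assumes that the vector distance is taken as one minus the cosine similarity, which is one common convention and is not confirmed by this text. The short vectors are hypothetical stand-ins for encoder outputs, and the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    # Assumption: distance is taken as 1 - cosine similarity.
    return 1.0 - cosine_similarity(u, v)


def pick_target(first_vector, second_vectors):
    """second_vectors maps description information to its text vector."""
    return min(second_vectors,
               key=lambda desc: cosine_distance(first_vector, second_vectors[desc]))


# Hypothetical 3-dimensional vectors standing in for encoder outputs.
first_vector = [0.9, 0.1, 0.0]
second_vectors = {
    "description A": [1.0, 0.0, 0.0],
    "description B": [0.0, 1.0, 0.0],
}
target = pick_target(first_vector, second_vectors)
```

Taking the minimum directly avoids sorting the whole list when only the closest candidate is needed.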

It should be noted that, since text vectors are continuously distributed in a high-dimensional space, the probability that two text vectors have the same similarity value is negligible. Therefore, the description information corresponding to a unique second text vector can be determined as the target description information.

Through the above method, when a unique matching result cannot be obtained because abbreviations or short names exist in the user's voice control instruction, the vector distances between the multiple second text vectors and the first text vector can be calculated to obtain the target description information corresponding to a unique matching result, so that the control operation corresponding to the target description information is executed, which further improves the success rate of semantic recognition.

S360: Execute a control operation corresponding to the target description information.

It should be noted that, in the embodiment of the present application, if it is determined in the process of executing S350 that there are multiple second pinyin contents with the highest similarity, the text vector corresponding to the first pinyin content may also be obtained as the first text vector. Furthermore, the text vector corresponding to the third pinyin content may also be obtained as the first text vector. However, when the text vector corresponding to the third pinyin content is used as the first text vector, multiple first text vectors may be obtained. In that case, the vector distance between each of the multiple first text vectors and each of the multiple second text vectors is calculated, and the description information corresponding to the second text vector with the shortest vector distance is used as the target description information. For example, if the multiple first text vectors obtained based on the third pinyin content include the first text vectors L1, L2 and L3, and the multiple second text vectors include the second text vectors L4 and L5, then in the process of calculating the vector distances, the distances between L1 and each of L4 and L5, between L2 and each of L4 and L5, and between L3 and each of L4 and L5 are calculated.
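The pairwise comparison in the L1 to L5 example can be sketched as below. The two-dimensional vectors are invented purely for illustration, the distance is again assumed to be one minus the cosine similarity, and the function names are hypothetical.

```python
import math
from itertools import product


def cosine_distance(u, v):
    # Assumption: distance = 1 - cosine similarity; vectors are non-zero.
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def closest_second(first_vectors, second_vectors):
    """Return the name of the second text vector with the shortest distance
    to any of the first text vectors."""
    pairs = product(first_vectors.items(), second_vectors.items())
    best = min(pairs, key=lambda pair: cosine_distance(pair[0][1], pair[1][1]))
    return best[1][0]


# Hypothetical vectors: three first text vectors and two second text vectors.
first_vectors = {"L1": [1.0, 0.0], "L2": [0.6, 0.8], "L3": [0.0, 1.0]}
second_vectors = {"L4": [0.1, 1.0], "L5": [1.0, 1.0]}
winner = closest_second(first_vectors, second_vectors)
```

All six distances are evaluated, and the second text vector that takes part in the globally shortest pair is selected.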

In the voice control method provided by this embodiment, after the pinyin content directly converted from the voice control instruction is acquired, if the directly converted pinyin content cannot be successfully matched with the pinyin content of the description information to be selected, the corresponding similar pinyin content can be obtained based on the directly converted pinyin content and matched with the pinyin content of the description information to be selected. This improves the probability that the voice control instruction triggered by the user is successfully matched to description information, which in turn helps to improve the probability of accurately executing the voice control. Moreover, in this embodiment, when no second pinyin content is successfully matched with the third pinyin content, the similarities between the multiple second pinyin content and the first pinyin content can be obtained respectively to obtain the similarity corresponding to each second pinyin content, and the description information corresponding to the second pinyin content with the highest similarity is used as the target description information. This addresses the difficulty of matching when the user's description of an interface control omits or alters part of the description information, as well as the difficulty of matching when the user refers to a control through an abbreviation or alias, so that the control operation corresponding to the target description information can be performed and the probability of accurately performing voice control is improved.

Furthermore, the solution of this application uses semantic similarity to match the voice control instruction with the description information: the instruction text to be matched (the text converted from the voice control instruction) is vectorized by a large-scale pre-trained model, and matching is performed based on the similarity of the vectors, which can solve the problem that the voice control instruction and the description information differ considerably in wording but have the same meaning.

In order to better understand the solutions of all the embodiments of the present application, an implementation process of the voice control method of the present application will be introduced below.

Please refer to FIG. 12. After step S4010 is performed to obtain the first pinyin content and the multiple second pinyin content, the first pinyin content can be matched with the multiple second pinyin content. When a second pinyin content is successfully matched with the first pinyin content, the description information whose corresponding second pinyin content successfully matches the first pinyin content can be used as the target description information, and the control operation corresponding to the target description information is executed. When no second pinyin content is successfully matched with the first pinyin content, the operation of obtaining the third pinyin content can be performed. Specifically, whether each phoneme included in the first pinyin content has a corresponding phoneme correspondence in the phoneme extension table can be queried according to Table 1, a phoneme having a phoneme correspondence is determined as a specified phoneme, the similar phoneme corresponding to the specified phoneme is determined based on the phoneme correspondence, and the specified phoneme in the first pinyin content is then replaced with the similar phoneme to obtain the third pinyin content.

After step S4050 is performed to obtain the third pinyin content, the third pinyin content can be matched with the multiple second pinyin content. If a second pinyin content is successfully matched with the third pinyin content, the description information whose corresponding second pinyin content successfully matches the third pinyin content is used as the target description information. When no second pinyin content is successfully matched with the third pinyin content, step S4090 can be executed to obtain the similarities between the multiple second pinyin content and the first pinyin content respectively, so as to obtain the similarity corresponding to each second pinyin content; the description information corresponding to the second pinyin content with the highest similarity is then used as the target description information, and the control operation corresponding to the target description information is executed.

The reference similarities between the multiple second pinyin content and the first pinyin content can be obtained based on the longest common subsequence and the edit distance, so as to obtain the similarity corresponding to each second pinyin content. If there is one second pinyin content with the highest similarity, the description information corresponding to that second pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed. If there are multiple second pinyin contents with the highest similarity, the text vector of the text content corresponding to the voice control instruction can be obtained as the first text vector, and the text vectors of the description information corresponding to the second pinyin contents with the highest similarity can be obtained as multiple second text vectors. The vector distances between the multiple second text vectors and the first text vector are then calculated respectively, the description information corresponding to the second text vector with the smallest vector distance is used as the target description information, and the control operation corresponding to the target description information is executed.
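The overall flow of FIG. 12 is a cascade of fallbacks, which the sketch below tries to make explicit. It is only an illustration under several assumptions: a "successful match" is taken to mean string equality, and the phoneme expansion, the similarity measure and the semantic tie-break are passed in as hypothetical callables (`expand`, `similarity`, `tie_break`) rather than implemented here.

```python
def resolve(first, second_to_desc, expand, similarity, tie_break):
    """Return the target description information, or None if nothing is selectable.

    first           -- first pinyin content (str)
    second_to_desc  -- dict: second pinyin content -> description information
    expand          -- callable: first pinyin content -> iterable of third pinyin content
    similarity      -- callable: (first, second) -> numeric similarity
    tie_break       -- callable: list of tied descriptions -> one description
    """
    if not second_to_desc:
        return None
    # Stage 1: direct match between first and second pinyin content.
    if first in second_to_desc:
        return second_to_desc[first]
    # Stage 2: match the third pinyin content (similar-phoneme expansion).
    for third in expand(first):
        if third in second_to_desc:
            return second_to_desc[third]
    # Stage 3: rank every second pinyin content by similarity to the first.
    scores = {second: similarity(first, second) for second in second_to_desc}
    best = max(scores.values())
    top = [second_to_desc[s] for s, score in scores.items() if score == best]
    if len(top) == 1:
        return top[0]
    # Stage 4: several candidates tie, so fall back to the semantic comparison.
    return tie_break(top)
```

Passing the later stages in as callables keeps each fallback independently replaceable, for example swapping the tie-break from a text-vector comparison to another strategy without touching the earlier stages.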

Please refer to FIG. 13 , a voice control device 600 provided by the present application, the device 600 includes:

The first pinyin content and the second pinyin content acquisition unit 610, configured to acquire the first pinyin content and multiple second pinyin content, the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the multiple The second pinyin content includes the pinyin content of the descriptive information to be selected, and the descriptive information is information used to describe the corresponding operation.

The third pinyin content acquiring unit 620 is configured to acquire a third pinyin content when the second pinyin content fails to match the first pinyin content, and the third pinyin content is a pinyin content similar to the first pinyin content .

A pinyin content matching unit 630, configured to match the third pinyin content with the multiple second pinyin content, and use the description information that the corresponding second pinyin content successfully matches the third pinyin content as the target description information.

The control operation executing unit 640 is configured to execute the control operation corresponding to the target description information.

As a method, the first pinyin content and the second pinyin content acquisition unit 610 is specifically configured to acquire the description information of multiple controls included in the target interface as description information to be selected; convert the description information to be selected into corresponding Pinyin content to get multiple second pinyin content.

As one way, the third pinyin content acquisition unit 620 is specifically configured to acquire the similar phoneme corresponding to the specified phoneme in the first pinyin content, and replace the specified phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content. There are multiple similar phonemes. Optionally, the third pinyin content acquisition unit 620 is specifically configured to replace the specified phoneme in the first pinyin content with the multiple similar phonemes respectively, to obtain the first pinyin content after phoneme replacement corresponding to each of the multiple similar phonemes, as the third pinyin content. Optionally, the third pinyin content acquisition unit 620 is specifically configured to: combine similar phonemes corresponding to at least two specified phonemes with each other to obtain multiple phoneme pairs, where each phoneme pair includes a similar phoneme corresponding to each of the at least two specified phonemes; replace the corresponding specified phonemes in the first pinyin content based on each of the multiple phoneme pairs, to obtain the first replacement pinyin content corresponding to each phoneme pair; replace the corresponding specified phonemes in the first pinyin content with the similar phonemes corresponding to the multiple specified phonemes, to obtain the second replacement pinyin content corresponding to each specified phoneme; and use the first replacement pinyin content and the second replacement pinyin content as the third pinyin content.

As another way, the third pinyin content acquisition unit 620 is specifically configured to: query whether the phonemes included in the first pinyin content have corresponding phoneme correspondences in the phoneme extension table, where each phoneme correspondence represents a pair of similar phonemes; determine a phoneme that has a phoneme correspondence as the specified phoneme; and determine the similar phoneme corresponding to the specified phoneme based on the phoneme correspondence.

As one way, the pinyin content matching unit 630 is specifically configured to match the first pinyin content with the multiple second pinyin content, and when no second pinyin content is successfully matched with the first pinyin content, the third pinyin content is acquired. Optionally, the pinyin content matching unit 630 is specifically configured to, when a second pinyin content is successfully matched with the first pinyin content, use the description information whose corresponding second pinyin content successfully matches the first pinyin content as the target description information, and execute the control operation corresponding to the target description information.

As another way, the pinyin content matching unit 630 is specifically configured to: match the third pinyin content with the multiple second pinyin content, and when a second pinyin content is successfully matched with the third pinyin content, use the description information whose corresponding second pinyin content successfully matches the third pinyin content as the target description information; when no second pinyin content is successfully matched with the third pinyin content, obtain the similarities between the multiple second pinyin content and the first pinyin content respectively, so as to obtain the similarity corresponding to each second pinyin content; and use the description information corresponding to the second pinyin content with the highest similarity as the target description information. Optionally, the pinyin content matching unit 630 is specifically configured to: obtain the first reference similarities between the multiple second pinyin content and the first pinyin content based on the longest common subsequence, so as to obtain the first reference similarity corresponding to each second pinyin content; obtain the second reference similarities between the multiple second pinyin content and the first pinyin content based on the edit distance, so as to obtain the second reference similarity corresponding to each second pinyin content; and add the first reference similarity and the second reference similarity corresponding to each second pinyin content to obtain the similarity corresponding to each second pinyin content.
Optionally, the pinyin content matching unit 630 is specifically configured to: if there is one second pinyin content with the highest similarity, use the description information corresponding to that second pinyin content as the target description information; if there are multiple second pinyin contents with the highest similarity, obtain the text vector of the text content corresponding to the voice control instruction as the first text vector; obtain the text vectors of the description information corresponding to the second pinyin contents with the highest similarity, so as to obtain multiple second text vectors; calculate the vector distances between the multiple second text vectors and the first text vector respectively; and use the description information corresponding to the second text vector with the smallest vector distance as the target description information.

An electronic device provided by the present application will be described below with reference to FIG. 14 .

Referring to FIG. 14 , based on the above-mentioned voice control method and apparatus, an embodiment of the present application also provides an electronic device 1000 capable of executing the aforementioned voice control method. The electronic device 1000 includes one or more (only one is shown in the figure) processors 102 , a memory 104 , a camera 106 and an audio collection device 108 coupled to each other. Wherein, the memory 104 stores programs capable of executing the contents of the foregoing embodiments, and the processor 102 can execute the programs stored in the memory 104 .

The processor 102 may include one or more processing cores. The processor 102 uses various interfaces and circuits to connect the various parts of the entire electronic device 1000, and performs the various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 104 and calling data stored in the memory 104. Optionally, the processor 102 may be implemented in at least one hardware form among a digital signal processor (Digital Signal Processing, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), and a programmable logic array (Programmable Logic Array, PLA). The processor 102 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the GPU is used to render and draw the displayed content; the modem is used to handle wireless communication. It can be understood that the modem may also not be integrated into the processor 102 and may instead be implemented by a separate communication chip. As one way, the processor 102 may be a neural network chip, for example, an embedded neural network processor (NPU).

The memory 104 may include random access memory (Random Access Memory, RAM), and may also include read-only memory (Read-Only Memory). Memory 104 may be used to store instructions, programs, codes, sets of codes, or sets of instructions. For example, a device may be stored in memory 104 . The device may be the aforementioned device 600 . The memory 104 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.) , instructions for implementing the following method embodiments, and the like.

Furthermore, the electronic device 1000 may further include a network module 110 and a sensor module 112 in addition to the aforementioned devices.

The network module 110 is used to implement information interaction between the electronic device 1000 and other devices, for example, transmitting device control instructions, manipulation request instructions, and status information acquisition instructions. When the electronic device 200 is a different type of device, its corresponding network module 110 may differ accordingly.

The sensor module 112 may include at least one sensor. Specifically, the sensor module 112 may include, but is not limited to: a level, a light sensor, a motion sensor, a pressure sensor, an infrared heat sensor, a distance sensor, an acceleration sensor, and other sensors.

The pressure sensor may be a sensor for detecting pressure generated by pressing on the electronic device 1000. That is, the pressure sensor detects pressure generated by contact or pressing between the user and the electronic device, e.g., contact or pressing between the user's ear and the mobile terminal. The pressure sensor can therefore be used to determine whether contact or pressing occurs between the user and the electronic device 1000, as well as the magnitude of the pressure.

The acceleration sensor can detect the magnitude of acceleration in each direction (generally along three axes) and, when stationary, the magnitude and direction of gravity. It can be used in applications that recognize the attitude of the electronic device 1000 (such as switching between landscape and portrait orientation, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer or tap detection). In addition, the electronic device 1000 may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, and a thermometer, which will not be described here.

The audio collection device 110 is configured to collect audio signals. Optionally, the audio collection device 110 includes multiple audio collection devices, and the audio collection devices may be microphones.

In one implementation, the network module of the electronic device 1000 is a radio frequency module, which is used to receive and send electromagnetic waves, convert between electromagnetic waves and electrical signals, and thereby communicate with a communication network or other devices. The radio frequency module may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so on. For example, the radio frequency module can interact with external devices by sending or receiving electromagnetic waves, for instance by sending instructions to a target device.

Please refer to FIG. 15, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application. Program code is stored in the computer-readable storage medium 800, and the program code can be invoked by a processor to execute the methods described in the foregoing method embodiments.

The computer-readable storage medium 800 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, hard disk, or ROM. Optionally, the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 that executes any of the method steps in the above-mentioned methods. This program code can be read from or written into one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.

To sum up, in the voice control method, device, electronic device, and readable storage medium provided by the present application, the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content, and the pinyin content of the description information to be selected is obtained as a plurality of second pinyin contents. If it is then determined that no second pinyin content successfully matches the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content. The third pinyin content is then matched against the plurality of second pinyin contents, the description information whose corresponding second pinyin content successfully matches the third pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed.
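
As a non-limiting illustration, the flow summarized above might be sketched in Python as follows. The sketch assumes that pinyin content is represented as a list of phonemes (initials and finals) and that similar phonemes come from a small lookup table; the table entries, the function names `expand_similar` and `select_target`, and the sample descriptions are all hypothetical and are not taken from the embodiments.

```python
from itertools import product

# Hypothetical phoneme extension table: each key maps to phonemes that are
# easily confused with it. The entries are illustrative assumptions only.
PHONEME_EXTENSION = {
    "l": ["n"],
    "n": ["l"],
    "in": ["ing"],
    "ing": ["in"],
}


def expand_similar(first_pinyin):
    """Return candidate third pinyin contents for a list of phonemes."""
    # For every position, allow the original phoneme or any similar phoneme.
    options = [[p] + PHONEME_EXTENSION.get(p, []) for p in first_pinyin]
    candidates = [list(combo) for combo in product(*options)]
    # Drop the unmodified sequence: it has already failed exact matching.
    return [c for c in candidates if c != first_pinyin]


def select_target(first_pinyin, second_pinyins):
    """second_pinyins maps description information to its phoneme list."""
    # Try the first pinyin content against every second pinyin content.
    for description, pinyin in second_pinyins.items():
        if pinyin == first_pinyin:
            return description
    # No exact match: fall back to the similar (third) pinyin contents.
    for third_pinyin in expand_similar(first_pinyin):
        for description, pinyin in second_pinyins.items():
            if pinyin == third_pinyin:
                return description
    return None


controls = {
    "流行": ["l", "iu", "x", "ing"],
    "设置": ["sh", "e", "zh", "i"],
}
target = select_target(["n", "iu", "x", "in"], controls)
```

In this sketch a misrecognized instruction such as `n iu x in` fails the exact comparison but is still resolved to the control described as 流行, because one of its expanded variants equals that control's pinyin content. Executing the corresponding control operation is outside the scope of the sketch.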

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A voice control method, characterized in that the method comprises:

acquiring first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to an acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of description information to be selected, and the description information is information used to describe a corresponding operation;

acquiring third pinyin content when no second pinyin content successfully matches the first pinyin content, wherein the third pinyin content is pinyin content similar to the first pinyin content;

matching the third pinyin content against the plurality of second pinyin contents, and using, as target description information, the description information whose corresponding second pinyin content successfully matches the third pinyin content; and

executing the control operation corresponding to the target description information.

2. The method according to claim 1, wherein acquiring the third pinyin content comprises:

acquiring a similar phoneme corresponding to a designated phoneme in the first pinyin content; and

replacing the designated phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content.

3. The method according to claim 2, wherein there are a plurality of similar phonemes, and replacing the designated phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content comprises:

replacing the designated phoneme in the first pinyin content with each of the plurality of similar phonemes in turn, and using the resulting phoneme-replaced first pinyin content corresponding to each similar phoneme as the third pinyin content.

4. The method according to claim 2, wherein there are a plurality of designated phonemes, and replacing the designated phonemes in the first pinyin content with the similar phonemes to obtain the third pinyin content comprises:

combining the similar phonemes corresponding to at least two designated phonemes with one another to obtain a plurality of phoneme pairs, wherein each phoneme pair includes one similar phoneme for each of the at least two designated phonemes;

replacing the corresponding designated phonemes in the first pinyin content based on each of the plurality of phoneme pairs, to obtain first replacement pinyin content corresponding to each phoneme pair;

replacing the corresponding designated phoneme in the first pinyin content with the similar phoneme corresponding to each of the plurality of designated phonemes, to obtain second replacement pinyin content corresponding to each designated phoneme; and

using the first replacement pinyin content and the second replacement pinyin content as the third pinyin content.
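
As a non-limiting illustration, the two kinds of replacement described in this claim might be sketched as follows. The sketch assumes that pinyin content is a list of phonemes and that the designated phonemes are identified by their index; the function name `third_pinyin_candidates` and the `similar` mapping are hypothetical. It reads the second replacement as replacing one designated phoneme at a time, and it combines all designated phonemes at once, which yields pairs when exactly two are designated.

```python
from itertools import product


def third_pinyin_candidates(first_pinyin, similar):
    """similar maps the index of a designated phoneme to its similar phonemes."""
    positions = sorted(similar)
    results = []

    # First replacement contents: all designated phonemes replaced together,
    # one candidate per combination of similar phonemes.
    if len(positions) >= 2:
        for combo in product(*(similar[i] for i in positions)):
            candidate = list(first_pinyin)
            for index, phoneme in zip(positions, combo):
                candidate[index] = phoneme
            results.append(candidate)

    # Second replacement contents: only one designated phoneme replaced.
    for index in positions:
        for phoneme in similar[index]:
            candidate = list(first_pinyin)
            candidate[index] = phoneme
            results.append(candidate)

    return results


candidates = third_pinyin_candidates(
    ["n", "iu", "x", "in"],
    {0: ["l"], 3: ["ing"]},
)
```

With two designated phonemes that each have one similar phoneme, the sketch produces three candidates: one with both phonemes replaced and two with a single phoneme replaced.
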

5. The method according to claim 2, wherein acquiring the similar phoneme corresponding to the designated phoneme in the first pinyin content comprises:

querying whether each phoneme included in the first pinyin content has a phoneme correspondence in a phoneme extension table, wherein each phoneme correspondence represents a pair of similar phonemes; and

using a phoneme determined to have a phoneme correspondence as the designated phoneme, and determining the similar phoneme corresponding to the designated phoneme based on the phoneme correspondence.

6. The method according to claim 1, wherein acquiring the third pinyin content comprises:

acquiring features of the first pinyin content;

comparing the features of the first pinyin content with each of a plurality of pre-acquired reference features, wherein the reference features are features of the pinyin content corresponding to a plurality of words acquired in advance; and

using, as the third pinyin content, the pinyin content corresponding to the reference feature that is successfully compared.

7. The method according to claim 6, characterized in that the successfully compared reference feature is the same as the feature of the first pinyin content.

8. The method according to claim 6, wherein acquiring the features of the first pinyin content comprises:

acquiring the features of the first pinyin content by means of text vectors;

wherein the reference features are acquired in the same manner as the features of the first pinyin content.

9. The method according to any one of claims 1-8, wherein the method further comprises:

when a second pinyin content successfully matches the first pinyin content, using, as the target description information, the description information whose corresponding second pinyin content successfully matches the first pinyin content; and

executing the control operation corresponding to the target description information.

10. The method according to any one of claims 1-9, wherein matching the third pinyin content against the plurality of second pinyin contents and using, as the target description information, the description information whose corresponding second pinyin content successfully matches the third pinyin content comprises:

matching the third pinyin content against the plurality of second pinyin contents, and, when a second pinyin content successfully matches the third pinyin content, using, as the target description information, the description information whose corresponding second pinyin content successfully matches the third pinyin content;

when no second pinyin content successfully matches the third pinyin content, acquiring the similarity between each of the plurality of second pinyin contents and the first pinyin content, so as to obtain the similarity corresponding to each second pinyin content; and

using the description information corresponding to the second pinyin content with the highest similarity as the target description information.

11. The method according to claim 10, wherein acquiring the similarity between each of the plurality of second pinyin contents and the first pinyin content, so as to obtain the similarity corresponding to each second pinyin content, comprises:

acquiring, based on the longest common subsequence, a first reference similarity between each of the plurality of second pinyin contents and the first pinyin content, so as to obtain the first reference similarity corresponding to each second pinyin content;

acquiring, based on edit distance, a second reference similarity between each of the plurality of second pinyin contents and the first pinyin content, so as to obtain the second reference similarity corresponding to each second pinyin content; and

adding the first reference similarity corresponding to each second pinyin content to its second reference similarity to obtain the similarity corresponding to that second pinyin content.
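
As a non-limiting illustration, the combined similarity described in this claim might be computed as sketched below. The claim does not state how the longest common subsequence or the edit distance is converted into a similarity, so the normalization by the longer sequence length used here is an assumption, as are the function names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # delete x
                               current[j - 1] + 1,       # insert y
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[len(b)]


def similarity(first_pinyin, second_pinyin):
    """Sum of the two reference similarities (assumed normalization)."""
    longest = max(len(first_pinyin), len(second_pinyin)) or 1
    first_reference = lcs_length(first_pinyin, second_pinyin) / longest
    second_reference = 1 - edit_distance(first_pinyin, second_pinyin) / longest
    return first_reference + second_reference


def most_similar(first_pinyin, second_pinyins):
    """Return the description whose second pinyin content scores highest."""
    return max(second_pinyins,
               key=lambda d: similarity(first_pinyin, second_pinyins[d]))
```

Under this normalization each reference similarity lies between 0 and 1, so identical sequences score 2.0 and completely different sequences of equal length score 0.0. The functions accept either strings or lists of phonemes.
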

12. The method according to claim 10, wherein using the description information corresponding to the second pinyin content with the highest similarity as the target description information comprises:

if exactly one second pinyin content has the highest similarity, using the description information corresponding to that second pinyin content as the target description information;

if a plurality of second pinyin contents have the highest similarity, acquiring the text vector of the text content corresponding to the voice control instruction as a first text vector;

acquiring the text vectors of the description information corresponding to the plurality of second pinyin contents with the highest similarity, so as to obtain a plurality of second text vectors;

calculating the vector distance between each of the plurality of second text vectors and the first text vector; and

using the description information corresponding to the second text vector with the smallest vector distance as the target description information.
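
As a non-limiting illustration, the tie-breaking step in this claim might look like the sketch below. It assumes the text vectors have already been produced by some embedding step that is not shown, and it uses Euclidean distance as the vector distance, which the claim does not specify; the function name `break_tie` and the sample vectors are hypothetical.

```python
import math


def break_tie(first_text_vector, tied):
    """tied maps description information to its second text vector."""
    # Pick the description whose vector is closest to the instruction's vector.
    return min(tied, key=lambda d: math.dist(first_text_vector, tied[d]))


chosen = break_tie(
    [1.0, 0.0],
    {"a": [0.0, 1.0], "b": [0.9, 0.1]},
)
```

Here `chosen` is `"b"`, since its vector lies nearer to the first text vector than that of `"a"`.
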

13. The method according to any one of claims 1-12, wherein acquiring the plurality of second pinyin contents comprises:

acquiring the description information of a plurality of controls included in a target interface as the description information to be selected; and

converting the description information to be selected into corresponding pinyin content to obtain the plurality of second pinyin contents.

14. The method according to any one of claims 1-13, wherein the target description information is the description information corresponding to a control in the target interface, and executing the control operation corresponding to the target description information comprises:

acquiring the triplet to which the control corresponding to the target description information belongs;

acquiring the user intent and the object attachment information in the triplet; and

executing, based on the user intent and the object attachment information, the control operation corresponding to the target description information by means of event injection.

15. The method according to any one of claims 1-13, wherein the target description information is the description information corresponding to a control in the target interface, and executing the control operation corresponding to the target description information comprises:

acquiring the triplet to which the control corresponding to the target description information belongs;

acquiring the user intent and the object attachment information in the triplet; and

executing, based on the user intent and the object attachment information, the control operation corresponding to the target description information by means of a simulated click.

16. The method according to any one of claims 1-13, wherein the target description information is description information corresponding to an overall interface operation instruction.

17. A voice control device, characterized in that the device comprises:

a first and second pinyin content acquisition unit, configured to acquire first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to an acquired voice control instruction, the plurality of second pinyin contents include the pinyin content of description information to be selected, and the description information is information used to describe a corresponding operation;

a third pinyin content acquisition unit, configured to acquire third pinyin content when no second pinyin content successfully matches the first pinyin content, wherein the third pinyin content is pinyin content similar to the first pinyin content;

a pinyin content matching unit, configured to match the third pinyin content against the plurality of second pinyin contents, and to use, as target description information, the description information whose corresponding second pinyin content successfully matches the third pinyin content; and

a control operation execution unit, configured to execute the control operation corresponding to the target description information.

18. An electronic device, characterized in that it comprises one or more processors and a memory;

wherein one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of any one of claims 1-16.

19. A computer-readable storage medium, wherein program code is stored in the computer-readable storage medium, and the method according to any one of claims 1-16 is performed when the program code is run.

20. A computer program product, comprising computer programs/instructions, characterized in that, when the computer programs/instructions are executed by a processor, the steps of the method according to any one of claims 1-16 are implemented.