CN114049890A - Voice control method and device and electronic equipment

Info

Publication number
CN114049890A
Authority
CN
China
Prior art keywords
pinyin
content
description information
pinyin content
contents
Prior art date
Legal status
Pending
Application number
CN202111296079.3A
Other languages
Chinese (zh)
Inventor
曾理
张晓帆
Current Assignee
Hangzhou Douku Software Technology Co Ltd
Original Assignee
Hangzhou Douku Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Douku Software Technology Co Ltd filed Critical Hangzhou Douku Software Technology Co Ltd
Priority to CN202111296079.3A priority Critical patent/CN114049890A/en
Publication of CN114049890A publication Critical patent/CN114049890A/en
Priority to PCT/CN2022/107788 priority patent/WO2023077878A1/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiments of the present application disclose a voice control method, a voice control apparatus and an electronic device, wherein the method comprises the following steps: acquiring first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents comprise the pinyin contents of description information to be selected, and the description information is information describing a corresponding operation; obtaining third pinyin content when no second pinyin content is successfully matched with the first pinyin content, wherein the third pinyin content is similar to the first pinyin content; matching the third pinyin content with the plurality of second pinyin contents, and taking, as target description information, the description information whose corresponding second pinyin content is successfully matched with the third pinyin content; and executing the control operation corresponding to the target description information. In this way, the probability that a voice control instruction triggered by the user is successfully matched with description information is increased, which in turn increases the probability of accurately executing voice control.

Description

Voice control method and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voice control method and apparatus, and an electronic device.
Background
By combining artificial intelligence technology with a virtual personal assistant (voice assistant), an electronic device can receive voice control instructions issued by a user through the auditory modality, so that voice control of the electronic device can be realized. However, in related voice control processes, the probability of accurately executing voice control still needs to be improved.
Disclosure of Invention
In view of the foregoing, the present application provides a voice control method, a voice control apparatus and an electronic device to improve upon the foregoing problem.
In a first aspect, the present application provides a voice control method, the method comprising: acquiring first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the plurality of second pinyin contents comprise the pinyin contents of description information to be selected, and the description information is information for describing a corresponding operation; obtaining third pinyin content when no second pinyin content is successfully matched with the first pinyin content, wherein the third pinyin content is pinyin content similar to the first pinyin content; matching the third pinyin content with the plurality of second pinyin contents, and taking, as target description information, the description information whose corresponding second pinyin content is successfully matched with the third pinyin content; and executing the control operation corresponding to the target description information.
In a second aspect, the present application provides a voice control apparatus, the apparatus comprising: a first pinyin content and second pinyin content obtaining unit, configured to obtain first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to the obtained voice control instruction, the plurality of second pinyin contents comprise the pinyin contents of description information to be selected, and the description information is information used for describing a corresponding operation; a third pinyin content obtaining unit, configured to obtain third pinyin content when no second pinyin content is successfully matched with the first pinyin content, wherein the third pinyin content is pinyin content similar to the first pinyin content; a pinyin content matching unit, configured to match the third pinyin content with the plurality of second pinyin contents and to take, as target description information, the description information whose corresponding second pinyin content is successfully matched with the third pinyin content; and a control operation execution unit, configured to execute the control operation corresponding to the target description information.
In a third aspect, the present application provides an electronic device comprising one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a fourth aspect, the present application provides a computer-readable storage medium having a program code stored therein, wherein the program code performs the above method when running.
In a fifth aspect, the present application provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the above-described method.
According to the voice control method, the apparatus, the electronic device, the computer program product and the storage medium provided by the present application, after the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content and the pinyin contents of the description information to be selected are obtained as the plurality of second pinyin contents, if it is determined that no second pinyin content is successfully matched with the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content; the third pinyin content is then matched with the plurality of second pinyin contents, the description information whose corresponding second pinyin content is successfully matched with the third pinyin content is taken as the target description information, and the control operation corresponding to the target description information is executed.
Therefore, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted content cannot be successfully matched with the pinyin contents of the description information to be selected, similar pinyin content can be obtained on the basis of the directly converted content and matched with the pinyin contents of the description information to be selected, so that the probability that a voice control instruction triggered by the user is successfully matched with description information is increased, and the probability of accurately executing voice control is in turn increased.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram illustrating an application scenario of a speech control method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an application scenario of another speech control method proposed in an embodiment of the present application;
fig. 3 is a flowchart illustrating a voice control method according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating an embodiment of S120 of FIG. 3;
FIG. 5 is a flow chart illustrating a voice control method according to another embodiment of the present application;
FIG. 6 is a flowchart of an embodiment of S230 of FIG. 5;
FIG. 7 is a diagram illustrating a method for obtaining first alternative Pinyin content corresponding to each phoneme pair according to the present application;
FIG. 8 is a diagram illustrating a method for obtaining second alternative Pinyin content corresponding to each designated phoneme according to the present application;
FIG. 9 is a flow chart illustrating a voice control method according to yet another embodiment of the present application;
FIG. 10 is a flowchart of an embodiment of S340 of FIG. 9;
FIG. 11 is a flowchart of an embodiment of S350 of FIG. 9;
FIG. 12 is a schematic diagram illustrating an implementation flow of a voice control method proposed in the present application;
fig. 13 is a block diagram illustrating a structure of a voice control apparatus according to an embodiment of the present application;
fig. 14 is a block diagram illustrating an electronic device according to the present application;
fig. 15 is a storage unit according to an embodiment of the present application, configured to store or carry program codes for implementing a voice control method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
By combining an artificial intelligence technology and a virtual personal assistant (voice assistant), the electronic device can receive a voice control instruction sent by a user through an auditory mode, then the voice control instruction of the user is converted into text information through an Automatic Speech Recognition (ASR) technology, and then subsequent understanding and mapping are carried out, so that the voice control of the electronic device is realized.
However, the inventor has found that, in related voice control processes, due to the combined influence of factors such as the accents of users in different regions, language habits, and noise interference while the instruction is being issued, the probability of accurately executing voice control still needs to be improved. For example, the user's voice control instruction may be recognized as a similar-sounding character string, such as "swipe up" (shang hua) being recognized as "shao hua" or as "xiao hua" (joke), and the like.
Therefore, the inventor proposes the voice control method, apparatus, electronic device and computer program product of the present application. In the method, after the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content and the pinyin contents of the description information to be selected are obtained as the plurality of second pinyin contents, if it is determined that no second pinyin content is successfully matched with the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content; the third pinyin content is then matched with the plurality of second pinyin contents, the description information whose corresponding second pinyin content is successfully matched with the third pinyin content is taken as the target description information, and the control operation corresponding to the target description information is executed.
Therefore, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted content cannot be successfully matched with the pinyin contents of the description information to be selected, similar pinyin content can be obtained on the basis of the directly converted content and matched with the pinyin contents of the description information to be selected, so that the probability that a voice control instruction triggered by the user is successfully matched with description information is increased, and the probability of accurately executing voice control is in turn increased.
The following first introduces an application scenario related to the embodiment of the present application.
In the embodiment of the application, the provided voice control method can be executed by the electronic device. In this manner, all steps in the voice control method provided by the embodiment of the present application may be performed by the electronic device. For example, as shown in fig. 1, a voice acquisition apparatus of the electronic device 100 may collect a voice control instruction and transmit the collected voice control instruction and the description information to be selected to a processor, so that the processor may acquire the first pinyin content and the plurality of second pinyin contents; the processor may then execute the steps of the voice control method provided by the present application by using the acquired first pinyin content, the acquired plurality of second pinyin contents, and pinyin content similar to the first pinyin content (the third pinyin content).
Moreover, the voice control method provided by the embodiment of the application can also be executed by a server. Correspondingly, in this manner, the electronic device may collect the voice control instruction and synchronously send the collected voice control instruction to the server; the server then executes the voice control method provided in the embodiment of the present application to determine the target description information and generates an operation instruction according to the target description information. In addition, the method can be executed by the electronic device and the server in cooperation. In this manner, some steps in the voice control method provided by the embodiment of the present application are performed by the electronic device, and the other steps are performed by the server.
For example, as shown in fig. 2, the electronic device 100 may perform a voice control method including: the first pinyin content is obtained and a plurality of second pinyin contents are obtained, and then subsequent steps are performed by the server 200. It should be noted that, in this manner executed by the electronic device and the server cooperatively, the steps executed by the electronic device and the server respectively are not limited to the manner described in the above example, and in practical applications, the steps executed by the electronic device and the server respectively may be dynamically adjusted according to actual situations.
Embodiments to which the present application relates will be described below with reference to the accompanying drawings.
Referring to fig. 3, a voice control method provided in the present application includes:
s110: the method comprises the steps of obtaining a first pinyin content and obtaining a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to the obtained voice control instruction, the second pinyin contents comprise the pinyin content of description information to be selected, and the description information is information used for describing corresponding operation.
In the embodiment of the application, the user can express his or her control intention through voice. Correspondingly, the electronic device can take the voice uttered by the user as the voice control instruction. Optionally, the instruction here refers to an instruction with which the user manipulates a visible interactive interface or an element on the interactive interface, and the voice control instruction may include: swipe left, open today's headlines, Bilibili, play XXX, the third in the second row, go back, swipe up, install Douyin, next, desktop, and the like.
As one way, the first pinyin content may be obtained through Automatic Speech Recognition (ASR) technology and natural language processing (NLP) technology.
Optionally, after the electronic device obtains the voice control instruction, the electronic device may transmit the voice control instruction of the user to the ASR module to obtain an instruction text corresponding to the voice control instruction, and then use the pinyin content corresponding to the instruction text as the first pinyin content. Optionally, after the instruction text is obtained, the user intention, the control object, and the object auxiliary information in the instruction text may be extracted through the NLP module and integrated into a triple in the form of { action, object, information }, where the action represents the user intention, the object represents the control object, and the information represents the object auxiliary information.
In a triple, the user intent refers to the operation that the user wishes to perform, such as: click, slide, long press, and the like. The auxiliary information refers to information that may accompany the control object; for example, when inputting, the text box is the control object and the characters to be filled in are the auxiliary information. It should be noted that the control object and the auxiliary information are not necessarily required. In the manner of converting the instruction text into a triple, after the triple corresponding to the voice control instruction is obtained, the pinyin corresponding to the control object in the triple can be used as the first pinyin content; if the control object of the triple is empty, the content corresponding to the user intent can be used as the first pinyin content. For example, the voice control instruction of the user may be "open today's headlines", and the triple obtained by the ASR module and the NLP module is: {click, today's headlines, Φ}, where the user intent is "click", the control object is "today's headlines", and the object auxiliary information is null, so the first pinyin content is "jin ri tou tiao". For another example: the user instruction may be "swipe up", and the triple obtained by the ASR module and the NLP module is: {swipe up, Φ, Φ}, where the user intent is "swipe up", the control object is null, and the object auxiliary information is null, so the first pinyin content is "shang hua".
Furthermore, in this embodiment of the application, the description information to be selected may be a set of description information of the operations that the electronic device can perform when the voice control instruction is acquired. The operations that the electronic device can perform may be operations performed on the electronic device as a whole, for example, turning off, switching the operation mode, or taking a picture. Still further, the operations that the electronic device can perform may include operations performed with respect to a target interface. The target interface may be the interface currently displayed by the electronic device. Furthermore, in this manner of operations performed on the target interface, the description information to be selected may include the respective description information of a plurality of controls in the target interface, for example: "feng hu kang da", "ao yun ji jin" (Olympic collection), "gu du de mei shi jia di ba ji" (the eighth season of the Solitary Gourmet), and the like. The description information to be selected may also include the description information corresponding to all interface overall operation instructions, for example: swipe left, swipe right, swipe up, swipe down, back, desktop, double click, long press, and the like.
As a mode, the pinyin content corresponding to all the description information to be selected can be obtained as the second pinyin content. Optionally, the respective description information of the multiple controls included in the target interface may be acquired as the description information to be selected, and then the description information to be selected is converted into the corresponding pinyin content, so as to obtain multiple second pinyin contents. Optionally, the description information corresponding to all the interface overall operation instructions may be acquired as the description information to be selected, and then the description information to be selected is converted into the corresponding pinyin content, so as to obtain a plurality of second pinyin contents. Moreover, the plurality of second pinyin contents may also include pinyin contents corresponding to the description information corresponding to the interface overall operation instruction and pinyin contents corresponding to the description information of each of the plurality of controls included in the target interface.
In this way, the electronic device may analyze the code corresponding to the target interface by using the system program, and may obtain information such as the type, the position, the size, and the like of each control as the description information of the control.
It should be noted that there are various ways to obtain the pinyin corresponding to a text, for example: pypinyin and xpinyin in the Python ecosystem, Pinyin4j in the Java ecosystem, and the like; which of them is used to convert text into pinyin can be selected according to the actual development environment.
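For ease of understanding, a minimal sketch of this text-to-pinyin step is given below in Python, assuming the pypinyin library mentioned above; the instruction text and the description strings are illustrative only.

```python
# Sketch of converting instruction text and candidate description information
# into pinyin, assuming the pypinyin library mentioned above.
from pypinyin import lazy_pinyin

def to_pinyin(text: str) -> str:
    # lazy_pinyin returns one toneless pinyin syllable per Chinese character.
    return " ".join(lazy_pinyin(text))

# First pinyin content: pinyin of the control object extracted from the instruction.
first_pinyin = to_pinyin("今日头条")        # -> "jin ri tou tiao"

# Second pinyin contents: pinyin of each piece of description information to be selected.
descriptions = ["今日头条", "上滑", "下滑", "返回", "桌面"]
second_pinyin = {to_pinyin(d): d for d in descriptions}
```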
S120: and acquiring third pinyin content when the second pinyin content is not successfully matched with the first pinyin content, wherein the third pinyin content is similar to the first pinyin content.
After the first pinyin content and the second pinyin contents are obtained, whether there is a second pinyin content that is successfully matched with the first pinyin content can be detected among the plurality of second pinyin contents. Optionally, a second pinyin content is determined to be successfully matched with the first pinyin content only if the two are identical. Illustratively, if the first pinyin content is "shao hua" and the second pinyin content currently being matched against it is "shang hua", it is determined that the first pinyin content "shao hua" and the second pinyin content "shang hua" do not match, because "ao" in the first pinyin content differs from "ang" in the second pinyin content.
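As a sketch of this exact-match check (continuing the illustrative data structures of the previous sketch):

```python
# Sketch of the exact-match check: a second pinyin content matches only if it is
# identical to the first pinyin content; None means no successful match, so S120 applies.
def match_exact(first_pinyin: str, second_pinyin: dict) -> str | None:
    return second_pinyin.get(first_pinyin)

print(match_exact("shao hua", {"shang hua": "上滑"}))  # -> None, so the third pinyin content is needed
```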
In the embodiment of the present application, as shown in fig. 4, as a manner, acquiring the third pinyin content includes:
s121: and acquiring similar phonemes corresponding to the designated phonemes in the first Pinyin content.
As a mode, whether the phoneme included in the first pinyin content has a corresponding phoneme corresponding relationship in the phoneme extension table may be queried, where each phoneme corresponding relationship represents a pair of similar phonemes; and taking the phoneme determined to have the phoneme corresponding relation as a designated phoneme, and determining a similar phoneme corresponding to the designated phoneme based on the phoneme corresponding relation.
A phoneme is the smallest unit of speech divided according to the natural attributes of speech, and one pronunciation action forms one phoneme. In Chinese, phonemes can be divided into initials and finals. Under the Chinese pinyin notation rules, when y is added in front of the final i or a final beginning with i (such as i, ia, ie, iao, iou, ian, in, iang, iong, and the like), it is written as yi, ya, ye, yao, you, yan, yin, yang, yong, and so on; when the final ü follows the initial j, q or x, or has no initial, its two dots are omitted and it is written as u, for example: yu, yue, yuan, yun, ju, qu, xu; when ü follows the initial n or l, it is written as nü or lü, so that ü can be substituted by u in some cases.
Moreover, due to the influence of accents, language habits and the like of users in different regions, the users may confuse some similar phonemes, thereby causing the situation that the recognition of the user voice control command is inaccurate. Therefore, a phoneme extension table as shown in table 1 can be formed by combining the rules of the chinese pinyin notation and common errors of chinese pronunciation.
TABLE 1
[Table 1 is presented as an image in the original publication. It is the phoneme extension table that pairs each phoneme with its similar phonemes, including, for example, the correspondences [sh, s], [sh, c], [sh, xi], [sh, zh], [ao, ou], [ao, iao], [ao, ang] and [h, f] cited below.]
Illustratively, when the user's voice control instruction "swipe up" is recognized as the similar-sounding "shao hua" and "shao hua" is used as the first pinyin content, the first pinyin content includes the following phonemes: sh, ao, h, ua. The following phoneme correspondences can therefore be obtained from the phoneme extension table: [sh, s], [sh, c], [sh, xi], [sh, zh], [ao, ou], [ao, iao], [ao, ang], [h, f]; sh, ao and h can then be used as the designated phonemes, and the similar phonemes corresponding to the designated phonemes are determined to be: s, c, xi, zh, ou, iao, ang, f.
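A minimal sketch of this lookup is given below; only the correspondences cited above are included in the table excerpt, and the segmentation of "shao hua" into phonemes is written out by hand for the example.

```python
# Excerpt of the phoneme extension table (Table 1); only the correspondences
# cited in the example above are listed here.
PHONEME_EXTENSION = {
    "sh": ["s", "c", "xi", "zh"],
    "ao": ["ou", "iao", "ang"],
    "h":  ["f"],
}

def designated_phonemes(phonemes: list[str]) -> dict:
    # A phoneme is "designated" if the extension table holds similar phonemes for it.
    return {p: PHONEME_EXTENSION[p] for p in phonemes if p in PHONEME_EXTENSION}

# "shao hua" segmented by hand into initials and finals: sh + ao, h + ua.
print(designated_phonemes(["sh", "ao", "h", "ua"]))
# -> {'sh': ['s', 'c', 'xi', 'zh'], 'ao': ['ou', 'iao', 'ang'], 'h': ['f']}
```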
S122: and replacing the appointed phoneme in the first pinyin content by the similar phoneme to obtain a third pinyin content.
As another way, the pinyin content similar to the first pinyin content may be obtained as the third pinyin content as a whole. In this way, the characteristics of the pinyin contents corresponding to the plurality of words can be directly obtained in advance as reference characteristics, after the first pinyin content is obtained, the characteristics of the first pinyin content can be obtained in the same way, then the characteristics of the first pinyin content are respectively compared with the reference characteristics obtained in advance, and the pinyin content corresponding to the reference characteristics which are successfully compared is used as the third pinyin content. Wherein, the reference characteristics successfully compared are the same as the characteristics of the first pinyin content. In this way, the relevant ways of obtaining the data characteristics may be all suitable for obtaining the characteristics of the pinyin content, and the way of specifically obtaining the characteristics of the pinyin content is not specifically limited in the embodiments of the present application. For example, the characteristics of the pinyin content may be obtained by means of a text vector.
S130: and matching the third pinyin content with the plurality of second pinyin contents, and using the description information of the successful matching of the corresponding second pinyin content and the third pinyin content as target description information.
Illustratively, the third pinyin content may be: { "sao hua", "cao hua", "xiao hua", "zhao hua", "shou hua", "shiao hua", "shang hua", "shao fu" }, the second pinyin content may be: { "feng hu kang da (beacon fire resistance)", "ao yun ji jin (olympic collection)", "gu du de mei shi jia di ba ji (the eighth season of the solitary gourmet)", "zuo hua", "you hua", "shang hua", "xia hua", "fan hua", "zhu mian", "shu and", "shuang ji", "chang an" }, the third pinyin content is matched with the second pinyin content, and the target description information "shang hua" can be obtained.
S140: and executing the control operation corresponding to the target description information.
As one mode, the target description information may be the description information corresponding to a control in the target interface, and the control operation corresponding to the target description information may be executed on the electronic device by event injection or simulated clicking, in combination with the user intent and object auxiliary information in the triple to which the control described by the target description information belongs. For example: if the target description information is "sou suo kuang" (search box), the user intent and object auxiliary information in the triple {search, search box, happy big foot} can be combined and an event injected: "happy big foot" is typed into the search box, so that the control operation corresponding to the target description information "sou suo kuang" is performed on the electronic device. For another example: if the target description information is "ao yun ji jin", the control operation corresponding to "ao yun ji jin" can be executed on the electronic device by clicking the Olympic collection control, in combination with the user intent in the triple {click, Olympic collection, Φ}.
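The injection mechanism itself is platform specific; the sketch below only illustrates dispatching on the triple's user intent, with inject_click and inject_text as hypothetical placeholders for the platform's event-injection or simulated-click facility.

```python
# Sketch of executing the control operation; inject_click and inject_text are
# hypothetical placeholders for the platform-specific event injection.
def inject_click(control_description: str) -> None:
    print(f"[inject] click the control described by: {control_description}")

def inject_text(control_description: str, text: str) -> None:
    print(f"[inject] type {text!r} into the control described by: {control_description}")

def execute(intent: str, target_description: str, extra_info: str = "") -> None:
    if intent == "search" and extra_info:
        inject_text(target_description, extra_info)
    elif intent == "click":
        inject_click(target_description)
    else:
        # Interface overall operations such as "shang hua" (swipe up) have no control object.
        print(f"[inject] perform interface operation: {target_description}")

execute("click", "ao yun ji jin")                     # triple {click, Olympic collection, Φ}
execute("search", "sou suo kuang", "happy big foot")  # triple {search, search box, happy big foot}
```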
As another mode, the target description information may be description information corresponding to the interface overall operation instruction. For example: if the target description information is "shang hua", the drawing operation can be directly executed on the electronic device.
In the method, after the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content and the pinyin contents of the description information to be selected are obtained as the plurality of second pinyin contents, if it is determined that no second pinyin content is successfully matched with the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content; the third pinyin content is matched with the plurality of second pinyin contents, the description information whose corresponding second pinyin content is successfully matched with the third pinyin content is taken as the target description information, and the control operation corresponding to the target description information is executed.
Therefore, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted content cannot be successfully matched with the pinyin contents of the description information to be selected, similar pinyin content can be obtained on the basis of the directly converted content and matched with the pinyin contents of the description information to be selected, so that the probability that a voice control instruction triggered by the user is successfully matched with description information is increased, and the probability of accurately executing voice control is in turn increased. In addition, this embodiment combines the concept of the phoneme in linguistics and acoustics, establishes an initial-and-final confusion extension table according to common errors in Mandarin pronunciation, and performs fuzzy extension on pinyin that cannot be accurately matched before matching again, which not only solves the problem of similar-sounding character errors in the speech recognition process, but also effectively alleviates speech recognition errors caused by non-standard pronunciation of the user.
Referring to fig. 5, a voice control method provided in the present application includes:
s210: the method comprises the steps of obtaining a first pinyin content and obtaining a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to the obtained voice control instruction, the second pinyin contents comprise the pinyin content of description information to be selected, and the description information is information used for describing corresponding operation.
S220: and acquiring similar phonemes corresponding to the designated phonemes in the first Pinyin content.
S230: and when the second pinyin content is not successfully matched with the first pinyin content, replacing the designated phoneme in the first pinyin content with the similar phoneme to obtain a third pinyin content.
As a mode, the designated phoneme in the first pinyin content may be replaced by a plurality of similar phonemes, respectively, to obtain a first pinyin content after phoneme replacement corresponding to each of the plurality of similar phonemes, so as to serve as a third pinyin content.
For example, the first pinyin content may be "shao hua". As can be seen from table 1, the designated phonemes of the first pinyin content "shao hua" may be sh, ao and h, where the similar phonemes corresponding to sh are s, c, xi and zh, the similar phonemes corresponding to ao are ou, iao and ang, and the similar phoneme corresponding to h is f. Replacing the designated phonemes in the first pinyin content with the plurality of similar phonemes respectively yields the third pinyin content {"sao hua", "cao hua", "xiao hua", "zhao hua", "shou hua", "shiao hua", "shang hua", "shao fu"}.
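A sketch of this single-phoneme replacement follows, under the same assumptions as the earlier lookup sketch (hand-segmented phonemes and a table excerpt); a full implementation would additionally re-normalize syllables that are not standard pinyin, which is left out here.

```python
# Sketch of generating third pinyin content by replacing one designated phoneme at a time.
PHONEME_EXTENSION = {"sh": ["s", "c", "xi", "zh"], "ao": ["ou", "iao", "ang"], "h": ["f"]}

def join_syllables(phonemes: list[str]) -> str:
    # For "shao hua" the first two phonemes form the first syllable, the last two the second.
    return "".join(phonemes[:2]) + " " + "".join(phonemes[2:])

def expand_single(phonemes: list[str]) -> list[str]:
    results = []
    for i, p in enumerate(phonemes):
        for similar in PHONEME_EXTENSION.get(p, []):
            results.append(join_syllables(phonemes[:i] + [similar] + phonemes[i + 1:]))
    return results

print(expand_single(["sh", "ao", "h", "ua"]))
# -> ['sao hua', 'cao hua', 'xiao hua', 'zhao hua', 'shou hua', 'shiao hua', 'shang hua', 'shao fua']
```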
Alternatively, as shown in fig. 6, replacing the designated phonemes in the first pinyin content with the similar phonemes to obtain third pinyin content includes:
s231: combining the similar phonemes corresponding to the at least two appointed phonemes with each other to obtain a plurality of phoneme pairs, wherein each phoneme pair comprises one similar phoneme corresponding to each of the at least two appointed phonemes.
In the embodiment of the present application, similar phonemes corresponding to at least two designated phonemes may be combined with each other in the combination manner shown in fig. 7. Referring to fig. 7, if the designated phoneme A corresponds to the similar phonemes O, P, Q, the designated phoneme B corresponds to the similar phonemes R, S, T, and the first pinyin content is ABC, each similar phoneme of the designated phoneme A can be combined one by one with all similar phonemes of the designated phoneme B to obtain the following phoneme pairs: OR, OS, OT, PR, PS, PT, QR, QS, QT. For example, the first pinyin content may be "shao hua", and the similar phonemes corresponding to sh and ao among its designated phonemes may be combined with each other in the combination manner shown in fig. 7 to obtain phoneme pairs such as: sou, siao, sang, cou, ciao, cang, xiao.
S232: and replacing the designated phoneme corresponding to the first pinyin content based on the plurality of phonemes to obtain first replaced pinyin content corresponding to each phoneme.
In the embodiment of the present application, as shown in fig. 7, after obtaining a plurality of phoneme pairs (OR, OS, OT, PR, PS, PT, QR, QS, QT), by replacing the corresponding designated phoneme in the first pinyin content ABC based on the plurality of phonemes, the first alternative pinyin content may be obtained by: ORC, OSC, OTC, PRC, PSC, PTC, QRC, QSC, QTC. For example, if the phoneme pair is sou, the corresponding first alternative pinyin content is "sou hua", and if the phoneme pair is cang, the corresponding first alternative pinyin content is "cang hua".
S233: and replacing the corresponding designated phoneme in the first pinyin content by using the similar phonemes corresponding to the designated phonemes to obtain second replaced pinyin content corresponding to each designated phoneme.
In the embodiment of the present application, the designated phonemes in the first pinyin content may be replaced according to the manner shown in fig. 8, so as to obtain a second replacement pinyin content corresponding to each designated phoneme. Referring to fig. 8, if the designated element a corresponds to the similar element O, P, Q, the designated element B corresponds to the similar element R, S, T, and the first pinyin content is ABC, the designated element a may be replaced with the similar element O, P, Q one by one to obtain second alternative pinyin contents OBC, PBC, and QBC corresponding to the designated element a, and then the designated element B may be replaced with the similar element R, S, T to obtain second alternative pinyin contents ARC, ASC, and ATC corresponding to the designated element B. For example, as shown in table 1, if the designated element of the first pinyin content may be "shao hua" is sh, ao, and h, the second alternative pinyin content corresponding to sh is { "sao hua", "cao hua", "xiao hua", "zhao" }, the second alternative pinyin content corresponding to ao is { "shou hua", "shiao hua", "shang hua", and the second alternative pinyin content corresponding to h is { "shao hua" }.
S234: and taking the first alternative pinyin content and the second alternative pinyin content as third pinyin content.
Compared with the first mode of obtaining the third pinyin content, the similarity expansion of the first pinyin content can be further carried out by taking the first alternative pinyin content and the second alternative pinyin content as the third pinyin content, so that the matching range with the second pinyin content is further expanded, and the matching success probability is improved.
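The combined expansion of S231 to S234 can be sketched as follows, again under the assumptions stated for the previous sketches (a hand-segmented "shao hua" and an excerpt of Table 1):

```python
# Sketch of S231-S234: pairwise replacement plus single replacement, merged as the third pinyin content.
from itertools import combinations, product

PHONEME_EXTENSION = {"sh": ["s", "c", "xi", "zh"], "ao": ["ou", "iao", "ang"], "h": ["f"]}

def join_syllables(phonemes):
    return "".join(phonemes[:2]) + " " + "".join(phonemes[2:])

def expand(phonemes):
    designated = [i for i, p in enumerate(phonemes) if p in PHONEME_EXTENSION]
    third = set()
    # S233: second alternative pinyin contents, replacing one designated phoneme at a time (fig. 8).
    for i in designated:
        for similar in PHONEME_EXTENSION[phonemes[i]]:
            third.add(join_syllables(phonemes[:i] + [similar] + phonemes[i + 1:]))
    # S231/S232: first alternative pinyin contents, replacing two designated phonemes together (fig. 7).
    for i, j in combinations(designated, 2):
        for si, sj in product(PHONEME_EXTENSION[phonemes[i]], PHONEME_EXTENSION[phonemes[j]]):
            replaced = list(phonemes)
            replaced[i], replaced[j] = si, sj
            third.add(join_syllables(replaced))
    return sorted(third)

candidates = expand(["sh", "ao", "h", "ua"])
print("shang hua" in candidates)  # -> True: the intended "swipe up" is among the candidates
```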
S240: and matching the third pinyin content with the plurality of second pinyin contents, and using the description information of the successful matching of the corresponding second pinyin content and the third pinyin content as target description information.
S250: and executing the control operation corresponding to the target description information.
According to the voice control method provided by this embodiment, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted content cannot be successfully matched with the pinyin contents of the description information to be selected, similar pinyin content can be obtained on the basis of the directly converted content and matched with the pinyin contents of the description information to be selected, so that the probability that a voice control instruction triggered by the user is successfully matched with description information is increased, and the probability of accurately executing voice control is in turn increased. In addition, in this embodiment, the similar phonemes of the designated phonemes can be obtained by querying the phoneme extension table, and the plurality of designated phonemes are replaced with the similar phonemes in a plurality of ways to obtain the third pinyin content. The third pinyin content is a similarity extension performed on the basis of the first pinyin content, which enlarges the matching range, raises the probability of a successful match, and thus raises the probability of accurately executing voice control.
Referring to fig. 9, a voice control method provided by the present application is applied to an electronic device, and the method includes:
s310: the method comprises the steps of obtaining a first pinyin content and obtaining a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to the obtained voice control instruction, the second pinyin contents comprise the pinyin content of description information to be selected, and the description information is information used for describing corresponding operation.
S320: and acquiring third pinyin content when the second pinyin content is not successfully matched with the first pinyin content, wherein the third pinyin content is similar to the first pinyin content.
S330: and matching the third pinyin content with the plurality of second pinyin contents, wherein when the second pinyin content is successfully matched with the third pinyin content, the description information of the corresponding second pinyin content successfully matched with the third pinyin content is used as target description information.
S340: and when the second pinyin content is not successfully matched with the third pinyin content, acquiring the similarity between the plurality of second pinyin contents and the first pinyin content respectively so as to obtain the similarity corresponding to each second pinyin content.
As shown in fig. 10, obtaining the similarity between each of the plurality of second pinyin contents and the first pinyin content to obtain the similarity corresponding to each of the second pinyin contents may include:
s341: and acquiring the first reference similarity between the plurality of second pinyin contents and the first pinyin content respectively based on the longest common subsequence to obtain the first reference similarity corresponding to each second pinyin content.
In the embodiment of the present application, the first reference similarities between the multiple second pinyin contents and the first pinyin contents may be measured by a Longest Common Subsequence (LCS), and a calculation formula of the LCS may be:
LCS(A_i, B_j) = "" (the empty string), if i = 0 or j = 0;
LCS(A_i, B_j) = LCS(A_(i-1), B_(j-1)) + a_i, if a_i = b_j;
LCS(A_i, B_j) = max{LCS(A_i, B_(j-1)), LCS(A_(i-1), B_j)}, if a_i ≠ b_j.
wherein A_i represents the character string consisting of the first i characters of the character string A, with i ranging from 0 to the length of A; similarly, B_j represents the character string consisting of the first j characters of the character string B, with j ranging from 0 to the length of B; and a_i and b_j represent the i-th character of A and the j-th character of B, respectively. Illustratively, the character string A may represent the first pinyin content and the character string B may represent a second pinyin content; if the length of the first pinyin content is 10 and the length of the second pinyin content is 9, then i ranges from 0 to 10 and j ranges from 0 to 9, and if a_10 = b_9 then LCS(A_10, B_9) = LCS(A_9, B_8) + a_10; otherwise LCS(A_10, B_9) = max{LCS(A_10, B_8), LCS(A_9, B_9)}.
LCS similarity can be defined as:
[The LCS similarity formula is presented as an image in the original publication; it defines S_LCS(A, B) in terms of LCS(A, B) and the string lengths |A| and |B|.]
wherein |A| and |B| respectively represent the lengths of the character strings A and B, i.e., the number of characters in each. For example, if the character string A is "APPLE13", then |A| = 7.
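A sketch of the LCS-based first reference similarity is given below; since the formula itself is only reproduced as an image above, the normalisation 2·|LCS(A, B)| / (|A| + |B|) used here is an assumption.

```python
# Sketch of the first reference similarity; the normalisation by (|A| + |B|) is an assumption.
def lcs_length(a: str, b: str) -> int:
    # Classic dynamic programming for the length of the longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def s_lcs(a: str, b: str) -> float:
    return 2 * lcs_length(a, b) / (len(a) + len(b)) if (a or b) else 1.0

print(round(s_lcs("shao hua", "shang hua"), 2))  # -> 0.82
```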
S342: and acquiring second reference similarity of the plurality of second pinyin contents and the first pinyin content respectively based on the editing distance to obtain second reference similarity corresponding to each second pinyin content.
In the embodiment of the present application, the difference degree between each of the plurality of second pinyin contents and the first pinyin content may be measured by an edit Distance (LEV), and the similarity and the difference degree are inversely related, so that the second reference similarity between each of the plurality of second pinyin contents and the first pinyin content may be measured by the following formula.
[The formula for the second reference similarity is presented as an image in the original publication; it defines S_LEV(A, B) so that the similarity decreases as the edit distance LEV(A, B) increases.]
The calculation formula of the LEV may be:
LEV(A_i, B_j) = max(i, j), if min(i, j) = 0;
LEV(A_i, B_j) = min{LEV(A_(i-1), B_j) + 1, LEV(A_i, B_(j-1)) + 1, LEV(A_(i-1), B_(j-1))}, if a_i = b_j;
LEV(A_i, B_j) = min{LEV(A_(i-1), B_j) + 1, LEV(A_i, B_(j-1)) + 1, LEV(A_(i-1), B_(j-1)) + 1}, if a_i ≠ b_j.
wherein A_i represents the character string consisting of the first i characters of the character string A, with i ranging from 0 to the length of A; similarly, B_j represents the character string consisting of the first j characters of the character string B, with j ranging from 0 to the length of B. Illustratively, the character string A may represent the first pinyin content and the character string B may represent a second pinyin content; if the length of the first pinyin content is 10 and the length of the second pinyin content is 9, then i ranges from 0 to 10 and j ranges from 0 to 9, and if a_10 = b_9, then LEV(A_10, B_9) = min{LEV(A_9, B_9) + 1, LEV(A_10, B_8) + 1, LEV(A_9, B_8)}; otherwise, LEV(A_10, B_9) = min{LEV(A_9, B_9) + 1, LEV(A_10, B_8) + 1, LEV(A_9, B_8) + 1}.
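A sketch of the edit-distance computation is given below; the conversion from distance to similarity, 1 − LEV(A, B) / max(|A|, |B|), is an assumption, the original formula being shown only as an image.

```python
# Sketch of the second reference similarity; the normalisation by max(|A|, |B|) is an assumption.
def lev(a: str, b: str) -> int:
    # Standard Levenshtein edit distance by dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[len(a)][len(b)]

def s_lev(a: str, b: str) -> float:
    longest = max(len(a), len(b)) or 1
    return 1.0 - lev(a, b) / longest

print(lev("shao hua", "shang hua"))  # -> 2 (one substitution and one insertion)
```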
S343: and adding the first reference similarity and the second reference similarity corresponding to each second pinyin content to obtain the similarity corresponding to each second pinyin content.
As a mode, the first reference similarity and the second reference similarity may be directly added to obtain the similarity corresponding to each second pinyin content, and the calculation formula is as follows:
S(A, B) = S_LCS(A, B) + S_LEV(A, B)
as another way, the first reference similarity and the second reference similarity may be given with respective weights, and the first reference similarity and the second reference similarity are weighted and then added to obtain the similarity corresponding to each second pinyin content, and the calculation formula is as follows:
S(A, B) = X × S_LCS(A, B) + Y × S_LEV(A, B)
wherein X + Y = 1.
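Reusing the s_lcs and s_lev functions from the two sketches above, the combined similarity and the selection of the most similar second pinyin content can be sketched as follows; the equal weights are an assumption, since the application leaves X and Y open.

```python
# Sketch of S343/S350, reusing s_lcs and s_lev from the previous two sketches.
def combined_similarity(a: str, b: str, x: float = 0.5, y: float = 0.5) -> float:
    # Weighted sum with x + y = 1; equal weights are assumed here.
    return x * s_lcs(a, b) + y * s_lev(a, b)

def rank_candidates(first_pinyin: str, second_pinyin_contents: list[str]) -> list[tuple[float, str]]:
    # Highest combined similarity first; ties are resolved later by semantic similarity (S352-S355).
    scored = [(combined_similarity(first_pinyin, c), c) for c in second_pinyin_contents]
    return sorted(scored, reverse=True)

print(rank_candidates("shao hua", ["shang hua", "zuo hua", "fan hui"])[0])  # "shang hua" ranks first
```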
S350: and taking the description information corresponding to the second pinyin content with the maximum corresponding similarity as target description information.
As shown in fig. 11, taking the description information corresponding to the second pinyin content with the largest corresponding similarity as the target description information includes:
s351: if one second pinyin content with the maximum corresponding similarity exists, the description information corresponding to the second pinyin content with the maximum corresponding similarity is used as the target description information.
S352: and if a plurality of second pinyin contents with the maximum corresponding similarity exist, acquiring a text vector of the text content corresponding to the voice control instruction as a first text vector.
In the embodiment of the present application, abbreviations or shorthand forms may appear in the user's voice control instruction, which may lead to a plurality of equally similar results being obtained by means of the longest common subsequence and the edit distance. For example: the user's voice control instruction is an abbreviation of a film title ("Avengers 4" abbreviated in Chinese), the second pinyin content set includes {"Avengers 4", "several couplets"}, the longest common subsequence of the instruction with each of the two objects to be matched is the instruction itself, and the edit distances are both 4; the calculated similarities are therefore the same, and a unique result cannot be determined. For another example: the user's voice control instruction is "B station" (a common abbreviation of Bilibili), and the second pinyin content set includes {"Bilibili", "Q Music", "A Cloud Music", "B Music"}, so a unique matching result cannot be obtained. In such cases, the similarity between each of the plurality of second pinyin contents and the first pinyin content can be measured on the basis of semantic similarity, so as to obtain the most similar second pinyin content.
As one approach, the text vectors may be obtained by a pre-trained model BERT. The BERT is a deep neural network, and can input texts to be processed into an encoder part of the BERT to obtain corresponding text vectors.
In this embodiment of the application, the text input corresponding to the first text vector may be text content corresponding to the speech control instruction obtained by the ASR module, may also be text content of a triple corresponding to the speech control instruction obtained by the ASR module and the NLP module, and may also be text content corresponding to the third pinyin.
S353: and acquiring text vectors corresponding to the description information corresponding to the second pinyin contents with the maximum similarity to obtain a plurality of second text vectors.
In this embodiment of the present application, the text input corresponding to the second text vector may be text description information of each of the multiple controls in the target interface acquired through the system program, or may also be text description information of the interface overall operation instruction, for example: left swipe, right swipe, up swipe, down swipe, back, desktop, double click, long press, etc.
It should be noted that the text input corresponding to the text vector may be a chinese character string or a pinyin character string.
Furthermore, it should be noted that in the embodiment of the present application, the text vector may also be obtained by a tool such as Doc2Vec (document vectors), or by an open-source pre-training model such as RoBERTa, UniLM, ELECTRA or XLNet.
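A sketch of obtaining text vectors with a pre-trained BERT encoder follows; it assumes the Hugging Face transformers library, the bert-base-chinese checkpoint, and mean-pooling of the last hidden states, none of which is prescribed by the present application.

```python
# Sketch of S352/S353: text vectors from a pre-trained BERT encoder.
# Assumes the Hugging Face transformers library, the bert-base-chinese checkpoint,
# and mean-pooling of the last hidden states.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def text_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Pool the token vectors of the last layer into one fixed-size vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

first_vector = text_vector("B站")                                      # first text vector (instruction text)
second_vectors = {d: text_vector(d) for d in ["哔哩哔哩", "QQ音乐"]}    # second text vectors
```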
S354: and respectively calculating the vector distance between the plurality of second text vectors and the first text vector.
As one mode, the vector distance between each second text vector and the first text vector is calculated through cosine similarity, and the calculation formula is as follows:
[The formula is presented as an image in the original publication; it gives the cosine-similarity-based vector distance between a second text vector and the first text vector.]
s355: and taking the description information corresponding to the second text vector with the minimum corresponding vector distance as the target description information.
As one mode, after the vector distances between the plurality of second text vectors and the first text vector are obtained, the vector distances may be sorted, and the description information corresponding to the second text vector with the smallest vector distance is used as the target description information.
It should be noted that, because the text vectors are continuously distributed in the high-dimensional space, the probability of two text vectors having the same similarity numerically is negligible, and therefore, the description information corresponding to the unique second text vector can be determined as the target description information.
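A sketch of S354 and S355 follows; the distance is taken here as 1 minus the cosine similarity, which is an assumption since the original formula is reproduced only as an image, and the toy vectors are purely illustrative.

```python
# Sketch of S354/S355: choose the description whose text vector is closest to the first text vector.
import math

def cosine_distance(u: list[float], v: list[float]) -> float:
    # Assumed definition: 1 - cosine similarity, so a smaller value means more similar.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def pick_target(first_vector: list[float], second_vectors: dict) -> str:
    # second_vectors maps each candidate description to its text vector (S353).
    return min(second_vectors, key=lambda d: cosine_distance(first_vector, second_vectors[d]))

print(pick_target([1.0, 0.0], {"candidate A": [0.9, 0.1], "candidate B": [0.1, 0.9]}))  # -> candidate A
```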
By the method, when the unique matching result cannot be obtained due to the existence of the abbreviation or abbreviation in the voice control instruction of the user, the vector distance between the plurality of second text vectors and the first text vector can be calculated to obtain the target description information corresponding to the unique matching result, so that the control operation corresponding to the target description information is executed, and the success rate of semantic recognition is further improved.
S360: and executing the control operation corresponding to the target description information.
It should be noted that, in the embodiment of the present application, in the process of executing S350, if it is determined that there are a plurality of second pinyin contents with the largest corresponding similarity, the text vector corresponding to the first pinyin content may also be obtained as the first text vector. Moreover, the text vectors corresponding to the third pinyin content may also be obtained as first text vectors. In this manner, a plurality of first text vectors may be obtained, in which case the vector distance between each of the plurality of first text vectors and each of the plurality of second text vectors is calculated, and the description information corresponding to the second text vector with the smallest vector distance is then taken as the target description information. For example, if the first text vectors obtained based on the third pinyin content include the first text vector L1, the first text vector L2 and the first text vector L3, and the second text vectors include the second text vector L4 and the second text vector L5, then, when calculating the vector distances, the distances between the first text vector L1 and the second text vectors L4 and L5, between the first text vector L2 and the second text vectors L4 and L5, and between the first text vector L3 and the second text vectors L4 and L5 are calculated respectively.
According to the voice control method provided by this embodiment, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted content cannot be successfully matched with the pinyin contents of the description information to be selected, similar pinyin content can be obtained on the basis of the directly converted content and matched with the pinyin contents of the description information to be selected, so that the probability that a voice control instruction triggered by the user is successfully matched with description information is increased, and the probability of accurately executing voice control is in turn increased. In addition, in this embodiment, when no second pinyin content is successfully matched with the third pinyin content, the similarity corresponding to each second pinyin content can be obtained by calculating the similarities between the plurality of second pinyin contents and the first pinyin content, and the description information corresponding to the second pinyin content with the largest similarity is taken as the target description information. This alleviates the problem that the user's description of an interface control is difficult to match because of omissions, modifications, abbreviations or alternative ways of referring to the control, so that the control operation corresponding to the target description information can still be executed, which raises the probability of accurately executing voice control.
Moreover, the solution of the present application matches the voice control instruction with the description information by means of semantic similarity: the instruction text to be matched (the text converted from the voice control instruction) is vectorized by a large-scale pre-training model, and the matching is completed by using the similarity of the vectors, which can also handle the case in which the voice control instruction differs considerably from the description information in wording but has the same meaning.
In order to better understand the solutions of all embodiments of the present application, an implementation flow of the speech control method of the present application is described below.
Referring to fig. 12, after step S4010 is executed to obtain the first pinyin content and the plurality of second pinyin contents, the first pinyin content may be matched with the plurality of second pinyin contents. When a second pinyin content is successfully matched with the first pinyin content, the description information of that second pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed. When no second pinyin content is successfully matched with the first pinyin content, the operation of acquiring the third pinyin content is executed. The third pinyin content may be obtained by querying, according to table 1, whether the phonemes included in the first pinyin content have a corresponding phoneme correspondence in the phoneme extension table, taking the phonemes determined to have a phoneme correspondence as the designated phonemes, determining the similar phonemes corresponding to the designated phonemes based on the phoneme correspondence, and replacing the designated phonemes in the first pinyin content with the similar phonemes.
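A minimal sketch of this phoneme-extension step, under assumptions, might look as follows: the similar-phoneme pairs in the two dictionaries are illustrative placeholders rather than the actual contents of table 1, and the naive split of a pinyin syllable into initial and final is invented for the example.

```python
# Illustrative placeholders only -- not the actual contents of table 1.
INITIAL_PAIRS = {"zh": "z", "z": "zh", "ch": "c", "c": "ch", "sh": "s", "s": "sh"}
FINAL_PAIRS = {"ing": "in", "in": "ing", "eng": "en", "en": "eng"}
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syllable):
    """Naively split a pinyin syllable into initial and final (illustrative)."""
    for length in (2, 1):
        if syllable[:length] in INITIALS:
            return syllable[:length], syllable[length:]
    return "", syllable

def third_pinyin_contents(first_pinyin):
    """Replace each designated phoneme (one found in the extension tables)
    with its similar phoneme, yielding candidate third pinyin contents.
    first_pinyin is a list of pinyin syllables, e.g. ["zhi", "fu"]."""
    candidates = []
    for index, syllable in enumerate(first_pinyin):
        initial, final = split_syllable(syllable)
        if initial in INITIAL_PAIRS:
            replaced = first_pinyin.copy()
            replaced[index] = INITIAL_PAIRS[initial] + final
            candidates.append(replaced)
        if final in FINAL_PAIRS:
            replaced = first_pinyin.copy()
            replaced[index] = initial + FINAL_PAIRS[final]
            candidates.append(replaced)
    return candidates

# third_pinyin_contents(["zhi", "fu"]) -> [["zi", "fu"]]
```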
After step S4050 is executed to obtain the third pinyin content, the third pinyin content may be matched with the plurality of second pinyin contents. If a second pinyin content is successfully matched with the third pinyin content, the description information of that second pinyin content is used as the target description information. When no second pinyin content is successfully matched with the third pinyin content, step S4090 may be executed: the similarity between each of the plurality of second pinyin contents and the first pinyin content is obtained, the description information corresponding to the second pinyin content with the largest similarity is used as the target description information, and the control operation corresponding to the target description information is executed.
The reference similarities between the plurality of second pinyin contents and the first pinyin content can be obtained based on the longest common subsequence and the edit distance, so that a similarity is obtained for each second pinyin content. If exactly one second pinyin content has the largest similarity, the description information corresponding to that second pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed. If a plurality of second pinyin contents share the largest similarity, the text vector of the text content corresponding to the voice control instruction is obtained as the first text vector, the text vectors corresponding to the description information of those second pinyin contents are obtained as a plurality of second text vectors, the vector distance between each second text vector and the first text vector is calculated, the description information corresponding to the second text vector with the smallest vector distance is used as the target description information, and the control operation corresponding to the target description information is executed.
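Assuming that the two reference similarities are each normalized to [0, 1] before being added (a detail the embodiment leaves open), the similarity computation described above could be sketched as follows.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    """Levenshtein distance with a single rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (ca != cb))
    return row[len(b)]

def similarity(first_pinyin, second_pinyin):
    """Sum of an LCS-based and an edit-distance-based reference similarity,
    each normalized to [0, 1] (the normalization is an assumption)."""
    longest = max(len(first_pinyin), len(second_pinyin)) or 1
    reference_1 = lcs_length(first_pinyin, second_pinyin) / longest
    reference_2 = 1 - edit_distance(first_pinyin, second_pinyin) / longest
    return reference_1 + reference_2

# e.g. similarity("bofang", "bofangqi") == 0.75 + 0.75 == 1.5
```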
Referring to fig. 13, the present application provides a voice control apparatus 600, where the apparatus 600 includes:
the first pinyin content and second pinyin content obtaining unit 610 is configured to obtain a first pinyin content and obtain a plurality of second pinyin contents, where the first pinyin content is a pinyin content corresponding to the obtained voice control instruction, the plurality of second pinyin contents include pinyin contents of description information to be selected, and the description information is information used for describing corresponding operations.
A third pinyin content obtaining unit 620, configured to obtain a third pinyin content when the second pinyin content is not successfully matched with the first pinyin content, where the third pinyin content is a pinyin content similar to the first pinyin content.
A pinyin content matching unit 630, configured to match the third pinyin content with the plurality of second pinyin contents, and use description information of successful matching between the corresponding second pinyin content and the third pinyin content as target description information.
A control operation executing unit 640, configured to execute a control operation corresponding to the target description information.
As a mode, the first pinyin content and second pinyin content obtaining unit 610 is specifically configured to obtain respective description information of a plurality of controls included in the target interface as description information to be selected; and converting the description information to be selected into corresponding pinyin contents so as to obtain a plurality of second pinyin contents.
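For illustration, the conversion of control description information into second pinyin contents could be performed with an off-the-shelf pinyin library such as pypinyin; the library choice and the example control labels are assumptions rather than part of the claimed solution.

```python
from pypinyin import lazy_pinyin  # one possible conversion library, not mandated here

def second_pinyin_contents(control_descriptions):
    """Convert each control's description information into its pinyin content.
    control_descriptions: e.g. ["播放", "下一首", "收藏"] collected from the
    target interface (the example labels are illustrative only)."""
    return [lazy_pinyin(text) for text in control_descriptions]

# second_pinyin_contents(["播放"]) -> [["bo", "fang"]]
```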
As a manner, the third pinyin content obtaining unit 620 is specifically configured to obtain similar phonemes corresponding to a designated phoneme in the first pinyin content, and replace the designated phoneme in the first pinyin content with the similar phonemes to obtain the third pinyin content. When there are a plurality of similar phonemes, the third pinyin content obtaining unit 620 is specifically configured to replace the designated phoneme in the first pinyin content with each of the plurality of similar phonemes, respectively, to obtain a phoneme-replaced first pinyin content corresponding to each of the similar phonemes, and use these as the third pinyin content. Optionally, when there are a plurality of designated phonemes, the third pinyin content obtaining unit 620 is specifically configured to combine the similar phonemes corresponding to at least two designated phonemes with each other to obtain a plurality of phoneme pairs, where each phoneme pair includes a similar phoneme corresponding to each of the at least two designated phonemes; replace the corresponding designated phonemes in the first pinyin content based on each of the plurality of phoneme pairs to obtain a first replaced pinyin content corresponding to each phoneme pair; replace the corresponding designated phoneme in the first pinyin content with the similar phoneme corresponding to each designated phoneme individually to obtain a second replaced pinyin content corresponding to each designated phoneme; and use the first replaced pinyin contents and the second replaced pinyin contents as the third pinyin content.
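A rough sketch of the multi-designated-phoneme case described above is given below, assuming for simplicity that each designated phoneme has a single similar phoneme; the `designated` mapping, the helper names, and the substring replacement strategy are invented for illustration.

```python
from itertools import combinations

def expand_with_phoneme_pairs(first_pinyin, designated):
    """`designated` maps a syllable index to (designated_phoneme, similar_phoneme).
    Replacing the phonemes at every pair of designated positions gives the first
    replaced pinyin contents; replacing each position on its own gives the second
    replaced pinyin contents. Both together form the third pinyin content."""
    def replace_at(content, positions):
        replaced = content.copy()
        for position in positions:
            old, new = designated[position]
            replaced[position] = replaced[position].replace(old, new, 1)
        return replaced

    first_replaced = [replace_at(first_pinyin, pair)
                      for pair in combinations(designated, 2)]
    second_replaced = [replace_at(first_pinyin, (position,)) for position in designated]
    return first_replaced + second_replaced

# expand_with_phoneme_pairs(["zhang", "xin"], {0: ("zh", "z"), 1: ("in", "ing")})
# -> [["zang", "xing"], ["zang", "xin"], ["zhang", "xing"]]
```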
As another mode, the third pinyin content obtaining unit 620 is specifically configured to query whether the phonemes included in the first pinyin content have a corresponding phoneme correspondence in the phoneme extension table, where each phoneme correspondence represents a pair of similar phonemes; take a phoneme determined to have a phoneme correspondence as a designated phoneme; and determine the similar phoneme corresponding to the designated phoneme based on the phoneme correspondence.
As one way, the pinyin content matching unit 630 is specifically configured to match the first pinyin content with the plurality of second pinyin contents, and execute the operation of obtaining the third pinyin content when no second pinyin content is successfully matched with the first pinyin content. Optionally, the pinyin content matching unit 630 is specifically configured to, when a second pinyin content is successfully matched with the first pinyin content, use the description information of that second pinyin content as the target description information and execute the control operation corresponding to the target description information.
As another mode, the pinyin content matching unit 630 is specifically configured to match the third pinyin content with the plurality of second pinyin contents and, when a second pinyin content is successfully matched with the third pinyin content, use the description information of that second pinyin content as the target description information; when no second pinyin content is successfully matched with the third pinyin content, obtain the similarity between each of the plurality of second pinyin contents and the first pinyin content, and use the description information corresponding to the second pinyin content with the largest similarity as the target description information. Optionally, the pinyin content matching unit 630 is specifically configured to obtain first reference similarities between the plurality of second pinyin contents and the first pinyin content, respectively, based on the longest common subsequence, to obtain a first reference similarity corresponding to each second pinyin content; obtain second reference similarities between the plurality of second pinyin contents and the first pinyin content, respectively, based on the edit distance, to obtain a second reference similarity corresponding to each second pinyin content; and add the first reference similarity and the second reference similarity corresponding to each second pinyin content to obtain the similarity corresponding to that second pinyin content. Optionally, the pinyin content matching unit 630 is specifically configured to, if exactly one second pinyin content has the largest similarity, use the description information corresponding to that second pinyin content as the target description information; if a plurality of second pinyin contents share the largest similarity, obtain the text vector of the text content corresponding to the voice control instruction as the first text vector, obtain the text vectors corresponding to the description information of those second pinyin contents as a plurality of second text vectors, calculate the vector distance between each second text vector and the first text vector, and use the description information corresponding to the second text vector with the smallest vector distance as the target description information.
An electronic device provided by the present application will be described below with reference to fig. 14.
Referring to fig. 14, based on the voice control method and apparatus, an electronic device 1000 capable of executing the voice control method is further provided in the embodiment of the present application. The electronic device 1000 includes one or more processors 102 (only one shown), a memory 104, a camera 106, and an audio capture device 108 coupled to each other. The memory 104 stores programs that can execute the content of the foregoing embodiments, and the processor 102 can execute the programs stored in the memory 104.
The processor 102 may include one or more processing cores. The processor 102 connects various parts of the electronic device 1000 using various interfaces and lines, and performs the various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 104 and invoking data stored in the memory 104. Optionally, the processor 102 may be implemented in hardware in at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA) form. The processor 102 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 102 and may instead be implemented by a separate communication chip. In one approach, the processor 102 may be a neural network chip, for example an embedded neural-network processing unit (NPU).
The memory 104 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 104 may be used to store instructions, programs, code sets, or instruction sets. The memory 104 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like.
Furthermore, the electronic device 1000 may further include a network module 110 and a sensor module 112 in addition to the aforementioned components.
The network module 110 is used for implementing information interaction between the electronic device 1000 and other devices, for example, transmitting device control instructions, manipulation request instructions, status information acquisition instructions, and the like. When the electronic device 1000 is embodied as a different device, the corresponding network module 110 may differ.
The sensor module 112 may include at least one sensor. Specifically, the sensor module 112 may include, but is not limited to: a level sensor, a light sensor, a motion sensor, a pressure sensor, an infrared heat sensor, a distance sensor, an acceleration sensor, and other sensors.
Among other things, the pressure sensor may detect the pressure generated by pressing on the electronic device 1000. That is, the pressure sensor detects pressure generated by contact or pressing between the user and the electronic device, for example, contact or pressing between the user's ear and the mobile terminal. Thus, the pressure sensor may be used to determine whether contact or pressure has occurred between the user and the electronic device 1000, as well as the magnitude of the pressure.
The acceleration sensor may detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and may be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the electronic device 1000, and related functions (such as pedometer and tapping) for vibration recognition. In addition, the electronic device 1000 may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer and a thermometer, which are not described herein again.
The audio capture device 108 is used for capturing audio signals. Optionally, the audio capture device 108 includes a plurality of audio capture elements, which may be microphones.
As one mode, the network module of the electronic device 1000 is a radio frequency module, and the radio frequency module is configured to receive and transmit electromagnetic waves, and implement interconversion between the electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices. The radio frequency module may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so forth. For example, the radio frequency module may interact with an external device through transmitted or received electromagnetic waves. For example, the radio frequency module may send instructions to the target device.
Referring to fig. 15, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 800 has stored therein program code that can be called by a processor to execute the methods described in the above-described method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 800 includes a non-volatile computer-readable storage medium. The computer readable storage medium 800 has storage space for program code 810 to perform any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 810 may be compressed, for example, in a suitable form.
In summary, according to the voice control method, apparatus, and electronic device provided by the present application, the pinyin content corresponding to the voice control instruction is obtained as the first pinyin content, and the pinyin contents of the description information to be selected are obtained as a plurality of second pinyin contents. If it is determined that no second pinyin content is successfully matched with the first pinyin content, pinyin content similar to the first pinyin content is obtained as the third pinyin content, the third pinyin content is matched with the plurality of second pinyin contents, the description information of the second pinyin content successfully matched with the third pinyin content is used as the target description information, and the control operation corresponding to the target description information is executed.
Therefore, after the pinyin content directly converted from the voice control instruction is obtained, if that directly converted content cannot be successfully matched with the pinyin content of the description information to be selected, similar pinyin content can be derived from it and matched with the pinyin content of the description information to be selected. This improves the probability that a voice control instruction triggered by a user is successfully matched with description information, and thus the probability of accurately executing voice control.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or replacements do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A method for voice control, the method comprising:
acquiring a first pinyin content and a plurality of second pinyin contents, wherein the first pinyin content is the pinyin content corresponding to the acquired voice control instruction, the second pinyin contents comprise the pinyin content of description information to be selected, and the description information is information for describing corresponding operation;
obtaining third pinyin content when the second pinyin content is not successfully matched with the first pinyin content, wherein the third pinyin content is pinyin content similar to the first pinyin content;
matching the third pinyin content with the plurality of second pinyin contents, and taking the description information of the successful matching of the corresponding second pinyin content and the third pinyin content as target description information;
and executing the control operation corresponding to the target description information.
2. The method of claim 1, wherein the obtaining the third pinyin content includes:
acquiring similar phonemes corresponding to the designated phonemes in the first pinyin content;
and replacing the designated phoneme in the first pinyin content with the similar phoneme to obtain the third pinyin content.
3. The method of claim 2, wherein there are a plurality of the similar phonemes, and wherein replacing the designated phoneme in the first pinyin content with the similar phonemes to obtain the third pinyin content includes:
replacing the designated phoneme in the first pinyin content with each of the plurality of similar phonemes, respectively, to obtain a phoneme-replaced first pinyin content corresponding to each of the similar phonemes, and using the phoneme-replaced first pinyin contents as the third pinyin content.
4. The method of claim 2, wherein there are a plurality of the designated phonemes, and wherein replacing the designated phonemes in the first pinyin content with the similar phonemes to obtain the third pinyin content includes:
combining the similar phonemes corresponding to at least two designated phonemes with each other to obtain a plurality of phoneme pairs, wherein each phoneme pair comprises a similar phoneme corresponding to each of the at least two designated phonemes;
respectively replacing the corresponding designated phonemes in the first pinyin content based on the plurality of phoneme pairs to obtain a first replaced pinyin content corresponding to each phoneme pair;
respectively replacing the corresponding designated phoneme in the first pinyin content with the similar phoneme corresponding to each designated phoneme to obtain a second replaced pinyin content corresponding to each designated phoneme;
and taking the first replaced pinyin contents and the second replaced pinyin contents as the third pinyin content.
5. The method of claim 2, wherein the obtaining similar phones corresponding to the designated phone in the first pinyin content includes:
inquiring whether the phonemes included in the first pinyin content have corresponding phoneme correspondences in a phoneme extension table, wherein each phoneme correspondence represents a pair of similar phonemes;
and taking a phoneme determined to have a phoneme correspondence as a designated phoneme, and determining the similar phoneme corresponding to the designated phoneme based on the phoneme correspondence.
6. The method of claim 1, further comprising:
when the second pinyin content is successfully matched with the first pinyin content, using the description information of the successful matching of the corresponding second pinyin content and the first pinyin content as target description information;
and executing the control operation corresponding to the target description information.
7. The method of claim 1, wherein the matching the third pinyin content with the plurality of second pinyin contents and the using description information of successful matching of the corresponding second pinyin content with the third pinyin content as target description information includes:
matching the third pinyin content with the plurality of second pinyin contents, wherein when the second pinyin content is successfully matched with the third pinyin content, the description information of the corresponding second pinyin content successfully matched with the third pinyin content is used as target description information;
when the second pinyin content is not successfully matched with the third pinyin content, acquiring the similarity between a plurality of second pinyin contents and the first pinyin content respectively to obtain the similarity corresponding to each second pinyin content;
and taking the description information corresponding to the second pinyin content with the maximum similarity as the target description information.
8. The method of claim 7, wherein the obtaining the similarity between each of the plurality of second pinyin contents and the first pinyin content to obtain the similarity corresponding to each of the plurality of second pinyin contents includes:
acquiring first reference similarities between the plurality of second pinyin contents and the first pinyin content, respectively, based on the longest common subsequence, to obtain a first reference similarity corresponding to each second pinyin content;
acquiring second reference similarities between the plurality of second pinyin contents and the first pinyin content, respectively, based on the edit distance, to obtain a second reference similarity corresponding to each second pinyin content;
and adding the first reference similarity and the second reference similarity corresponding to each second pinyin content to obtain the similarity corresponding to each second pinyin content.
9. The method of claim 7, wherein the using the description information corresponding to the second pinyin content with the highest similarity as the target description information includes:
if one second pinyin content with the maximum corresponding similarity exists, the description information corresponding to the second pinyin content with the maximum corresponding similarity is used as target description information;
if a plurality of second pinyin contents with the maximum corresponding similarity exist, acquiring a text vector of the text content corresponding to the voice control instruction as a first text vector;
acquiring text vectors corresponding to description information corresponding to the second pinyin contents with the maximum similarity to obtain a plurality of second text vectors;
respectively calculating vector distances between a plurality of second text vectors and the first text vector;
and taking the description information corresponding to the second text vector with the minimum corresponding vector distance as the target description information.
10. The method of claim 1, wherein obtaining a plurality of second pinyin contents includes:
obtaining the respective description information of a plurality of controls included in a target interface as description information to be selected;
and converting the description information to be selected into corresponding pinyin contents so as to obtain a plurality of second pinyin contents.
11. A voice control apparatus, characterized in that the apparatus comprises:
a first pinyin content and second pinyin content obtaining unit, configured to obtain a first pinyin content and obtain a plurality of second pinyin contents, where the first pinyin content is a pinyin content corresponding to the obtained voice control instruction, the plurality of second pinyin contents include pinyin contents of description information to be selected, and the description information is information used for describing corresponding operations;
a third pinyin content obtaining unit, configured to obtain a third pinyin content when the second pinyin content is not successfully matched with the first pinyin content, where the third pinyin content is a pinyin content similar to the first pinyin content;
the pinyin content matching unit is used for matching the third pinyin content with the plurality of second pinyin contents and taking the description information of the successful matching of the corresponding second pinyin content and the third pinyin content as target description information;
and the control operation execution unit is used for executing the control operation corresponding to the target description information.
12. An electronic device comprising one or more processors and memory;
one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-10.
13. A computer-readable storage medium, having program code stored therein, wherein the method of any of claims 1-10 is performed when the program code is run.
14. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any of claims 1-10.
CN202111296079.3A 2021-11-03 2021-11-03 Voice control method and device and electronic equipment Pending CN114049890A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111296079.3A CN114049890A (en) 2021-11-03 2021-11-03 Voice control method and device and electronic equipment
PCT/CN2022/107788 WO2023077878A1 (en) 2021-11-03 2022-07-26 Speech control method and apparatus, electronic device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111296079.3A CN114049890A (en) 2021-11-03 2021-11-03 Voice control method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114049890A true CN114049890A (en) 2022-02-15

Family

ID=80207170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111296079.3A Pending CN114049890A (en) 2021-11-03 2021-11-03 Voice control method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN114049890A (en)
WO (1) WO2023077878A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023077878A1 (en) * 2021-11-03 2023-05-11 杭州逗酷软件科技有限公司 Speech control method and apparatus, electronic device, and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7302389B2 (en) * 2003-05-14 2007-11-27 Lucent Technologies Inc. Automatic assessment of phonological processes
CN104238991B (en) * 2013-06-21 2018-05-25 腾讯科技(深圳)有限公司 Phonetic entry matching process and device
CN109360555A (en) * 2017-12-29 2019-02-19 广州Tcl智能家居科技有限公司 A kind of Internet of Things sound control method, device and storage medium
CN109741741A (en) * 2018-12-29 2019-05-10 深圳Tcl新技术有限公司 Control method, intelligent terminal and the computer readable storage medium of intelligent terminal
CN111554297B (en) * 2020-05-15 2023-08-22 阿波罗智联(北京)科技有限公司 Speech recognition method, device, equipment and readable storage medium
CN112114926A (en) * 2020-09-25 2020-12-22 北京百度网讯科技有限公司 Page operation method, device, equipment and medium based on voice recognition
CN112634903B (en) * 2020-12-15 2023-09-29 平安科技(深圳)有限公司 Quality inspection method, device, equipment and storage medium for service voice
CN114049890A (en) * 2021-11-03 2022-02-15 杭州逗酷软件科技有限公司 Voice control method and device and electronic equipment


Also Published As

Publication number Publication date
WO2023077878A1 (en) 2023-05-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination