CN112114926B - Page operation method, device, equipment and medium based on voice recognition - Google Patents

Page operation method, device, equipment and medium based on voice recognition

Info

Publication number
CN112114926B
CN112114926B
Authority
CN
China
Prior art keywords
control element
recognition result
pinyin
voice recognition
target control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011028860.8A
Other languages
Chinese (zh)
Other versions
CN112114926A (en)
Inventor
向伟
许峻华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Original Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Intelligent Connectivity Beijing Technology Co Ltd filed Critical Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority to CN202011028860.8A priority Critical patent/CN112114926B/en
Publication of CN112114926A publication Critical patent/CN112114926A/en
Priority to JP2021046331A priority patent/JP7242737B2/en
Priority to KR1020210040285A priority patent/KR20210042853A/en
Application granted granted Critical
Publication of CN112114926B publication Critical patent/CN112114926B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221 Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a page operation method, apparatus, electronic device, and medium based on voice recognition, relating to the field of natural language processing, and in particular to speech recognition, voice interaction, and cloud computing. In the method of page operation based on speech recognition, a page comprises at least one control element, and the method comprises: recognizing received speech to obtain a speech recognition result; acquiring a textual description in Chinese set for each control element; determining a target control element from the at least one control element, wherein the pinyin of the textual description of the target control element matches the pinyin of the speech recognition result; and executing a control operation associated with the target control element and displaying the speech recognition result, wherein, in the case that the speech recognition result does not match the textual description of the target control element, the speech recognition result is replaced with that textual description for display.

Description

Page operation method, device, equipment and medium based on voice recognition
Technical Field
The present application relates to the field of natural language processing, in particular to the fields of speech recognition, voice interaction, and cloud computing, and more particularly to a method, apparatus, device, and medium for page operation based on speech recognition.
Background
When a control operation is to be performed on a control element of a page, the user can either click the control element directly or operate it by voice. However, when the related art performs the control operation by voice, errors in the speech recognition result lead to a low recognition rate, which degrades the user experience.
Disclosure of Invention
The application provides a page operation method, apparatus, device, and storage medium based on voice recognition.
According to a first aspect, the present application provides a method of operating a page based on speech recognition, the page comprising at least one control element, the method comprising: recognizing received speech to obtain a speech recognition result; acquiring a textual description in Chinese set for each control element; determining a target control element from the at least one control element, wherein the pinyin of the textual description of the target control element matches the pinyin of the speech recognition result; and executing a control operation associated with the target control element and displaying the speech recognition result, wherein, in the case that the speech recognition result does not match the textual description of the target control element, the speech recognition result is replaced with that textual description for display.
According to a second aspect, the present application provides a page operation apparatus based on speech recognition, the page comprising at least one control element, the apparatus comprising a recognition module, an acquisition module, a determination module, and a display module. The recognition module recognizes received speech to obtain a speech recognition result; the acquisition module acquires the textual description in Chinese set for each control element; the determination module determines a target control element from the at least one control element, wherein the pinyin of the textual description of the target control element matches the pinyin of the speech recognition result; and the display module executes the control operation associated with the target control element and displays the speech recognition result, replacing the speech recognition result with the textual description of the target control element for display in the case that the two do not match.
According to a third aspect, the present application provides an electronic device comprising: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a fourth aspect, the present application provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above method.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 schematically illustrates an application scenario of a speech recognition based page operation according to an embodiment of the present application;
FIG. 2 schematically illustrates a flow chart of a method of page operation based on speech recognition according to an embodiment of the application;
FIG. 3 schematically illustrates a flow chart of determining target control elements according to an embodiment of the application;
FIG. 4 schematically illustrates replacement of a speech recognition result according to an embodiment of the present application;
FIG. 5 schematically illustrates replacement of a speech recognition result according to another embodiment of the present application;
FIG. 6 schematically illustrates replacement of a speech recognition result according to another embodiment of the present application;
FIG. 7 schematically illustrates replacement of a speech recognition result according to another embodiment of the present application;
FIG. 8 schematically illustrates a flow chart of a method of page operation based on speech recognition according to another embodiment of the application;
FIG. 9 schematically illustrates replacement of a speech recognition result according to another embodiment of the present application;
FIG. 10 schematically illustrates a page diagram according to an embodiment of the application;
FIG. 11 schematically illustrates a page diagram according to another embodiment of the application;
FIG. 12 schematically illustrates a block diagram of a speech recognition based page operating apparatus in accordance with an embodiment of the present application; and
FIG. 13 is a block diagram of an electronic device for implementing a speech recognition based page operating method in accordance with an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B, and C, etc." is used, it should in general be interpreted as one of skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together).
The embodiment of the application provides a page operation method based on voice recognition, wherein a page comprises at least one control element, and the method comprises the following steps: recognizing received speech to obtain a speech recognition result; acquiring a textual description in Chinese set for each control element; determining a target control element from the at least one control element, the pinyin of whose textual description matches the pinyin of the speech recognition result; and executing a control operation associated with the target control element and displaying the speech recognition result, wherein, in the case that the speech recognition result does not match the textual description of the target control element, the speech recognition result is replaced with that textual description for display.
Fig. 1 schematically shows an application scenario of a speech recognition based page operation according to an embodiment of the application.
As shown in fig. 1, an application scenario 100 of an embodiment of the present application includes, for example, a page 110. The page 110 may be a page displayed on an electronic device. Electronic devices may include, but are not limited to, smartphones, computers, smart speakers.
The page 110 displays, for example, a plurality of control elements. The electronic device may perform the control operation associated with a control element in response to the user's click operation or touch operation on that element. A click operation may, for example, be performed by an input device such as a mouse; when the electronic device includes a touch screen, a touch operation may be performed by the user's finger.
In one example, the control element may be text or a picture. For example, the control elements 111, 112, 113 are text, and the control elements 114, 115 are pictures. The user may click or touch each control element, and the electronic device may perform a control operation associated with the control element in response to the user's click or touch.
For example, control element 111 may be the word "movie", control element 112 may be the words "television series", control element 113 may be the word "documentary", control element 114 may be a picture of a certain movie (e.g., the movie "Hero"), and control element 115 may be a picture of a certain television series (e.g., the series "Journey to the West").
When the user clicks or touches control element 111, the electronic device responds by recommending a movie list to the user. When the user clicks or touches control element 112, the electronic device responds by recommending a list of television series. When the user clicks or touches control element 113, the electronic device responds by recommending a list of documentaries. When the user clicks or touches control element 114, the electronic device responds by playing the movie "Hero" for the user, and when the user clicks or touches control element 115, the electronic device responds by playing the television series "Journey to the West".
In another embodiment, the user may operate the individual control elements by voice interaction. For example, when the user wishes to view the playlist of movies, the user may say "movie", and the electronic device, in response, performs the control operation associated with control element 111 to recommend the movie list. When the user wishes to view the playlist of television series, the user may say "television series", and the electronic device performs the control operation associated with control element 112 to recommend the list of television series. When the user wishes to view the list of documentaries, the user may say "documentary", and the electronic device performs the control operation associated with control element 113 to recommend the list of documentaries. When the user wishes to watch the movie "Hero", the user may say "Hero", and the electronic device performs the control operation associated with control element 114 to play the movie "Hero". When the user wishes to watch the television series "Journey to the West", the user may say "Journey to the West", and the electronic device performs the control operation associated with control element 115 to play that series.
According to the embodiment of the application, operating the control elements on the page by voice improves the efficiency of page operation. In addition, operating the page by voice interaction reduces the complexity of page operation and improves the user experience.
The embodiment of the application provides a page operation method based on voice recognition, and the page operation method based on voice recognition according to an exemplary embodiment of the application is described below with reference to fig. 2 to 11 in combination with the application scenario of fig. 1.
Fig. 2 schematically shows a flow chart of a method of page operation based on speech recognition according to an embodiment of the application.
In an embodiment of the present application, a page of an electronic device may include at least one control element, each control element including a textual description about the control element. The user may operate on control elements in the page by voice.
As shown in fig. 2, the page operation method 200 based on voice recognition according to the embodiment of the present application may include operations S210 to S240, for example.
In operation S210, the received voice is recognized, and a voice recognition result is obtained.
In operation S220, a text description in chinese form set for each control element is acquired.
In operation S230, a target control element is determined from the at least one control element, and pinyin of the text description of the target control element matches pinyin of the speech recognition result.
In operation S240, a control operation associated with the target control element is performed and a voice recognition result is displayed, and in the case where the voice recognition result does not match the text description of the target control element, the voice recognition result is replaced with the text description of the target control element to be displayed.
According to an embodiment of the application, the textual description of a control element may be used to define that control element, and the description may be in Chinese. A control element of the present application may include, but is not limited to, text, a picture, or a combination of the two. When the control element is text, its textual description may be the element itself: for example, when the control element is the text "movie", its textual description is "movie", and that description is displayed on the page. When the control element is a picture, its textual description is text that describes the element: for example, when the control element is a picture for the movie "Hero", its textual description may be "Hero", which may be stored in the underlying layer without being displayed on the page. When the control element is a combination of a picture and text, its textual description may be the text contained in the element: for example, when the control element is a picture for the movie "Hero" together with the text "Hero" displayed adjacent to the picture, the textual description is the text "Hero", which is displayed on the page.
In the embodiment of the application, when the user's speech is received, it is recognized to obtain a speech recognition result, which may be text. After the speech recognition result is obtained, its pinyin is matched against the pinyin of the textual description of each of the at least one control element, and the control element whose description's pinyin matches the pinyin of the speech recognition result is taken as the target control element.
Since the speech recognition result is text, the textual description of the target control element can be compared with the speech recognition result after the target control element is determined. If they do not match, the speech recognition contained an error; directly displaying the erroneous recognition result on the page would give the user a poor experience. Because the pinyin of the speech recognition result matches the pinyin of the target control element's textual description, it can be determined that the user's speech was directed at the target control element, and that the recognition result differs from that textual description only because of a recognition error.
For example, suppose the speech recognition result is "电势剧" (literally "electric-potential series", a homophonic mis-recognition) while the textual description of a control element is "电视剧" ("television series"). The pinyin of both is "dianshiju", so the control element whose description is "电视剧" is determined as the target control element. The speech recognition result "电势剧" is then compared with the description "电视剧"; since the two do not match, the recognition result is erroneous, and it is replaced with the textual description "电视剧" for display. That is, the replaced speech recognition result is "电视剧", and the correct text is shown on the page, improving the user's viewing experience.
Therefore, in order to correct recognition errors, in the embodiment of the application the speech recognition result is replaced with the textual description of the target control element whenever the two do not match, and the replaced result is displayed on the page. The technical scheme of the embodiment thus improves the accuracy of what the page displays, i.e., the page shows the correct speech recognition result, which improves the user's experience of viewing the page.
In one example, the control operation associated with the target control element may be performed directly after the target control element is determined by matching the pinyin of the speech recognition result against the pinyin of each textual description. Executing the control operation as soon as the pinyin matches improves the response speed of the page operation and reduces its response time.
In another example, when the speech recognition result and the textual description of the target control element do not match, the control operation associated with the target control element may be performed at the same time as, or after, the speech recognition result is replaced with the textual description and displayed on the page. That is, the control operation executes when the corrected recognition result is displayed, so the user perceives the display of the correct result and the page's control operation as nearly synchronous, reducing the perceived delay between the two and improving the user experience.
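The flow of operations S210-S240 can be sketched as follows. This is a minimal illustration, not the patented implementation: the pinyin table is a tiny hard-coded stand-in for a real pinyin converter (a production system would use an ASR engine plus a pinyin library such as pypinyin), and the names `handle_voice`, `PINYIN`, and the element dictionary are hypothetical.

```python
# Minimal sketch of operations S210-S240. PINYIN is a hypothetical
# hard-coded table standing in for a real pinyin converter.
PINYIN = {
    "电影": "dianying",     # "movie"
    "电视剧": "dianshiju",  # "television series"
    "电势剧": "dianshiju",  # homophonic mis-recognition of "television series"
    "纪录片": "jilupian",   # "documentary"
}

def to_pinyin(text: str) -> str:
    return PINYIN.get(text, "")

def handle_voice(recognition_result: str, control_elements: dict):
    """control_elements maps each textual description to its control action."""
    result_py = to_pinyin(recognition_result)
    if not result_py:
        return None, recognition_result  # unknown utterance: nothing to match
    for description, action in control_elements.items():
        if to_pinyin(description) == result_py:  # S230: match by pinyin
            action()  # S240: perform the associated control operation
            # Display rule of S240: if the recognized text differs from the
            # element's description, show the description instead.
            shown = description if recognition_result != description else recognition_result
            return description, shown
    return None, recognition_result

elements = {"电影": lambda: None, "电视剧": lambda: None, "纪录片": lambda: None}
matched, shown = handle_voice("电势剧", elements)
assert matched == "电视剧" and shown == "电视剧"
```

Here the erroneous result "电势剧" still reaches the "电视剧" element via pinyin, and the page displays the corrected text.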
FIG. 3 schematically illustrates a flow chart for determining target control elements according to an embodiment of the application.
As shown in fig. 3, in the embodiment of the present application, determining, from at least one control element, a control element whose pinyin of a text description matches that of a speech recognition result as a target control element may include, for example, operations S321 to S323.
In operation S321, the voice recognition result is converted into pinyin.
In operation S322, the text description of each control element is converted into pinyin.
In operation S323, the pinyin of the speech recognition result is matched against the pinyin of the textual description of each control element, and the control element whose pinyin matches the pinyin of the speech recognition result is determined as the target control element.
In the embodiment of the application, since the speech recognition result is text, it can be converted into pinyin, and the textual description of each control element can likewise be converted into pinyin. The converted pinyin of the speech recognition result is then matched against the converted pinyin of each textual description to determine, from the at least one control element, the target control element, whose textual-description pinyin matches the pinyin of the speech recognition result.
According to the embodiment of the application, converting both the speech recognition result and each textual description into pinyin before matching improves the matching accuracy: it avoids the situation in which a correct utterance that was mis-recognized as erroneous text fails to match any control element's textual description. That is, pinyin matching determines the target control element of the user's speech more quickly and accurately, improving both the matching accuracy and the matching efficiency.
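Operations S321-S323 can be illustrated with per-character pinyin conversion. The character table below is a tiny hypothetical subset (a real system would use a full pinyin dictionary), chosen to show why a homophone error such as "电势剧" for "电视剧" is harmless once both sides are compared in pinyin.

```python
# Per-character pinyin conversion (S321/S322), sketched with a tiny
# illustrative character table; characters outside the table pass through
# unchanged so unrelated text never accidentally matches.
CHAR_PINYIN = {"电": "dian", "视": "shi", "势": "shi", "剧": "ju", "影": "ying"}

def text_to_pinyin(text: str) -> str:
    return "".join(CHAR_PINYIN.get(ch, ch) for ch in text)

def find_target(recognition_result, descriptions):
    """S323: return the description whose pinyin matches, or None."""
    result_py = text_to_pinyin(recognition_result)  # S321
    for desc in descriptions:
        if text_to_pinyin(desc) == result_py:       # S322 + match
            return desc
    return None

# "势" and "视" share the pinyin "shi", so the mis-recognition still matches.
assert text_to_pinyin("电势剧") == text_to_pinyin("电视剧") == "dianshiju"
assert find_target("电势剧", ["电影", "电视剧"]) == "电视剧"
```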
Fig. 4 schematically shows a schematic diagram of an alternative speech recognition result according to an embodiment of the application.
Fig. 4 takes as an example a page on which a plurality of control elements are displayed. The pinyin of the speech recognition result is matched against the pinyin of each control element's textual description to determine the target control element; the speech recognition result is then compared with the target element's textual description, and if the two do not match, the speech recognition result is replaced with that description.
For example, the control elements displayed on the page include both text and pictures: control elements 401, 402, 403, 404, 405 are, respectively, the word "movie", the words "television series", the word "documentary", a picture of a certain movie (e.g., the movie "Hero"), and a picture of a certain television series (e.g., "Journey to the West"). For the text-type control elements 401, 402, 403, the textual description of each element is the element itself. For the picture-type control elements 404, 405, the textual descriptions are, for example, "Hero" and "Journey to the West". The speech recognition result 406 is, for example, "电势剧", a homophonic mis-recognition of "电视剧" ("television series"). By matching the pinyin of the recognition result against the pinyin of each textual description, the matching element is determined as the target control element: the pinyin "dianshiju" of the description "电视剧" matches the pinyin "dianshiju" of the recognition result "电势剧", so the control element "电视剧" is determined as the target control element.
Next, the speech recognition result "电势剧" is compared with the textual description "电视剧" of the target control element. Because the two are inconsistent, the speech recognition result 406 is replaced with the textual description of the target control element to obtain a replaced speech recognition result 406', which is, for example, "电视剧" ("television series").
Fig. 5 schematically shows a schematic diagram of an alternative speech recognition result according to another embodiment of the application.
Fig. 5 likewise takes as an example a page on which a plurality of control elements are displayed, where the textual description of each control element includes a plurality of sub-portions. The pinyin of the speech recognition result is matched against the pinyin of each sub-portion of each control element, and a control element having at least one sub-portion whose pinyin matches the pinyin of the recognition result is determined as the target control element. The speech recognition result is then compared with that sub-portion of the target control element, and if the two do not match, the recognition result is replaced with the sub-portion of the textual description for display.
For example, the plurality of control elements displayed on the page include text and pictures. The plurality of control elements 501, 502, 503, 504, 505 are, respectively, the text "movie", the text "TV series", the text "documentary", a picture of a certain movie (e.g., the movie "Hero"), and a picture of a certain TV series (e.g., the series "Journey to the West", 西游记). For the text-type control elements 501, 502, 503, the textual description of each control element is the control element itself. For the picture-type control elements 504, 505, the textual description of each of the control elements 504, 505 includes, for example, a plurality of sub-portions.
Taking the control element 505 as an example, the textual description of the control element 505 includes, for example, a plurality of sub-portions 505A, 505B, 505C, which are, for example, "西游记" ("Journey to the West"), "Actor XXX", and "Episode 25", respectively.
The speech recognition result 506 is, for example, "嬉游记", a misrecognized homophone of the series title "西游记". The pinyin of the speech recognition result "嬉游记" is matched against the pinyin of each of the plurality of sub-portions of each control element. For example, the pinyin of "嬉游记" is first matched against the pinyin of each sub-portion of the control element 504; if none of them matches, matching continues against the pinyin of each sub-portion of the control element 505 to obtain a matching result. The matching result is, for example, that the pinyin "xiyouji" of the sub-portion 505A ("西游记") of the control element 505 matches the pinyin "xiyouji" of the speech recognition result "嬉游记", so the control element 505 is determined as the target control element.
Next, the speech recognition result "嬉游记" is compared with the sub-portion "西游记" of the textual description of the target control element. Because the two are inconsistent, the speech recognition result 506 is replaced with the sub-portion "西游记" of the textual description of the target control element to obtain a replaced speech recognition result 506', which is, for example, "西游记" ("Journey to the West").
It may be understood that, since the textual description of a control element in the embodiment of the present application includes a plurality of sub-portions, the target control element is determined by matching the pinyin of the speech recognition result against the pinyin of each sub-portion. The textual description of the resulting target control element therefore contains a sub-portion whose pinyin matches that of the speech recognition result. This sub-portion is then compared with the speech recognition result, and if the two do not match, the speech recognition result is replaced with the sub-portion. That is, when the speech recognition result is replaced, it is replaced with the specific sub-portion of the textual description of the target control element, so that the replacement is targeted and the replaced speech recognition result better meets the user's expectation.
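The sub-portion matching described above can be sketched as follows; the element data, the tiny pinyin table, and the function names are illustrative assumptions rather than the embodiment's actual implementation.

```python
# Sketch of matching a recognition result against description sub-portions.
PINYIN = {"西": "xi", "嬉": "xi", "游": "you", "记": "ji"}

def to_pinyin(text):
    """Concatenated pinyin of a string (unknown characters pass through)."""
    return "".join(PINYIN.get(ch, ch) for ch in text)

def find_target_subportion(recognized, elements):
    """elements maps an element id to the list of sub-portions making up its
    textual description; return (element_id, sub_portion) whose pinyin matches
    the recognition result, or None when nothing matches."""
    target = to_pinyin(recognized)
    for elem_id, sub_portions in elements.items():
        for sub in sub_portions:
            if to_pinyin(sub) == target:
                return elem_id, sub
    return None

elements = {"505": ["西游记", "Actor XXX", "Episode 25"]}
match = find_target_subportion("嬉游记", elements)
```

The returned sub-portion ("西游记") is exactly the text used to replace the misrecognized result for display.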
Fig. 6 schematically shows a schematic diagram of an alternative speech recognition result according to another embodiment of the application.
As shown in fig. 6, the speech recognition result includes, for example, a passage of text. A keyword in the speech recognition result is determined, the pinyin of the keyword is matched against the pinyin of the textual description of each control element, and the control element whose pinyin matches the pinyin of the keyword is determined as the target control element. Next, the keyword in the speech recognition result is compared with the textual description of the target control element, and if the two do not match, the keyword in the speech recognition result is replaced with the textual description of the target control element for display.
For example, the plurality of control elements displayed on the page include text and pictures. The plurality of control elements 601, 602, 603, 604, 605 are, respectively, the text "movie", the text "TV series", the text "documentary", a picture of a certain movie (e.g., the movie "Hero"), and a picture of a certain TV series (e.g., the series "Journey to the West", 西游记). For the text-type control elements 601, 602, 603, the textual description of each control element is the control element itself. For the picture-type control elements 604, 605, the textual descriptions of the control elements 604, 605 are, for example, "Hero" and "Journey to the West", respectively.
The speech recognition result 606 is, for example, "请播放嬉游记" ("please play Xiyouji", where "嬉游记" is a misrecognized homophone of the series title "西游记"), i.e., a passage of text. A keyword 606A in the speech recognition result 606 may be determined. For example, the part of speech of each word in the speech recognition result 606 is determined, and the noun in the speech recognition result 606 is taken as the keyword; e.g., "嬉游记" in the speech recognition result 606 is taken as the keyword 606A.
Next, the pinyin of the keyword 606A (i.e., "嬉游记") in the speech recognition result 606 is matched against the pinyin of the textual description of each control element, and the matching control element is determined as the target control element. For example, the pinyin "xiyouji" of the textual description "西游记" of the control element 605 matches the pinyin "xiyouji" of the keyword "嬉游记", so the control element 605 is determined as the target control element.
Next, the keyword "嬉游记" in the speech recognition result 606 is compared with the textual description "西游记" of the target control element. Because the two are inconsistent, the keyword "嬉游记" in the speech recognition result 606 is replaced with the textual description "西游记" of the target control element, so that a replaced speech recognition result 606' is obtained. The replaced speech recognition result 606' is, for example, "请播放西游记" ("please play Journey to the West"), in which the noun keyword 606A' is "西游记".
It will be appreciated that, when the speech recognition result includes a passage of text, the text other than the keyword is generally common wording, so its recognition accuracy is usually high. The target control element can therefore be determined by determining the keyword in the speech recognition result and matching the pinyin of the keyword against the pinyin of the textual description of each control element. The keyword is then compared with the textual description of the target control element, and if the two do not match, the keyword is replaced with the textual description of the target control element. That is, when the speech recognition result is replaced, only the keyword is matched and replaced in a targeted manner. This improves the efficiency of matching and replacement, reduces the amount of computation required, and keeps the replaced speech recognition result close to the original one, so that the replaced result better meets the user's expectation.
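The keyword-based variant can be sketched as follows. The toy part-of-speech table stands in for a real POS tagger, which the embodiment assumes but does not specify; all names and data here are illustrative assumptions.

```python
# Sketch of keyword extraction and targeted replacement in a sentence.
PINYIN = {"西": "xi", "嬉": "xi", "游": "you", "记": "ji"}
POS = {"请": "v", "播放": "v", "嬉游记": "n"}  # toy tagging of known tokens

def to_pinyin(text):
    """Concatenated pinyin of a string (unknown characters pass through)."""
    return "".join(PINYIN.get(ch, ch) for ch in text)

def extract_noun_keyword(tokens):
    """Return the first token tagged as a noun, treated as the keyword."""
    for tok in tokens:
        if POS.get(tok) == "n":
            return tok
    return None

def replace_keyword(sentence, tokens, descriptions):
    """Replace a homophone-misrecognized keyword with a matching description,
    leaving the rest of the sentence untouched."""
    keyword = extract_noun_keyword(tokens)
    if keyword is None:
        return sentence
    for desc in descriptions:
        if to_pinyin(desc) == to_pinyin(keyword) and desc != keyword:
            return sentence.replace(keyword, desc)
    return sentence

out = replace_keyword("请播放嬉游记", ["请", "播放", "嬉游记"], ["电影", "西游记"])
```

Only the keyword is rewritten, so the verb phrase "请播放" survives unchanged, matching the targeted-replacement behavior described above.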
Fig. 7 schematically shows a schematic diagram of an alternative speech recognition result according to another embodiment of the application.
As shown in fig. 7, the speech recognition result includes, for example, a passage of text, and the textual description of each control element includes a plurality of sub-portions. A keyword in the speech recognition result is determined, and the pinyin of the keyword is matched against the pinyin of each of the plurality of sub-portions of each control element. A control element having at least one sub-portion whose pinyin matches the pinyin of the keyword of the speech recognition result is then determined as the target control element. Next, the keyword of the speech recognition result is compared with the at least one sub-portion of the target control element, and if the two do not match, the keyword of the speech recognition result is replaced with the at least one sub-portion of the textual description of the target control element for display.
For example, the plurality of control elements displayed on the page include text and pictures. The plurality of control elements 701, 702, 703, 704, 705 are, respectively, the text "movie", the text "TV series", the text "documentary", a picture of a certain movie (e.g., the movie "Hero"), and a picture of a certain TV series (e.g., the series "Journey to the West", 西游记). For the text-type control elements 701, 702, 703, the textual description of each control element is the control element itself. For the picture-type control elements 704, 705, the textual description of each of the control elements 704, 705 includes, for example, a plurality of sub-portions.
Taking the control element 705 as an example, the textual description of the control element 705 includes a plurality of sub-portions 705A, 705B, 705C, which are, for example, "西游记" ("Journey to the West"), "Actor XXX", and "Episode 25", respectively.
The speech recognition result 706 is, for example, "请播放嬉游记" ("please play Xiyouji", where "嬉游记" is a misrecognized homophone of the series title "西游记"), i.e., a passage of text. For example, the part of speech of each word in the speech recognition result 706 is determined, and the noun in the speech recognition result 706 is taken as the keyword 706A; e.g., "嬉游记" in the speech recognition result 706 is taken as the keyword 706A.
Next, the pinyin of the keyword 706A (i.e., "嬉游记") in the speech recognition result 706 is matched against the pinyin of each of the plurality of sub-portions of each control element. For example, the pinyin of the keyword "嬉游记" is matched against the pinyin of each sub-portion of the control element 705 to obtain a matching result: the pinyin "xiyouji" of the sub-portion "西游记" of the control element 705 matches the pinyin "xiyouji" of the keyword "嬉游记", so the control element 705 is determined as the target control element.
Next, the keyword "嬉游记" in the speech recognition result 706 is compared with the sub-portion "西游记" of the textual description of the target control element. Because the two are inconsistent, the keyword "嬉游记" in the speech recognition result 706 is replaced with the sub-portion "西游记" to obtain a replaced speech recognition result 706', which is, for example, "请播放西游记" ("please play Journey to the West"), in which the noun keyword 706A' is "西游记".
It will be appreciated that, when the speech recognition result includes a passage of text, the text other than the keyword is generally common wording, so its recognition accuracy is usually high. The target control element can therefore be determined by determining the keyword in the speech recognition result and matching the pinyin of the keyword against the pinyin of the plurality of sub-portions of each control element. The matching sub-portion of the target control element is then compared with the keyword, and if the two do not match, the keyword is replaced with that sub-portion. That is, when the speech recognition result is replaced, the keyword in the speech recognition result is replaced with the sub-portion of the textual description of the target control element in a targeted manner. This improves the efficiency of matching and replacement, reduces the amount of computation required, and keeps the replaced speech recognition result close to the original one, so that the replaced result better meets the user's expectation.
Fig. 8 schematically shows a flow chart of a page operation method based on speech recognition according to another embodiment of the application.
As shown in fig. 8, the page operation method 800 based on voice recognition according to the embodiment of the present application may include operations S810 to S880, for example, wherein operation S840 includes operations S841 to S843, for example.
In operation S810, the received voice is recognized to obtain a voice recognition result.
In operation S820, a textual description in Chinese form set for each control element is acquired.
In operation S830, a target control element whose pinyin of the textual description matches the pinyin of the speech recognition result is determined from the at least one control element.
In operation S840, a control operation associated with the target control element is performed and the voice recognition result is displayed, and in the case where the voice recognition result does not match the text description of the target control element, the voice recognition result is replaced with the text description of the target control element to be displayed. Wherein operation S840 includes, for example, operations S841 to S843.
In operation S841, a control operation associated with the target control element is performed and a voice recognition result is displayed.
In operation S842, it is determined whether the speech recognition result matches the textual description of the target control element. If they do not match, operation S843 is performed; if they match, the flow may end.
In operation S843, a text description in which the voice recognition result is replaced with the target control element is displayed.
After performing operation S820 and before performing operation S830, operation S850 and operation S860 may be performed.
In operation S850, the voice recognition result is converted into pinyin, and the text description of each control element is converted into pinyin.
In operation S860, the pinyin of the speech recognition result and the pinyin of the word description of each control element are matched to determine whether the pinyin of the speech recognition result and the pinyin of the word description of each control element match. If there is a match, operation S830 is performed, and if there is no match, operation S870 is performed.
In operation S870, in case that the pinyin of the speech recognition result does not match the pinyin of the text description of each of the at least one control element, the semantic analysis is performed on the speech recognition result to obtain a semantic analysis result.
In operation S880, an application targeted by the semantic analysis result is started based on the semantic analysis result.
For example, when the speech recognition result is "please start navigation" and its pinyin does not match the pinyin of the textual description of any control element, semantic analysis may be performed on the speech recognition result to obtain a semantic analysis result. The semantic analysis result characterizes that the user wants to start the map application for navigation, and the map application can then be started based on the semantic analysis result.
It can be understood that, when the pinyin of the speech recognition result does not match the pinyin of the textual description of any control element, the embodiment of the present application performs semantic analysis on the speech recognition result to obtain a semantic analysis result characterizing the user's voice intention, and starts the application targeted by the semantic analysis result. The user's voice can thus be responded to in different ways, meeting the user's needs and improving the user experience.
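Operations S810 to S880 can be strung together as the following end-to-end sketch. Every callable passed in (pinyin converter, control executor, UI display, semantic analyzer, application launcher) is a stub for a component the flow assumes but the embodiment does not define; all names are illustrative.

```python
# End-to-end sketch of the fig. 8 flow (assumed helper interfaces).
def handle_voice(recognized, controls, to_pinyin, execute, display,
                 analyze, launch):
    """controls maps each textual description to its control operation."""
    display(recognized)                          # S841: show raw result
    target = to_pinyin(recognized)
    for description, action in controls.items():
        if to_pinyin(description) == target:     # S830/S860: pinyin match
            execute(action)                      # S840: run control operation
            if recognized != description:        # S842: homophone mismatch
                display(description)             # S843: overwrite display
            return "control"
    intent = analyze(recognized)                 # S870: semantic analysis
    launch(intent)                               # S880: start targeted app
    return "semantic"
```

On a pinyin match the control operation runs and the display is corrected; otherwise the flow falls through to the semantic-analysis branch, mirroring the S860 decision described above.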
Fig. 9 schematically shows a schematic diagram of an alternative speech recognition result according to another embodiment of the application.
As shown in fig. 9, after recognizing the received voice to obtain a voice recognition result, the voice recognition result obtained by the recognition may be directly displayed on the page. When it is determined that the speech recognition result does not match the text description of the target control element, the speech recognition result may be replaced with the text description of the target control element, and then the replaced speech recognition result may be displayed on the page to cover the original speech recognition result. Specifically, the replaced keywords in the speech recognition result obtained after the replacement may be displayed on the page to cover the keywords in the original speech recognition result.
Taking the speech recognition result 901 "请播放嬉游记" ("please play Xiyouji", a misrecognition) as an example, "请播放嬉游记" is displayed on the page. The keyword 901A (i.e., "嬉游记") in the speech recognition result 901 does not match the textual description "西游记" of the target control element, so the keyword "嬉游记" in the speech recognition result 901 is replaced with the textual description "西游记", the replaced speech recognition result 901' is "请播放西游记" ("please play Journey to the West"), and the replaced result is displayed on the page. The part "请播放" ("please play") of the speech recognition result 901 originally displayed on the page may remain displayed together with the keyword 901A' ("西游记") of the replaced speech recognition result 901'; that is, "请播放" is displayed throughout without being replaced, so that only the keyword is replaced for display in a targeted manner.
In an embodiment of the application, a page is displayed on a touch screen of an electronic device. The user may touch a control element on the page, and the electronic device may perform a control operation associated with the touched control element in response to touching the control element on the page on the touch screen.
Fig. 10 schematically shows a page schematic according to an embodiment of the application.
As shown in fig. 10, the page of the embodiment of the present application may include a web page, the control element on the page includes at least one of a website 1001, a picture 1002, an icon 1003, and a text 1004, and the control operation associated with the target control element includes accessing a link address associated with at least one of the website 1001, the picture 1002, the icon 1003, and the text 1004.
For example, when the user touches the website 1001, a control operation associated with the website 1001 is performed, for example, jumping to the web page corresponding to the website 1001. When the user touches the picture 1002 (the picture 1002 being, for example, a picture corresponding to the TV series "Journey to the West"), a control operation associated with the picture 1002 is performed, for example, jumping to a web page showing information about the series "Journey to the West". When the user touches the icon 1003 (the icon 1003 being, for example, a play icon), a control operation associated with the icon 1003 is performed, for example, playing the series "Journey to the West". When the user touches the text 1004, a control operation associated with the text 1004 is performed, for example, playing the TV series "Dream of the Red Chamber" or jumping to a web page showing information about that series.
Fig. 11 schematically shows a page view according to another embodiment of the application.
As shown in fig. 11, the page of the embodiment of the present application includes an interface of an application program, the control element includes at least one of a picture 1101, an icon 1102, and a text 1103, and the control operation associated with the target control element includes at least one of playing video, playing audio, and displaying a list.
When the user touches the picture 1101, a control operation associated with the picture 1101 is performed, for example, playing a music video. When the user touches the icon 1102, a control operation associated with the icon 1102 is performed, for example, playing audio, such as the song "XXX". When the user touches the text 1103, a control operation associated with the text 1103 is performed, for example, displaying a list, such as a singer list.
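The association between control elements and their control operations can be represented by a simple dispatch table, sketched below; the element identifiers, operation names, and URL are hypothetical examples, not drawn from the embodiment.

```python
# Dispatch-table sketch mapping control elements to control operations.
CONTROL_ACTIONS = {
    "website_1001": ("visit_link", "https://example.com/series"),
    "picture_1101": ("play_video", "music video"),
    "icon_1102":    ("play_audio", "song XXX"),
    "text_1103":    ("show_list", "singer list"),
}

def perform_control(element_id):
    """Look up the (operation, argument) pair for a touched or voice-selected
    element; fall back to a no-op for unknown elements."""
    return CONTROL_ACTIONS.get(element_id, ("noop", None))
```

The same table serves both input paths: a touch event and a pinyin-matched voice command both resolve to an element id and then dispatch through it.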
It will be appreciated that the above-described pages are merely examples provided to facilitate understanding of the technical solutions of the embodiments of the present application, and the pages of the embodiments of the present application include, but are not limited to, the above-described pages, and the pages of the embodiments of the present application may include any form of page.
The voice-recognition-based page operation method of the embodiment of the present application may be executed through cloud computing, for example, in the cloud. Specifically, the processes of recognizing the user's voice to obtain a speech recognition result, determining the target control element through pinyin conversion and pinyin comparison, matching the speech recognition result with the textual description of the target control element, performing semantic analysis on the speech recognition result, and the like may be performed in the cloud to obtain an execution result. The cloud may send the execution result to the electronic device and store it locally on the electronic device, and the electronic device replaces the speech recognition result with the textual description of the target control element for display.
Fig. 12 schematically shows a block diagram of a page operating apparatus based on speech recognition according to an embodiment of the application.
As shown in fig. 12, the page operating apparatus 1200 based on voice recognition according to the embodiment of the present application includes, for example, a recognition module 1210, an acquisition module 1220, a determination module 1230, and a display module 1240.
The recognition module 1210 may be configured to recognize received speech to obtain a speech recognition result. According to an embodiment of the present application, the identification module 1210 may perform, for example, the operation S210 described above with reference to fig. 2, which is not described herein.
The obtaining module 1220 is configured to obtain a text description in chinese form set for each control element. According to an embodiment of the present application, the obtaining module 1220 may, for example, perform the operation S220 described above with reference to fig. 2, which is not described herein.
The determining module 1230 may be configured to determine a target control element from the at least one control element, where the pinyin of the textual description of the target control element matches the pinyin of the speech recognition result. The determining module 1230 may perform, for example, the operation S230 described above with reference to fig. 2 according to an embodiment of the present application, which is not described herein.
The display module 1240 may perform a control operation associated with the target control element and display the speech recognition result, wherein in the event that the speech recognition result does not match the textual description of the target control element, the speech recognition result is replaced with the textual description of the target control element for display. According to an embodiment of the present application, the display module 1240 may, for example, perform the operation S240 described above with reference to fig. 2, which is not described herein.
According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium and a computer program product. The computer program product comprises a computer program which, when executed by a processor, can implement the method of any of the embodiments described above.
FIG. 13 is a block diagram of an electronic device for implementing a speech recognition based page operating method in accordance with an embodiment of the present application.
Fig. 13 shows a block diagram of an electronic device 1300 for the voice-recognition-based page operation method according to an embodiment of the present application. The electronic device 1300 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 13, the electronic device 1300 includes: one or more processors 1310, a memory 1320, and interfaces for connecting components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device 1300, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices 1300 may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 1310 is shown in fig. 13 as an example.
Memory 1320 is a non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the voice recognition based page operating method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the voice recognition-based page operation method provided by the present application.
The memory 1320 is a non-transitory computer readable storage medium, and may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the recognition module 1210, the acquisition module 1220, the determination module 1230, and the display module 1240 shown in fig. 12) corresponding to the voice recognition-based page operation method in the embodiment of the present application. The processor 1310 performs various functional applications of the server and data processing, i.e., implements the speech recognition based page operating method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 1320.
Memory 1320 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of the electronic device 1300 operating on a voice recognition based page, and the like. In addition, memory 1320 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1320 may optionally include memory located remotely from processor 1310, which may be connected to electronic device 1300 operating on speech recognition based pages via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device 1300 for the voice-recognition-based page operation method may further include: an input device 1330 and an output device 1340. The processor 1310, the memory 1320, the input device 1330, and the output device 1340 may be connected by a bus or otherwise; fig. 13 takes a bus connection as an example.
The input device 1330 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device 1300, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or the like. The output device 1340 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (15)

1. A method of page operation based on speech recognition, the page including at least one control element, the method comprising:
recognizing the received voice to obtain a voice recognition result, wherein the voice recognition result comprises characters;
acquiring a textual description in Chinese set for each control element, wherein the textual description of each control element of the at least one control element comprises a plurality of sub-parts;
matching the pinyin of the voice recognition result with the pinyin of the textual description of each control element, comprising: matching the pinyin of a keyword in the voice recognition result with the pinyin of each of the plurality of sub-parts of each control element;
determining a control element having pinyin that matches the pinyin of the voice recognition result as a target control element, comprising: determining, as the target control element, a control element of the at least one control element that comprises at least one sub-part whose pinyin matches the pinyin of the keyword of the voice recognition result;
performing a control operation associated with the target control element;
displaying the voice recognition result, wherein, when the keyword in the voice recognition result does not match the at least one sub-part of the textual description of the target control element in the text dimension, the voice recognition result is replaced with the textual description of the target control element for display, comprising: replacing the keyword in the voice recognition result with the at least one sub-part of the textual description of the target control element for display.
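The matching and display-replacement steps of claim 1 can be illustrated with a small sketch. Everything below is hypothetical: the per-character pinyin table stands in for a real converter (e.g. the pypinyin package), and the control names and utterance are invented for the example.

```python
# Hypothetical per-character pinyin table -- a stand-in for a full converter.
PINYIN = {"播": "bo", "放": "fang", "音": "yin", "乐": "yue",
          "月": "yue", "首": "shou", "页": "ye"}

def to_pinyin(text):
    """Convert a Chinese string into a list of pinyin syllables."""
    return [PINYIN[ch] for ch in text if ch in PINYIN]

def find_target(keyword, controls):
    """Match the keyword's pinyin against the pinyin of every sub-part of
    every control element's textual description (claim 1). Returns a
    (control id, matched sub-part) pair, or None when nothing matches."""
    key_py = to_pinyin(keyword)
    if not key_py:  # nothing convertible, so no pinyin match is possible
        return None
    for control, sub_parts in controls.items():
        for part in sub_parts:
            if to_pinyin(part) == key_py:
                return control, part
    return None

def display_text(asr_text, keyword, matched_part):
    """Claim 1's display step: when the matched sub-part differs from the
    keyword at the text level, show the sub-part in its place."""
    return asr_text.replace(keyword, matched_part)

# Two control elements; each textual description is split into sub-parts.
controls = {"btn_music": ["播放", "音乐"], "btn_home": ["首页"]}

# Suppose the recognizer wrote the homophone 音月 instead of 音乐: the
# characters differ, but the pinyin ("yin yue") still matches.
hit = find_target("音月", controls)
print(hit)                                      # ('btn_music', '音乐')
print(display_text("打开音月", "音月", hit[1]))   # 打开音乐
```

Matching on pinyin rather than characters is what makes the method robust to homophone errors in the recognition result, and the display step then corrects the on-screen text to the control's own description.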
2. The method of claim 1, further comprising:
converting the voice recognition result into pinyin; and
converting the textual description of each control element into pinyin.
3. The method of claim 2, wherein,
the matching of the pinyin of the voice recognition result with the pinyin of the textual description of each control element comprises: matching the pinyin of the voice recognition result with the pinyin of each of the plurality of sub-parts of each control element; and
the determining of a control element having pinyin matching the pinyin of the voice recognition result as the target control element comprises: determining, as the target control element, a control element at least one sub-part of which has pinyin matching the pinyin of the voice recognition result.
4. The method of claim 3, wherein the replacing of the voice recognition result with the textual description of the target control element for display comprises:
replacing the voice recognition result with the at least one sub-part of the textual description of the target control element for display.
5. The method according to claim 2, wherein:
the matching of the pinyin of the voice recognition result with the pinyin of the textual description of each control element comprises: determining a keyword in the voice recognition result, and matching the pinyin of the keyword with the pinyin of the textual description of each control element; and
the determining of a control element having pinyin matching the pinyin of the voice recognition result as the target control element comprises: determining, as the target control element, a control element having pinyin matching the pinyin of the keyword.
6. The method of claim 5, wherein the determining of the keyword in the voice recognition result comprises:
determining the part of speech of each word in the voice recognition result; and
taking a noun in the voice recognition result as the keyword.
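Claim 6's keyword step (tag each word's part of speech, keep the nouns) might look like the sketch below. The tiny part-of-speech lexicon is invented purely for illustration; a deployed system would rely on a real Chinese POS tagger (e.g. jieba's posseg module) rather than a lookup table.

```python
# Invented POS lexicon ("v" = verb, "u" = particle, "n" = noun);
# a real system would use an actual Chinese part-of-speech tagger.
POS = {"打开": "v", "播放": "v", "一下": "u", "音乐": "n", "首页": "n"}

def extract_keywords(words):
    """Determine each word's part of speech and keep the nouns (claim 6)."""
    return [w for w in words if POS.get(w) == "n"]

print(extract_keywords(["播放", "一下", "音乐"]))  # ['音乐']
```

Filtering to nouns discards carrier verbs like 播放 ("play") so that only the content word is matched against the controls on the page.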
7. The method of claim 5, wherein the replacing the speech recognition result with the textual description of the target control element for display comprises:
replacing the keyword in the voice recognition result with the textual description of the target control element for display.
8. The method of claim 1, further comprising:
performing, in a case where the pinyin of the voice recognition result does not match the pinyin of the textual description of any control element of the at least one control element, semantic analysis on the voice recognition result to obtain a semantic analysis result; and
starting an application program targeted by the semantic analysis result.
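The fallback of claim 8 (no pinyin match anywhere on the page, so semantically analyze the utterance and launch the application it targets) could be organized as below. The intent table, package names, and substring-based "semantic analysis" are placeholders for the example, not part of the patented method.

```python
# Placeholder mapping from a parsed intent word to an app package name;
# real semantic analysis would be far richer than substring tests.
APP_INTENTS = {"导航": "com.example.maps", "天气": "com.example.weather"}

def fallback_launch(asr_text):
    """Claim 8: when no control element matched by pinyin, analyze the
    utterance and return the package of the application it targets
    (None when no intent is recognized)."""
    for intent_word, package in APP_INTENTS.items():
        if intent_word in asr_text:
            return package
    return None

print(fallback_launch("帮我查一下天气"))  # com.example.weather
```

This two-tier design keeps in-page pinyin matching as the fast path and only pays the cost of semantic analysis when the utterance refers to nothing visible on the page.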
9. The method of any one of claims 1 to 8, wherein the page is displayed on a touch screen, the method further comprising:
performing, in response to a touch on a control element on the page via the touch screen, a control operation associated with the touched control element.
10. The method of any of claims 1-8, wherein the page comprises a web page, the control element comprises at least one of a web address, a picture, an icon, and text, and the control operation associated with the target control element comprises accessing a link address associated with at least one of a web address, a picture, an icon, and text.
11. The method of any of claims 1-8, wherein the page comprises an interface of an application, a control element comprises at least one of a picture, an icon, and text, and a control operation associated with the target control element comprises at least one of playing video, playing audio, and displaying a list.
12. A page operating device based on speech recognition, the page comprising at least one control element, the device comprising:
the recognition module is used for recognizing the received voice to obtain a voice recognition result, wherein the voice recognition result comprises characters;
an acquisition module for acquiring a textual description in Chinese set for each control element, wherein the textual description of each control element of the at least one control element comprises a plurality of sub-parts;
a matching module for matching the pinyin of the voice recognition result with the pinyin of the textual description of each control element, comprising: matching the pinyin of a keyword in the voice recognition result with the pinyin of each of the plurality of sub-parts of each control element;
a determining module for determining a control element having pinyin that matches the pinyin of the voice recognition result as a target control element, comprising: determining, as the target control element, a control element of the at least one control element that comprises at least one sub-part whose pinyin matches the pinyin of the keyword of the voice recognition result;
an execution module for executing a control operation associated with the target control element;
a display module for displaying the voice recognition result, wherein, when the keyword in the voice recognition result does not match the at least one sub-part of the textual description of the target control element in the text dimension, the voice recognition result is replaced with the textual description of the target control element for display, comprising: replacing the keyword in the voice recognition result with the at least one sub-part of the textual description of the target control element for display.
13. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 11.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 11.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 11.
CN202011028860.8A 2020-09-25 2020-09-25 Page operation method, device, equipment and medium based on voice recognition Active CN112114926B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011028860.8A CN112114926B (en) 2020-09-25 2020-09-25 Page operation method, device, equipment and medium based on voice recognition
JP2021046331A JP7242737B2 (en) 2020-09-25 2021-03-19 Page operation method, device, equipment, medium and program by voice recognition
KR1020210040285A KR20210042853A (en) 2020-09-25 2021-03-29 method for operating page based on voice recognition, apparatus, electronic equipment, computer readable storage medium and computer program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011028860.8A CN112114926B (en) 2020-09-25 2020-09-25 Page operation method, device, equipment and medium based on voice recognition

Publications (2)

Publication Number Publication Date
CN112114926A CN112114926A (en) 2020-12-22
CN112114926B true CN112114926B (en) 2024-08-09

Family

ID=73797022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011028860.8A Active CN112114926B (en) 2020-09-25 2020-09-25 Page operation method, device, equipment and medium based on voice recognition

Country Status (3)

Country Link
JP (1) JP7242737B2 (en)
KR (1) KR20210042853A (en)
CN (1) CN112114926B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284499B (en) * 2021-05-24 2024-07-12 亿咖通(湖北)技术有限公司 Voice instruction recognition method and electronic equipment
CN113674743A (en) * 2021-08-20 2021-11-19 云知声(上海)智能科技有限公司 ASR result replacement processing device and processing method used in natural language processing
CN113723082B (en) * 2021-08-30 2024-08-02 支付宝(杭州)信息技术有限公司 Method and device for detecting Chinese pinyin from text
CN114049890A (en) * 2021-11-03 2022-02-15 杭州逗酷软件科技有限公司 Voice control method and device and electronic equipment
CN113923295B (en) * 2021-11-17 2023-04-07 Oppo广东移动通信有限公司 Voice control method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611468A (en) * 2020-04-29 2020-09-01 百度在线网络技术(北京)有限公司 Page interaction method and device and electronic equipment
CN111696557A (en) * 2020-06-23 2020-09-22 深圳壹账通智能科技有限公司 Method, device and equipment for calibrating voice recognition result and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004226881A (en) * 2003-01-27 2004-08-12 Casio Comput Co Ltd Conversation system and conversation processing program
JP2009128675A (en) * 2007-11-26 2009-06-11 Toshiba Corp Device, method and program, for recognizing speech
CN104166462B (en) * 2013-05-17 2017-07-21 北京搜狗科技发展有限公司 The input method and system of a kind of word
US9448991B2 (en) * 2014-03-18 2016-09-20 Bayerische Motoren Werke Aktiengesellschaft Method for providing context-based correction of voice recognition results
CN107507615A (en) * 2017-08-29 2017-12-22 百度在线网络技术(北京)有限公司 Interface intelligent interaction control method, device, system and storage medium
CN107919129A (en) * 2017-11-15 2018-04-17 百度在线网络技术(北京)有限公司 Method and apparatus for controlling the page
CN109949814A (en) * 2017-12-20 2019-06-28 北京京东尚科信息技术有限公司 Audio recognition method, system, computer system and computer readable storage medium
CN109145276A (en) * 2018-08-14 2019-01-04 杭州智语网络科技有限公司 A kind of text correction method after speech-to-text based on phonetic
JP2020056879A (en) * 2018-10-01 2020-04-09 株式会社Fam−Time Information providing system and method
CN111383631B (en) * 2018-12-11 2024-01-23 阿里巴巴集团控股有限公司 Voice interaction method, device and system
US11017771B2 (en) * 2019-01-18 2021-05-25 Adobe Inc. Voice command matching during testing of voice-assisted application prototypes for languages with non-phonetic alphabets
CN111540353B (en) * 2020-04-16 2022-11-15 重庆农村商业银行股份有限公司 Semantic understanding method, device, equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611468A (en) * 2020-04-29 2020-09-01 百度在线网络技术(北京)有限公司 Page interaction method and device and electronic equipment
CN111696557A (en) * 2020-06-23 2020-09-22 深圳壹账通智能科技有限公司 Method, device and equipment for calibrating voice recognition result and storage medium

Also Published As

Publication number Publication date
JP7242737B2 (en) 2023-03-20
JP2021099887A (en) 2021-07-01
CN112114926A (en) 2020-12-22
KR20210042853A (en) 2021-04-20

Similar Documents

Publication Publication Date Title
CN112114926B (en) Page operation method, device, equipment and medium based on voice recognition
US8515984B2 (en) Extensible search term suggestion engine
RU2632144C1 (en) Computer method for creating content recommendation interface
CN111221984A (en) Multimodal content processing method, device, equipment and storage medium
US11250066B2 (en) Method for processing information, electronic device and storage medium
US9886958B2 (en) Language and domain independent model based approach for on-screen item selection
JP7228615B2 (en) Movie/TV drama content search method and device
EP3832492A1 (en) Method and apparatus for recommending voice packet, electronic device, and storage medium
US20170285932A1 (en) Ink Input for Browser Navigation
US20210149558A1 (en) Method and apparatus for controlling terminal device, and non-transitory computer-readle storage medium
CN111309200B (en) Method, device, equipment and storage medium for determining extended reading content
JP7264957B2 (en) Voice interaction method, device, electronic device, computer readable storage medium and computer program
CN112581946B (en) Voice control method, voice control device, electronic equipment and readable storage medium
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN113516491B (en) Popularization information display method and device, electronic equipment and storage medium
CN110532404B (en) Source multimedia determining method, device, equipment and storage medium
CN111309872A (en) Search processing method, device and equipment
CN112383825B (en) Video recommendation method and device, electronic equipment and medium
CN109545223B (en) Voice recognition method applied to user terminal and terminal equipment
CN111931524B (en) Method, apparatus, device and storage medium for outputting information
US20140181672A1 (en) Information processing method and electronic apparatus
US20220328076A1 (en) Method and apparatus of playing video, electronic device, and storage medium
US10546029B2 (en) Method and system of recursive search process of selectable web-page elements of composite web page elements with an annotating proxy server
CN113536037A (en) Video-based information query method, device, equipment and storage medium
US20210337278A1 (en) Playback control method and apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211011

Address after: 100176 101, floor 1, building 1, yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd.

Address before: 2 / F, baidu building, No. 10, Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant