CN115101069A - Voice control method, device, equipment, storage medium and program product

Voice control method, device, equipment, storage medium and program product

Info

Publication number
CN115101069A
Authority
CN
China
Prior art keywords
keyword
voice
page
image recognition
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210880520.0A
Other languages
Chinese (zh)
Inventor
华鲸州
欧阳能钧
邓天坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Apollo Zhilian Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Zhilian Beijing Technology Co Ltd filed Critical Apollo Zhilian Beijing Technology Co Ltd
Priority to CN202210880520.0A
Publication of CN115101069A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19013 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/30 Character recognition based on the type of data
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a voice control method, apparatus, device, storage medium, and program product, which relate to the field of artificial intelligence, in particular to natural language processing, speech recognition, image recognition, Internet of Vehicles, and intelligent cabin technologies, and can be applied in voice control scenarios. One embodiment of the method comprises: in response to receiving a voice, acquiring a screenshot of the current page; performing voice recognition on the voice to obtain a voice recognition result; determining an image recognition result of the page using the screenshot, wherein the image recognition result comprises keywords and positions, the keywords being the keywords in the page and the positions being the positions of those keywords in the page; searching the image recognition result for a keyword matching the voice recognition result as a target keyword; and performing a simulation operation at the position of the target keyword in the page to generate a response result for the voice. The embodiment realizes a voice control method based on image recognition.

Description

Voice control method, device, equipment, storage medium and program product
Technical Field
Embodiments of the present disclosure relate to the field of artificial intelligence, in particular to natural language processing, voice recognition, image recognition, Internet of Vehicles, and intelligent cabin technologies, and can be applied in voice control scenarios.
Background
With the development of the operating systems of user terminals, the ways in which users control user terminals have gradually evolved from the earliest command-line operation, to visual operation with a mouse and similar devices, to touch-screen operation and the like.
In addition, in some application scenarios (such as driving, cooking, and the like) where it is inconvenient for the user to perform mouse/keyboard-based visual operation or touch screen operation, if the user terminal supports voice control, the user can also perform operation control on the user terminal through voice.
Disclosure of Invention
Embodiments of the present disclosure provide a voice control method, apparatus, device, storage medium, and program product.
In a first aspect, an embodiment of the present disclosure provides a voice control method, including: in response to receiving a voice, acquiring a screenshot of the current page; performing voice recognition on the voice to obtain a voice recognition result; determining an image recognition result of the page using the screenshot, wherein the image recognition result comprises keywords and positions, the keywords being the keywords in the page and the positions being the positions of those keywords in the page; searching the image recognition result for a keyword matching the voice recognition result as a target keyword; and performing a simulation operation at the position of the target keyword in the page to generate a response result for the voice.
In a second aspect, an embodiment of the present disclosure provides a voice control apparatus, including: a screenshot obtaining module configured to obtain a screenshot of the current page in response to receiving a voice; a voice recognition module configured to perform voice recognition on the voice to obtain a voice recognition result; an image recognition module configured to determine an image recognition result of the page using the screenshot, wherein the image recognition result comprises keywords and positions, the keywords being the keywords in the page and the positions being the positions of those keywords in the page; a matching module configured to search the image recognition result for a keyword matching the voice recognition result as a target keyword; and a response module configured to perform a simulation operation at the position of the target keyword in the page and generate a response result for the voice.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any implementation manner of the first aspect.
In a fourth aspect, the disclosed embodiments propose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.
In a fifth aspect, the present disclosure provides a computer program product including a computer program, which when executed by a processor implements the method as described in any implementation manner of the first aspect.
According to the voice control method, apparatus, device, storage medium, and program product provided by the embodiments of the present disclosure, a screenshot of the current page is obtained when a voice is received; the keywords in the current page and their positions are determined using the screenshot; the keyword in the page that matches the voice recognition result corresponding to the voice is determined as the target keyword; and a simulation operation is performed at the position of the target keyword in the page to obtain a response result for the voice. A voice control method based on image recognition is thereby realized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram in which the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a voice control method of the present disclosure;
FIG. 3 is a schematic flow diagram of image recognition;
FIG. 4 is a schematic diagram of an application scenario of a voice control method of an embodiment of the present disclosure;
FIG. 5 is a flow chart of yet another embodiment of a voice control method of the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of a voice control apparatus of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the voice control method or voice control apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 may interact with the server 105 via the network 104 to receive or transmit various information (e.g., voice, images, etc.). The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices equipped with voice collecting devices. When they are software, they may be installed in the electronic devices described above, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the voice control method provided by the embodiment of the present disclosure is generally executed by the terminal devices 101, 102, and 103, and accordingly, the voice control apparatus is generally disposed in the terminal devices 101, 102, and 103. In some cases, exemplary system architecture 100 may not include network 104 and server 105.
It should be noted that, in some cases, the server 105 may obtain the voice collected by the terminal devices 101, 102, 103 through their voice acquisition devices, together with the screenshot of the current page at the moment the voice was collected. The server may then perform voice recognition on the voice to obtain a voice recognition result, determine an image recognition result of the current page using the screenshot, search the image recognition result for a keyword matching the voice recognition result, and perform a simulation operation at the position of the found keyword in the page to generate a response result for the voice, which may then be returned to the terminal devices 101, 102, 103. In this case, the voice control method provided by the embodiments of the present disclosure may be executed by the server 105, and accordingly, the voice control apparatus may be provided in the server 105.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a voice control method according to the present disclosure is shown. The voice control method comprises the following steps:
Step 201, in response to receiving the voice, acquiring a screenshot of the current page.
In this step, the execution subject may receive the voice using various voice collecting devices (such as a microphone), or may obtain the voice from other storage devices, a third-party data platform, or the like. The voice may be of various types; for example, it may be a user voice sent by the user according to the actual application requirement, in which case the user voice serves as a voice control instruction expressing the user's need.
The current page may refer to the page presented by the terminal device used by the user (e.g., terminal devices 101, 102, 103 shown in fig. 1) at the moment the terminal device receives the voice. Thus, the screenshot of the current page is a screenshot of that page. The execution subject may obtain the screenshot in various ways; for example, it may be captured by a screenshot application installed in the terminal device, or by calling a screenshot function provided by the operating system.
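By way of an illustrative sketch only (and not as part of the claimed method), the screenshot step might look as follows in Python on a desktop system, using Pillow's ImageGrab; an in-vehicle or Android terminal would instead call the screenshot interface of its own platform, which is assumed here rather than shown.

    # Minimal sketch of step 201: capture the current page when speech arrives.
    # Pillow's ImageGrab covers desktop platforms; embedded or in-vehicle
    # systems would use their own platform screenshot API (an assumption).
    from PIL import ImageGrab

    def capture_current_page():
        """Grab the whole screen as a PIL image."""
        return ImageGrab.grab()

    if __name__ == "__main__":
        capture_current_page().save("current_page.png")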
Step 202, performing voice recognition on the voice to obtain a voice recognition result.
In this step, various speech recognition methods can be used to perform speech recognition on the voice to obtain a corresponding speech recognition result: for example, linguistics- and acoustics-based speech recognition methods, stochastic modeling methods, speech recognition methods based on artificial neural networks, probabilistic parsing, and the like.
The content included in the speech recognition result can be flexibly set according to the actual application requirements. For example, the speech recognition result may include text corresponding to the speech. For another example, the speech recognition result may include keywords included in text corresponding to the speech.
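As a hedged illustration of step 202, the open-source SpeechRecognition package could stand in for the recognizer; the patent does not prescribe any particular engine, and the language code and file name below are assumptions.

    # Illustrative only: obtain a speech recognition result with the
    # SpeechRecognition package. Any recognizer returning text would do.
    import speech_recognition as sr

    def recognize_speech(wav_path, language="zh-CN"):
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)  # read the whole file
        # Uses Google's free web API; raises sr.UnknownValueError on silence.
        return recognizer.recognize_google(audio, language=language)

    # text = recognize_speech("command.wav")  # e.g. "next page"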
Step 203, determining an image recognition result of the page by using the screenshot.
In this step, the image recognition result of the page may include keywords and positions: the keywords are the keywords in the page, and the positions are the positions in the page to which those keywords correspond.
Specifically, the keywords in the page may refer to keywords corresponding to the texts in the page. A text may consist of one or more words, so a text may be a word, a sentence, a paragraph, and so on. Each page may include one or more texts, and each text may correspond to one or more keywords. The keyword corresponding to a text may be the text itself, or a keyword extracted from the text.
The way the texts in a page are divided may be preset by a technician according to actual application requirements. For example, the texts may be divided according to the control areas of the page, so that the words in each control area form one text; a control area in a page refers to an area capable of responding to a user operation. For another example, according to the design of the page, the content in each text tag (e.g., the title tag <h1></h1>, the paragraph tag <p></p>, etc.) may be treated as one text. Different pages may have the same or different texts.
As an example, suppose a video call page includes four control areas. The first control area is used to switch between the front camera and the rear camera, and includes the text "flip". The second control area is used to control whether the background of the current picture is blurred, and includes the text "blurred background". The third control area is used to control whether to switch to a voice call, and includes the text "switch voice call". The fourth control area is used to control whether the current video call is hung up. In this case, the page may include the four texts contained in the four control areas, and each text may be used directly as a keyword.
The position of a keyword in the page can be represented using various position-marking methods; for example, it may be represented by coordinates in a preset coordinate system.
Generally, the screenshot of the current page records exactly the content of the current page, so the image recognition result of the screenshot can be regarded as the image recognition result of the current page. Specifically, various methods may be adopted to determine the image recognition result corresponding to the screenshot and thereby obtain the image recognition result of the current page. For example, image recognition results corresponding to screenshots of pages may be stored in advance, in which case the image recognition result corresponding to the screenshot of the current page may be looked up directly.
It should be noted that the execution order of step 202 and step 203 above may be set flexibly, and the two steps may also be executed in parallel.
Step 204, searching keywords matched with the voice recognition result in the image recognition result as target keywords.
In this step, the matching degree between the speech recognition result and each keyword in the image recognition result may be determined, and then the keyword with the largest matching degree may be selected as the keyword matched with the speech recognition result. The matching degree between the voice recognition result and the keyword can be determined by various methods. For example, a text similarity or a feature similarity or the like of the text included in the speech recognition result and the keyword may be determined as a matching degree between the two.
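As a minimal sketch of step 204, plain string similarity from the Python standard library can serve as the matching degree; difflib is only one of the many matching measures mentioned above, and the sample keywords and coordinates are invented for illustration.

    # Sketch of step 204: pick the on-screen keyword best matching the speech.
    from difflib import SequenceMatcher

    def find_target_keyword(speech_text, image_result):
        """image_result: list of (keyword, (x, y)) pairs from image recognition."""
        def score(pair):
            return SequenceMatcher(None, speech_text, pair[0]).ratio()
        keyword, position = max(image_result, key=score)
        return keyword, position

    result = [("previous page", (80, 1800)), ("next page", (1000, 1800))]
    print(find_target_keyword("next page", result))  # ('next page', (1000, 1800))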
Step 205, performing simulation operation at the position of the target keyword in the page to generate a voice response result.
In this step, a simulation operation may be performed at the position of the target keyword in the page to obtain the response result for the voice. The simulation operation may correspond to various user operations, such as a click operation or a slide operation, and the response result refers to the operation result produced by the simulation operation. The simulation operation can be determined in various ways; for example, a correspondence between keywords and simulation operations may be preset, in which case the simulation operation corresponding to the target keyword can be looked up in that preset correspondence.
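A sketch of step 205 under the assumption of a desktop environment might inject the input event with pyautogui; a mobile or in-vehicle system would go through its own input framework instead.

    # Sketch of step 205: simulate a user operation at the matched position.
    import pyautogui

    def simulate_operation(position, operation="click"):
        x, y = position
        if operation == "click":
            pyautogui.click(x, y)
        elif operation == "slide_up":
            pyautogui.moveTo(x, y)
            pyautogui.dragRel(0, -300, duration=0.3)  # drag upwards

    # simulate_operation((1000, 1800))  # click "next page"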
Existing voice control methods usually need to label every operable control (such as a button) on every page in advance, so that when a voice is received, the label matching the voice is determined by matching the voice recognition result against the labels of the current page, and the corresponding operation is then performed on the operable control associated with the matched label to complete the response to the voice.
However, when the pages and the operable controls on them are numerous, pre-labeling the pages is costly and prone to problems such as missed labels, and whenever a page changes, the labels need to be readjusted to match the change. In addition, if the user adjusts the page display language, the language of the labels may no longer match the language displayed by the page, making the pre-labeling unusable. Moreover, in some cases the operable control associated with the matched label may be invisible in the current page, in which case a response result that does not correspond to the current page may be produced.
The method provided by the embodiments of the present disclosure is, by contrast, a voice control method based on image recognition. It only needs to perform image recognition on a screenshot of the page taken when the voice is received, requires no pre-labeling of the operable controls in the page, and is independent of the display language used by the page. Furthermore, because image recognition is performed on a screenshot of the current page, the obtained image recognition result is necessarily content displayed in the current page; matching against operable controls that are invisible in the current page is therefore avoided, preventing response results that do not correspond to the current page.
In some optional implementations of this embodiment, various image recognition methods (for example, artificial-intelligence-based image recognition methods) may be used to perform image recognition on the screenshot of the current page to obtain the image recognition result.
Optionally, the image recognition result may be obtained by performing image recognition on the screenshot of the current page through the following steps:
Step one, preprocessing the screenshot of the current page to obtain a corresponding binary image.
In this step, the screenshot of the current page may be preprocessed to obtain a binary image corresponding to the screenshot of the current page. The preprocessing may include binarization processing, among others.
The pre-processing may also include other various image processing depending on the actual application scenario. For example, the pre-processing may also include, but is not limited to, at least one of: graying processing, noise reduction processing, and the like.
Step two, determining a character area in the binary image and the position of the character area in the binary image, and extracting the characteristics of the character area.
In this step, a character area refers to an area where characters are located. Each character area may include a single character or two or more characters, which can be set flexibly according to the actual application scenario.
Specifically, each character region included in the binarized image and the position of each character region in the binarized image may first be determined using various image recognition methods, for example, OCR (Optical Character Recognition), deep-learning-based image recognition, and the like.
For each character region in the binarized image, various feature extraction methods, such as feature extraction based on neural network models, can be employed to extract the features of the region.
Step three, determining characters corresponding to the characteristics of the character area as keywords by using a preset characteristic library, and determining the keywords and the positions corresponding to the character area as image recognition results.
In this step, the preset feature library stores the correspondence between each character and its features. For each character region in the binarized image, the character corresponding to the features of the region can be looked up in the preset feature library as a keyword, and combined with the position of the region in the binarized image to obtain the image recognition result of that region. The keywords and positions corresponding to all character regions of the binarized image can then be taken as the image recognition result of the screenshot of the current page. The preset feature library may be developed and configured in advance by a technician.
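As an illustration, the preset feature library can be pictured as a mapping from characters to reference feature vectors, with recognition reduced to a nearest-neighbour lookup; the characters and vectors below are made-up placeholders, not values from the disclosure.

    # Sketch of the preset feature library: nearest-neighbour character lookup.
    import numpy as np

    FEATURE_LIBRARY = {                      # character -> reference features
        "上": np.array([0.1, 0.4, 0.2, 0.3]),  # "up"
        "下": np.array([0.3, 0.2, 0.4, 0.1]),  # "down"
        "页": np.array([0.2, 0.2, 0.3, 0.3]),  # "page"
    }

    def lookup_character(feature):
        """Return the library character whose feature vector is closest."""
        return min(FEATURE_LIBRARY,
                   key=lambda ch: np.linalg.norm(FEATURE_LIBRARY[ch] - feature))

    print(lookup_character(np.array([0.28, 0.21, 0.38, 0.12])))  # "下"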
By way of example, reference may be made to fig. 3, fig. 3 being a schematic flow 300 of image recognition. The image recognition process 300 includes the following steps:
Step 301, performing graying processing on the screenshot of the current page to obtain a grayscale image.
In this step, the screenshot of the current page may be converted to grayscale using various graying methods. Because the color of the characters in the screenshot has essentially no influence on their content, the color information can be removed by graying to reduce the amount of computation in subsequent image recognition.
Step 302, performing binarization processing on the grayscale image to obtain a binarized image.
In this step, various existing binarization methods may be adopted to binarize the grayscale image. Because characters are essentially composed of strokes, and the thickness of the strokes has no influence on the content of the characters, binarization can separate the character information from the non-character information in the screenshot, so that the non-character information can be removed in subsequent processing, further reducing the amount of computation in image recognition.
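The two steps above can be sketched with OpenCV as follows; Otsu's automatic threshold is one reasonable choice of binarization method, assumed here since the text leaves the exact method open.

    # Sketch of steps 301-302: grayscale conversion, then Otsu binarization.
    import cv2

    def to_binary(screenshot_path):
        image = cv2.imread(screenshot_path)
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # step 301: drop colour
        # THRESH_BINARY_INV makes character strokes white (255) on black (0).
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        return binary

    # binary = to_binary("current_page.png")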
Step 303, performing noise reduction processing on the binary image to obtain a denoised binary image.
In this step, various existing image denoising methods can be adopted to denoise the binarized image. Because binarization usually leaves some noise points in the image, noise reduction improves the quality of the binarized image and prevents the noise points from degrading the subsequent recognition.
Since image noise generally comes from non-character information in the image, as an example, each connected region (e.g., a region with value 1) in the binarized image may be found by BFS (Breadth-First Search) or a similar method, and the average pixel area of the connected regions may then be computed. For each connected region, if its pixel area is far smaller than that average, the region can be judged to be image noise, and the values of its pixels can be adjusted to turn it into a non-character region (e.g., by setting the pixel values to 0), thereby denoising the binarized image.
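A sketch of this noise-reduction rule is given below; OpenCV's connected-component analysis stands in for the BFS search described above, and the factor of 0.1 for "far smaller than the average" is an assumed value.

    # Sketch of step 303: drop connected regions far smaller than the mean area.
    import cv2
    import numpy as np

    def denoise(binary, factor=0.1):
        n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
        areas = stats[1:, cv2.CC_STAT_AREA]      # label 0 is the background
        if len(areas) == 0:
            return binary
        cleaned = binary.copy()
        for label in range(1, n):
            if stats[label, cv2.CC_STAT_AREA] < factor * areas.mean():
                cleaned[labels == label] = 0     # turn noise into non-character
        return cleaned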
Step 304, determining a character area in the denoised binary image.
In this step, since character information and non-character information have already been separated in the denoised binarized image, the image may be scanned (e.g., along its length and width directions respectively), and the periodically appearing non-character information encountered during scanning may be regarded as the intervals between characters, so that each character region can be determined according to these intervals.
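A sketch of this scan along one direction is shown below, using a projection profile: columns containing no stroke pixels are treated as the gaps between characters; scanning rows works the same way.

    # Sketch of step 304: find character regions from a column projection.
    import numpy as np

    def segment_columns(binary):
        """Return (start, end) column spans containing stroke pixels."""
        profile = (binary > 0).sum(axis=0)       # stroke pixels per column
        spans, start = [], None
        for col, count in enumerate(profile):
            if count > 0 and start is None:
                start = col                      # entering a character region
            elif count == 0 and start is not None:
                spans.append((start, col))       # a gap closes the region
                start = None
        if start is not None:
            spans.append((start, len(profile)))
        return spans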
Step 305, for each character region, extracting the features of the character region.
In this step, the features of each character region may be extracted based on an elastic grid. For example, the width direction of the denoised binarized image may be divided into M equal parts and the length direction into N equal parts, yielding M × N grid cells, where M and N are positive integers that may or may not be equal. Then, for each character region, the pixel density of the region within each grid cell can be counted, and these densities form a feature vector representing the features of the character region. Elastic-grid feature extraction is strongly resistant to interference and compatible with different font types, which helps improve the subsequent recognition.
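A simplified sketch of this feature is given below with a uniform M × N grid; a true elastic grid would additionally place the grid lines so that each band carries a comparable share of stroke pixels, and M = N = 8 is an assumed size rather than one taken from the text.

    # Sketch of step 305: per-cell pixel densities of one character region.
    import numpy as np

    def grid_features(char_region, m=8, n=8):
        h, w = char_region.shape
        rows = np.array_split(np.arange(h), n)   # N parts along the length
        cols = np.array_split(np.arange(w), m)   # M parts along the width
        densities = [(char_region[np.ix_(r, c)] > 0).mean()  # density per cell
                     for r in rows for c in cols]
        return np.array(densities)               # the feature vector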
Step 306, determining characters corresponding to the characteristics of the character area as keywords by using a preset characteristic library, and determining the keywords and the positions corresponding to the character area as image recognition results.
Performing graying, binarization, noise reduction, and similar processing on the screenshot of the current page first, and then carrying out image recognition with feature-extraction methods such as the elastic grid, helps improve both the efficiency and the quality of image recognition. Moreover, performing image recognition on the screenshot in real time when the voice is received ensures that dynamic pages are recognized accurately.
In some optional implementations of this embodiment, the speech recognition result may include a user intention, and the user intention may include a keyword and a user operation. In this case, the user intention indicates the user operation that the user desires to perform and the keyword corresponding to the operation target of that user operation.
At this time, a keyword matching the keyword included in the user intention may be searched for in the image recognition result as the target keyword. In addition, in this case, the simulation operation may be used to simulate a user operation included in the user intention.
Since natural language processing is a mature technology, combining it to parse the user intention and perform the keyword matching helps ensure the accuracy of the voice control result.
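Purely as an illustration of the user intention described above, a rule-based parser could split an utterance into a user operation and a keyword; a production system would use a natural language understanding model, and the verb table below is an assumption.

    # Illustrative sketch: extract (operation, keyword) from recognized text.
    OPERATION_WORDS = {"click": "click", "open": "click", "slide": "slide"}

    def parse_intent(text):
        operation, keyword = "click", text       # default to a click
        for word, op in OPERATION_WORDS.items():
            if text.startswith(word):
                operation, keyword = op, text[len(word):].strip()
                break
        return {"operation": operation, "keyword": keyword}

    print(parse_intent("click next page"))
    # {'operation': 'click', 'keyword': 'next page'}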
With continued reference to fig. 4, fig. 4 is an illustrative application scenario 400 of the voice control method according to this embodiment. In the application scenario of fig. 4, a user speaks the voice 402 "next page" to the mobile phone 401 that the user is using. The mobile phone 401 may first obtain a screenshot 403 of the current page and perform image recognition on it to obtain an image recognition result 404, which includes the keywords in the screenshot 403 and the position of each keyword in the screenshot 403. Specifically, as shown in the figure, the keywords in the screenshot 403 include the article being browsed by the user, "previous page", and "next page", with corresponding coordinates "X1, Y1", "X2, Y2", and "X3, Y3", respectively. In addition, the mobile phone 401 may perform voice recognition on the voice 402 to obtain the keyword "next page" and the corresponding user operation "click", and may then determine by matching that the voice 402 corresponds to the keyword "next page" in the screenshot 403, so that the response to the voice 402 can be implemented by simulating a click on "next page" at "X3, Y3" in the current page.
With further reference to fig. 5, a flow 500 of yet another embodiment of a voice control method is shown. The flow 500 of the voice control method includes the following steps:
Step 501, in response to receiving the voice, acquiring a screenshot of the current page.
Step 502, performing voice recognition on the voice to obtain a voice recognition result.
Step 503, determining an image recognition result corresponding to the screenshot.
Step 504, searching keywords matched with the voice recognition result in the image recognition result as target keywords.
Step 505, acquiring a preset keyword corresponding to the position of the target keyword.
In this embodiment, corresponding preset keywords may be configured in advance for positions in each page of the terminal device used by the user. Specifically, positions may be selected according to the content actually presented by each page, and preset keywords set for them. The number of selected positions, the number of preset keywords corresponding to each position, and the content of the preset keywords can all be determined flexibly according to the actual application scenario.
For example, one or more preset keywords may be set for each page according to the function provided by the content in the page. At this time, any position in the page corresponds to the preset keyword, and the preset keyword may represent a function provided by content presented by the page. For another example, a corresponding preset keyword may be set for each operable position in each page, and the preset keyword may represent a function provided at each operable position.
The execution subject may look up, locally or in other storage devices, the preset keyword corresponding to the position of the target keyword in the page of the terminal device used by the user.
Step 506, determining the similarity between the target keyword and the preset keyword.
In this embodiment, various similarity calculation methods may be employed to determine the similarity between the target keyword and the preset keyword, for example, distance-based similarity (e.g., Euclidean distance, Manhattan distance) or similarity based on Pearson correlation.
Step 507, in response to determining that the similarity is greater than the preset similarity threshold, performing a simulation operation at the position of the target keyword in the page to generate a response result for the voice.
In this embodiment, the preset similarity threshold may be set in advance by a technician according to the actual application scenario. After the similarity between the target keyword and the preset keyword is obtained, whether it is greater than the preset similarity threshold can be determined. If the similarity is greater than the threshold, a simulation operation can be performed at the position of the target keyword in the page to generate the response result for the voice.
Step 508, in response to determining that the similarity is not greater than the preset similarity threshold, generating a prompt message.
In this embodiment, if the similarity is not greater than the preset similarity threshold, a prompt message may be generated, where the prompt message is used to prompt that the voice needs to be received again. The user can then resend the voice according to the prompt; the resent voice may be the same as or different from the previously sent voice.
As an example, keywords describing the function of each page may be preset as the preset keywords. In that case, if the determined target keyword bears little relation to the function of the corresponding page, it can be concluded that the voice was erroneous or that the voice recognition result may be wrong, so the user can be prompted in time, which helps improve the user experience of voice control.
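The verification flow of steps 505 to 508 can be sketched as follows; the similarity measure and the threshold of 0.6 are assumptions, since the disclosure leaves both to be chosen per application.

    # Sketch of steps 505-508: verify the target keyword before acting.
    from difflib import SequenceMatcher

    def respond(target_keyword, position, preset_keyword, threshold=0.6):
        similarity = SequenceMatcher(None, target_keyword, preset_keyword).ratio()
        if similarity > threshold:
            # step 507: act at the position (see the click sketch above)
            return {"action": "simulate", "position": position}
        # step 508: prompt the user to speak again
        return {"action": "prompt", "message": "Please repeat the command."}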
The execution process not specifically described in this embodiment may refer to the related description in the corresponding embodiment of fig. 2, and is not repeated herein.
In some cases (such as when the user is eating), the quality of the received voice may be poor, or the voice control instruction indicated by the voice may not correspond to the current page, causing the determined target keyword to be inaccurate and thereby affecting the accuracy of the subsequently generated response result. Under these conditions, preset keywords are configured for each page in advance; after the target keyword is determined based on image recognition, the similarity between the target keyword and the corresponding preset keyword is further computed, and whether the target keyword is used to produce the voice response result is decided based on that similarity. The determined target keyword can thus be verified against the preset keywords, further ensuring the accuracy of the generated response result.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a voice control apparatus, which corresponds to the method embodiment shown in fig. 2, and which can be specifically applied to various electronic devices.
As shown in fig. 6, the voice control apparatus 600 provided in this embodiment includes a screenshot obtaining module 601, a voice recognition module 602, an image recognition module 603, a matching module 604, and a response module 605. The screenshot obtaining module 601 is configured to obtain a screenshot of a current page in response to receiving a voice; the voice recognition module 602 is configured to perform voice recognition on the voice to obtain a voice recognition result; the image recognition module 603 is configured to determine an image recognition result of the page using the screenshot, wherein the image recognition result includes keywords and positions, the keywords being the keywords in the page and the positions being the positions of those keywords in the page; the matching module 604 is configured to search the image recognition result for a keyword matching the voice recognition result as a target keyword; and the response module 605 is configured to perform a simulation operation at the position of the target keyword in the page and generate a response result for the voice.
In this embodiment, in the voice control apparatus 600, the detailed processing of the screenshot obtaining module 601, the voice recognition module 602, the image recognition module 603, the matching module 604, and the response module 605 and the technical effects thereof may refer to the related descriptions of steps 201 to 205 in the corresponding embodiment of fig. 2, and are not repeated here.
In some optional implementations of this embodiment, the image recognition module 603 is further configured to perform image recognition on the screenshot to obtain the image recognition result.
In some optional implementations of this embodiment, the image recognition module 603 is further configured to: preprocess the screenshot to obtain a corresponding binarized image, wherein the preprocessing comprises binarization; determine a character area in the binarized image and the position of the character area in the binarized image, and extract the features of the character area; and determine the characters corresponding to the features of the character area as keywords by using a preset feature library, and determine the keywords and the positions corresponding to the character area as the image recognition result, wherein the preset feature library is used for storing the correspondence between the features of character areas and characters.
In some optional implementations of this embodiment, the voice recognition result includes a user intention, where the user intention includes a keyword and a user operation; the matching module 604 is further configured to search the image recognition result for a keyword matching the keyword included in the user intention as the target keyword; and the simulation operation is used for simulating the user operation included in the user intention.
In some optional implementations of this embodiment, the voice control apparatus 600 further includes: a preset keyword obtaining module (not shown) configured to obtain the preset keyword corresponding to the position of the target keyword; and a similarity determination module (not shown) configured to determine the similarity between the target keyword and the preset keyword. The response module 605 is further configured to: in response to determining that the similarity is greater than a preset similarity threshold, perform the simulation operation at the position of the target keyword in the page to generate the response result for the voice; and in response to determining that the similarity is not greater than the preset similarity threshold, generate a prompt message, where the prompt message is used to prompt that the voice needs to be received again.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A number of components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 executes the respective methods and processes described above, such as the voice control method. For example, in some embodiments, the voice control method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the speech control method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the voice control method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be performed in parallel or sequentially or in a different order, as long as the desired results of the technical solutions provided by this disclosure can be achieved, and are not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A method of voice control, comprising:
responding to the received voice, and acquiring a screenshot of the current page;
carrying out voice recognition on the voice to obtain a voice recognition result;
determining an image recognition result of the page by using the screenshot, wherein the image recognition result comprises a keyword and a position, the keyword comprises the keyword in the page, and the position comprises the position of the keyword in the page;
searching a keyword matched with the voice recognition result in the image recognition result as a target keyword;
and performing simulation operation at the position of the target keyword in the page to generate a response result of the voice.
2. The method of claim 1, wherein said determining an image recognition result of the page using the screenshot comprises:
and carrying out image recognition on the screenshot to obtain an image recognition result.
3. The method of claim 2, wherein the image recognition of the screenshot to obtain an image recognition result comprises:
preprocessing the screenshot to obtain a corresponding binarized image, wherein the preprocessing comprises binarization;
determining a character area in a binary image and the position of the character area in the binary image, and extracting the characteristics of the character area;
and determining characters corresponding to the characteristics of the character area as keywords by using a preset characteristic library, and determining the keywords and the positions corresponding to the character area as image recognition results, wherein the preset characteristic library is used for storing the corresponding relation between the characteristics of the character area and the characters.
4. The method of claim 1, wherein the speech recognition result comprises a user intent, wherein the user intent comprises a keyword and a user operation; and
the step of searching the keywords matched with the voice recognition result in the image recognition result as target keywords comprises the following steps:
searching keywords matched with the keywords included in the user intention in the image recognition result to serve as target keywords; and
the simulation operation is used for simulating the user operation included in the user intention.
5. The method according to one of claims 1-4, wherein the method further comprises:
acquiring a preset keyword corresponding to the position of the target keyword;
determining the similarity between the target keyword and the preset keyword; and
the simulation operation is carried out at the position of the target keyword in the page to generate the response result of the voice, and the method comprises the following steps:
in response to the fact that the similarity is larger than a preset similarity threshold value, performing simulation operation at the position of the target keyword in the page to generate a response result of the voice; and
the method further comprises the following steps:
and generating prompt information in response to the fact that the similarity is not larger than a preset similarity threshold, wherein the prompt information is used for prompting to receive the voice again.
6. A voice control apparatus comprising:
a screenshot obtaining module configured to obtain a screenshot of a current page in response to receiving the voice;
the voice recognition module is configured to perform voice recognition on the voice to obtain a voice recognition result;
an image recognition module configured to determine an image recognition result of the page using the screenshot, wherein the image recognition result includes a keyword and a location, the keyword includes the keyword in the page, and the location includes the location of the keyword in the page;
a matching module configured to search the image recognition result for a keyword matched with the voice recognition result as a target keyword;
and the response module is configured to perform simulation operation at the position of the target keyword in the page to generate a response result of the voice.
7. The apparatus of claim 6, wherein the image recognition module is further configured to:
and carrying out image recognition on the screenshot to obtain an image recognition result.
8. The apparatus of claim 7, wherein the image recognition module is further configured to:
preprocessing the screenshot to obtain a corresponding binarized image, wherein the preprocessing comprises binarization;
determining a character area in a binary image and the position of the character area in the binary image, and extracting the characteristics of the character area;
and determining characters corresponding to the characteristics of the character area as keywords by using a preset characteristic library, and determining the keywords and the positions corresponding to the character area as image recognition results, wherein the preset characteristic library is used for storing the corresponding relation between the characteristics of the character area and the characters.
9. The apparatus of claim 6, wherein the voice recognition result comprises a user intent, the user intent comprising a keyword and a user operation; and
the matching module is further configured to:
search the image recognition result for a keyword matching the keyword included in the user intent, and use it as the target keyword; and
the simulated operation is used to simulate the user operation included in the user intent (one possible injection path is sketched after this claim).
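To make the simulated user operation of claim 9 concrete, one assumed injection path on an Android-style head unit is adb input events; the command set and operation names below are illustrative and not part of the claim:

import subprocess

def simulate(operation, position):
    # Dispatch the user operation carried by the intent to an input event
    # injected at the target keyword's on-screen position.
    x, y = position
    if operation == "click":
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
    elif operation == "long_press":
        # A swipe whose endpoints coincide, held for 800 ms, acts as a long press.
        subprocess.run(["adb", "shell", "input", "swipe",
                        str(x), str(y), str(x), str(y), "800"], check=True)
    else:
        raise ValueError(f"unsupported user operation: {operation}")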
10. The apparatus according to any one of claims 6-9, wherein the apparatus further comprises:
a preset keyword acquisition module configured to acquire a preset keyword corresponding to the position of the target keyword; and
a similarity determination module configured to determine a similarity between the target keyword and the preset keyword; and
the response module is further configured to:
perform, in response to determining that the similarity is greater than a preset similarity threshold, the simulated operation at the position of the target keyword in the page to generate the response result to the voice; and
generate prompt information in response to determining that the similarity is not greater than the preset similarity threshold, wherein the prompt information prompts the voice to be received again.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210880520.0A CN115101069A (en) 2022-07-25 2022-07-25 Voice control method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115101069A (en)

Family

ID=83299062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210880520.0A Pending CN115101069A (en) 2022-07-25 2022-07-25 Voice control method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115101069A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004057393A (en) * 2002-07-26 2004-02-26 Aruze Corp Game machine, game control process, and program
CN110428832A (en) * 2019-07-26 2019-11-08 苏州蜗牛数字科技股份有限公司 A kind of method that customized voice realizes screen control

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229973A (en) * 2023-03-16 2023-06-06 润芯微科技(江苏)有限公司 Method for realizing visible and can-say function based on OCR
CN116229973B (en) * 2023-03-16 2023-10-17 润芯微科技(江苏)有限公司 Method for realizing visible and can-say function based on OCR
CN116442769A (en) * 2023-03-30 2023-07-18 深圳市同行者科技有限公司 Vehicle-mounted central control interaction method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination