CN114625297A - Interaction method, device, equipment and storage medium - Google Patents

Interaction method, device, equipment and storage medium

Info

Publication number
CN114625297A
CN114625297A CN202210254315.3A
Authority
CN
China
Prior art keywords
interaction
elements
voice information
determining
interactive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210254315.3A
Other languages
Chinese (zh)
Inventor
黄荣升
牛飞
王加锋
高磊磊
陈轶博
张岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Shanghai Xiaodu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaodu Technology Co Ltd
Priority to CN202210254315.3A
Publication of CN114625297A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04817 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance, using icons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F 2203/048 Indexing scheme relating to G06F3/048
    • G06F 2203/04806 Zoom, i.e. interaction techniques or interactors for controlling the zooming operation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides an interaction method, an interaction apparatus, an interaction device, and a storage medium, which relate to the technical field of artificial intelligence, in particular to the technical fields of computer vision, speech recognition and deep learning, and are applicable to intelligent interaction scenarios. The specific implementation scheme is as follows: if an interaction requirement is detected, the selectable elements contained in the current display interface are analyzed to determine the analysis result of the selectable elements, the analysis result including text information and/or the display positions of the selectable elements on the current display interface; a target element corresponding to received voice information is determined according to the voice information and the analysis result of the selectable elements; and the interactive resource corresponding to the target element is processed and the processing result is output. The scheme can improve the intelligence of the intelligent screen-equipped interaction device.

Description

Interaction method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the field of computer vision, speech recognition and deep learning techniques, applicable to intelligent interactive scenes.
Background
With the development of artificial intelligence technology, intelligent screen-equipped interaction devices that support voice interaction are gradually emerging. By virtue of the screen, such interaction devices can present rich resources such as pictures, videos or text. However, the intelligence of current intelligent screen-equipped interaction devices is not ideal and urgently needs to be improved.
Disclosure of Invention
The present disclosure provides an interaction method, an interaction apparatus, an interaction device, and a storage medium.
According to an aspect of the present disclosure, there is provided an interaction method, including:
if the interaction requirement is detected, analyzing the selectable elements contained in the current display interface, and determining the analysis result of the selectable elements, wherein the analysis result comprises text information and/or the display positions of the selectable elements on the current display interface;
determining a target element corresponding to the voice information according to the received voice information and the analysis result of the optional element;
and processing the interactive resources corresponding to the target elements and outputting a processing result.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the interaction method of any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the interaction method of any one of the embodiments of the present disclosure.
According to the scheme of the embodiments of the present disclosure, the intelligence of the intelligent screen-equipped interaction device can be improved, and a new solution is provided for human-computer interaction based on the intelligent screen-equipped interaction device.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1A is a flow chart of an interaction method provided according to an embodiment of the present disclosure;
FIG. 1B is a schematic diagram of a current presentation interface provided in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow chart of an interaction method provided in accordance with an embodiment of the present disclosure;
FIG. 3 is a flow chart of an interaction method provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a flow chart of an interaction method provided in accordance with an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an interaction device provided in accordance with an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing the interaction method of the disclosed embodiments.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1A is a flowchart of an interaction method provided according to an embodiment of the present disclosure, and fig. 1B is a schematic diagram of a current presentation interface provided according to an embodiment of the present disclosure. The embodiment of the present disclosure is suitable for human-computer interaction based on an intelligent screen-equipped interaction device, and is particularly suitable for the situation in which an interaction object (such as a user) performs human-computer interaction with the interaction device based on the content displayed on the screen of the interaction device. The method may be performed by an interaction apparatus, which may be implemented in software and/or hardware and may be integrated into an intelligent screen-equipped interaction device or a cloud server device. As shown in fig. 1A-1B, the interaction method provided by this embodiment may include:
s101, if the interaction requirement is detected, analyzing the selectable elements contained in the current display interface, and determining the analysis result of the selectable elements.
The interaction requirement may be a requirement of the interaction object for human-computer interaction with the intelligent screen-equipped interaction device; for example, if the interaction object wants to view a resource on the device, an interaction requirement exists. The current display interface may be the interface presented on the display screen of the interaction device at the moment the interaction requirement exists.
The current display interface may contain one or more resources, and for each resource, all elements corresponding to the resource may be displayed to the interaction object through the current display interface, so that the interaction object and the interaction device can interact intelligently based on the resource. The elements involved in this embodiment may include image elements and text elements. A text element is an element displayed in textual form; an image element is an element presented in the form of a picture. Specifically, image elements further include icon elements and resource description image elements. For example, for a video resource, its corresponding elements in a resource search interface may include the video cover, the video name, a video synopsis, and so on; the elements contained in a resource playing interface may include operation icons (such as pause, fast forward, volume adjustment and the like) and the text corresponding to them. The selectable elements of this embodiment may be all elements presented to the interaction object that correspond to the resources contained in the current display interface. Accordingly, the selectable elements may include text elements and/or image elements.
Illustratively, as shown in fig. 1B, the current display interface includes three video resources, namely an A animation resource, a B animation resource, and a C animation resource. The selectable elements corresponding to the A animation resource include boxes 10 and 18 of the text type and box 15 of the image type. The selectable elements corresponding to the B animation resource include boxes 10 and 19 of the text type and box 16 of the image type. The selectable elements corresponding to the C animation resource include boxes 10 and 20 of the text type and box 17 of the image type.
The analysis result of the selectable element comprises text information and/or the display position of the selectable element on the current display interface. The text information of the selectable element may be text information corresponding to the selectable element, or semantic information obtained by performing semantic analysis on the text information corresponding to the selectable element.
It should be noted that, in this embodiment, if the text information analyzed from a selectable element includes multiple sub-results, the display position of the selectable element may be a display position shared by the multiple sub-results, or the display position of each sub-result may be retained separately.
For example, as shown in fig. 1B, the text information corresponding to box 16 (i.e., a selectable element) may include "Premieres 1.1" and "machine cat". The display position of box 16 may be the position coordinates of box 16 in the current display interface, or may be the position coordinates of the words "Premieres 1.1" in the current display interface together with the position coordinates of the "machine cat" pattern in the current display interface.
Optionally, in this embodiment, there are many ways to detect the interaction requirement, including but not limited to the following. In the first implementation, it is detected that the interaction device enters the working state; specifically, it may be detected that the intelligent screen-equipped interaction device is awakened from a sleep state (e.g., by a wake-up word, a gesture, or a key) and enters the working state, or that the device is started up from a power-off state and enters the working state. In the second implementation, voice information is received; specifically, interactive voice sent by the interaction object may be received through a sound pickup device (such as a microphone), for example the interaction object saying "turn on TV live". In the third implementation, it is detected that the distance to an interaction object is smaller than a set distance; specifically, a distance sensor or a camera mounted on the interaction device may detect that an interaction object is present near the device, and when the distance to the interaction object is smaller than the set distance, it indicates that the interaction object is approaching the device and may perform human-computer interaction with it, so an interaction requirement can be considered to exist. By flexibly detecting the interaction requirement in multiple ways, this embodiment can improve the comprehensiveness and accuracy of interaction requirement detection. It should be noted that, in this embodiment, the operation of detecting the interaction requirement is usually performed by the intelligent screen-equipped interaction device.
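As a rough illustration of the three detection modes just described, the following Python sketch combines them into a single check; the helper inputs and the proximity threshold are assumptions for illustration, not values given by the disclosure.

```python
from typing import Optional

PROXIMITY_THRESHOLD_M = 1.5  # assumed "set distance" for the proximity mode

def interaction_requirement_detected(entered_working_state: bool,
                                     voice_received: bool,
                                     object_distance_m: Optional[float]) -> bool:
    """Return True if any of the three described detection conditions holds."""
    if entered_working_state:        # device woken up or powered on into the working state
        return True
    if voice_received:               # interactive voice picked up by the microphone
        return True
    if object_distance_m is not None and object_distance_m < PROXIMITY_THRESHOLD_M:
        return True                  # an interaction object is close enough to interact
    return False
```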
Optionally, if the execution subject of this embodiment is the interaction device, then when the interaction device detects an interaction requirement in any of the above ways, it may directly obtain the current display interface on the display screen of the local device and analyze the text information and/or display positions of the selectable elements contained in that interface. If the execution subject of this embodiment is a cloud server, then when the interaction device detects an interaction requirement in any of the above ways, it may obtain the current display interface on the display screen of the local device (for example, a screenshot image of the current display interface) and send it to the cloud server, and the cloud server analyzes the text information and/or display positions of the selectable elements based on the received current display interface (for example, the screenshot image).
Optionally, in this embodiment, one implementation of analyzing the selectable elements contained in the current display interface may be as follows: obtain the current display interface (for example, a screenshot image of it), and extract all selectable elements contained in it based on an element object detection algorithm. Then, if text information needs to be analyzed, the display area of each selectable element in the current display interface is further analyzed to determine the text information of the selectable element; if display positions need to be analyzed, the display area of the selectable element in the current display interface may be used directly as its display position, or the display area of each sub-result in the text information analyzed from the selectable element may be used as a sub display position of the selectable element. Specifically, the whole display area may be used as the display position, or one or more position coordinates within the display area (such as the center point coordinates or corner point coordinates) may be used as the display position of the selectable element.
Another implementation is as follows: obtain the current display interface (for example, a screenshot image of it), and identify the analysis results corresponding to all selectable elements contained in it based on a pre-trained element detection model. The two implementations may also be combined to analyze the text information and/or display positions corresponding to all selectable elements contained in the current display interface more accurately and comprehensively. The analysis may also be performed in other ways, which this embodiment does not limit.
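A minimal sketch of the first implementation (detect element regions, then read their text and keep the region as the display position) might look as follows; detect_element_regions and recognize_text are placeholder stubs standing in for an element detection step and an OCR engine, not components named by the disclosure.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) on the current display interface

def detect_element_regions(screenshot) -> List[Box]:
    """Placeholder for an element object detection step (e.g. a trained detector)."""
    return []  # stub: a real system would return one box per selectable element

def recognize_text(screenshot, box: Box) -> Optional[str]:
    """Placeholder for an OCR pass over one element region."""
    return None  # stub: a real system would return the characters found in the region

def analyze_selectable_elements(screenshot) -> List[Dict]:
    """Detect selectable elements, then keep their text (if any) and display position."""
    elements = []
    for box in detect_element_regions(screenshot):
        elements.append({
            "text": recognize_text(screenshot, box),  # text information (may be None)
            "display_position": box,                  # whole display area as the position
        })
    return elements
```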
And S102, determining a target element corresponding to the voice information according to the received voice information and the analysis result of the optional element.
The voice information of this embodiment may be interactive voice sent by an interactive object received by an interactive device when the interactive object performs human-computer interaction with the interactive device. The target element refers to an optional element associated with the voice information, and may be, for example, an optional element described by the voice information.
Optionally, if the execution subject of this embodiment is the interaction device, then after receiving the voice information of the interaction object, the interaction device may determine the target element for the received voice information directly based on the text information and display positions of the selectable elements analyzed in S101. If the execution subject of this embodiment is the cloud server, then after receiving the voice information of the interaction object, the interaction device may send the voice information to the cloud server, and the cloud server determines the target element for the received voice information based on the text information and display positions of the selectable elements analyzed in S101.
Specifically, in this embodiment, the target element corresponding to the voice information may be determined from the received voice information and the analysis result of the selectable elements as follows: first, semantic analysis is performed on the received voice information based on a natural language recognition technology, such as a natural language processing algorithm or a pre-trained semantic recognition model, to determine the interaction requirement corresponding to the voice information; then the interaction requirement is matched against the text information and/or display position of each selectable element analyzed in S101, and the selectable element that conforms to the interaction requirement is determined as the target element. For example, the selectable element whose text information conforms to the interaction requirement may be determined as the target element, or the selectable element whose display position conforms to the interaction requirement may be determined as the target element; alternatively, a selectable element whose text information and display position both conform to the interaction requirement may be taken as the target element.
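As an illustration of the matching step, the sketch below scores each analyzed element by how much of the (already semantically parsed) request appears in its text information and returns the best match; the keyword-overlap scoring is a simplification assumed here, not the matching rule prescribed by the disclosure.

```python
from typing import Dict, List, Optional

def match_target_element(parsed_request: str, elements: List[Dict]) -> Optional[Dict]:
    """Pick the selectable element whose text information best matches the request."""
    best_element, best_score = None, 0
    for element in elements:
        text = (element.get("text") or "").lower()
        # crude overlap score between the request words and the element text
        score = sum(1 for word in parsed_request.lower().split() if word and word in text)
        if score > best_score:
            best_element, best_score = element, score
    return best_element

# Usage: "machine cat" matches the element whose text mentions the machine cat.
target = match_target_element("machine cat", [
    {"text": "A animation, Premieres 1.1", "display_position": (0, 0, 300, 80)},
    {"text": "machine cat, Premieres 1.1", "display_position": (0, 100, 300, 180)},
])
```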
For example, as shown in fig. 1B, if the received voice information is "I want to see the machine cat", the voice information is analyzed to determine that the interaction requirement is "request the resource related to the machine cat"; in this case the text information of each selectable element may be matched with the interaction requirement, and the target element is determined to be box 16. If the received voice information is "I want to see the first animation", the voice information is analyzed to determine that the interaction requirement is "request the resource in the first position"; in this case the display position of each selectable element may be matched with the interaction requirement, and the target element is determined to be box 15. If the received voice information is "I want to see the A animation in the first position", the voice information is analyzed to determine that the interaction requirement is "request the A animation located in the first position"; in this case both the text information and the display position of each selectable element may be matched with the interaction requirement, and the target elements are determined to be box 15 and box 18.
Optionally, in this embodiment, if multiple target elements are determined after the interaction requirement corresponding to the voice information is matched with the analysis result of the selectable elements, whether the multiple target elements are associated with the same resource may be judged according to the positional relationship between them. If so, the multiple target elements are all retained; if not, the multiple target elements need to be screened so that the finally retained target elements are associated with the same resource.
For example, as shown in fig. 1B, if the target elements are box 15 and box 18, both may be retained since the two target elements correspond to the same resource; if the target elements are box 15 and box 16, the two target elements are further screened because they correspond to different resources.
S103, processing the interactive resources corresponding to the target elements and outputting the processing result.
In this embodiment, one target element may correspond to one or more interactive resources. For example, if the target element is the description text or image of a specific resource, as shown by boxes 15-20 in fig. 1B, the target element corresponds to one interactive resource; if the target element is the description text or image of a type of resource, as shown by boxes 10-14 in fig. 1B, the interactive resources corresponding to the target element may include multiple resources of that type.
Optionally, after determining the target element corresponding to the voice information, this embodiment may obtain at least one resource corresponding to the target element as an interactive resource according to a corresponding relationship between the target element and the resource, and then process the obtained interactive resource based on an interaction requirement corresponding to the voice information to obtain a processing result, and output the processing result. Specifically, the processing of the acquired interaction resource may include, but is not limited to: playing the interactive resources, adjusting the playing state of the interactive resources, integrating the elements of a plurality of resources, and the like.
According to the scheme of this embodiment, when a human-computer interaction requirement exists, the text information and/or display positions of the selectable elements contained in the current display interface are analyzed in real time, the target element matching the received voice information is determined based on the real-time analysis result, and then the interactive resource corresponding to the target element is processed and the processing result is output. Because the interactive resource corresponding to the voice information is matched based on the analysis result of the selectable elements in the current display interface at interaction time, the interactive resource can be located accurately even if the voice information sent by the interaction object is not the exact name of the interactive resource but a description or the position of one of its elements. For fig. 1B, for example, the user does not need to say "B animation" exactly; describing the machine cat is enough to locate the resource corresponding to the B animation accurately, achieving a "say what you see" interaction effect. This greatly improves the intelligence of the intelligent screen-equipped interaction device. In addition, it should be noted that in this embodiment the analysis result of the selectable elements contained in the current display page is not configured in advance but is obtained by automatically analyzing the current display interface in real time when an interaction requirement exists. Since no analysis result has to be manually configured in advance for the selectable elements contained in the current display page, the interaction cost is greatly reduced; moreover, this overcomes the defect that, because the display interface of a third-party application cannot be obtained, its display elements cannot be analyzed in advance and human-computer interaction based on the third-party application therefore cannot be performed.
Optionally, in this embodiment, when the target element corresponding to the voice information is determined from the received voice information and the analysis result of the selectable elements, there may be no selectable element that matches the voice information, that is, the voice information may have no corresponding target element. In this case, the scheme of this embodiment may further include: if the voice information has no corresponding target element, outputting an interaction result corresponding to the voice information according to a preset interaction strategy. The preset interaction strategy may be an interaction strategy set in advance for different types of interaction requirements, a preset fixed result recall strategy, or the like. That is, after the voice information is received, the scheme of this embodiment is first used to respond to it, and if no target element corresponding to the voice information is found, the recall result corresponding to the voice information is determined according to the preset interaction strategy. This avoids the situation in which the interaction device produces no output because there is no matching target element, and further improves the flexibility and intelligence of the interaction result output.
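A toy illustration of such a fallback is given below; the strategy keys and the replies are illustrative assumptions, since the disclosure only states that a preset strategy is used when no target element is found.

```python
from typing import Dict

# Hypothetical preset strategies keyed by the type of interaction requirement.
PRESET_STRATEGIES: Dict[str, str] = {
    "play": "No matching item is on the current screen; searching the full resource library instead.",
    "query": "Here is a general answer recalled by the preset strategy.",
}

def fallback_interaction_result(requirement_type: str) -> str:
    """Return an interaction result from the preset strategy when no target element exists."""
    default_reply = "Sorry, nothing on the current screen matches that request."
    return PRESET_STRATEGIES.get(requirement_type, default_reply)
```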
Fig. 2 is a schematic diagram of an interaction method according to an embodiment of the present disclosure, and on the basis of the above embodiment, the embodiment of the present disclosure further explains in detail how to analyze selectable elements included in a current presentation interface and determine an analysis result of the selectable elements, as shown in fig. 2, the interaction method provided in this embodiment may include:
s201, if the interaction requirement is detected, determining the types of the selectable elements contained in the current display interface.
The types of the selectable elements in this embodiment may include an image type and a text type, and for the image type, the method may further include: icon type and resource description picture type.
In this embodiment, the element type of each selectable element contained in the current display page may be analyzed based on the characteristics of selectable elements of different types, or each selectable element contained in the current display interface may be analyzed by a pre-trained type analysis model to determine its type.
It should be noted that, in this embodiment, the types of the selectable elements included in the current presentation interface may be the same or different. Typically, the types of selectable elements contained in the current presentation interface are different. For example, there are some selectable elements that are of an image type and some selectable elements that are of a text type.
And S202, analyzing the different types of optional elements by adopting different types of analysis modes to obtain analysis results of the different types of optional elements.
Optionally, this embodiment may assign different element analysis modes to different types of selectable elements. Specifically, for selectable elements of the text type contained in the current display interface, a text recognition algorithm analysis mode is determined, for example an Optical Character Recognition (OCR) analysis mode; for selectable elements of the image type, a deep learning analysis mode is determined, for example analysis by a pre-trained object detection model.
Further, if the selectable elements of the image type are divided into finer-grained types, this embodiment may adopt the deep learning analysis mode for the resource description image type, while selectable elements of the icon type are analyzed with pre-configured analysis result templates corresponding to the various icon types, that is, the corresponding analysis mode is an icon template analysis mode.
Optionally, in this embodiment, each selectable element included in the current display page may be sequentially analyzed based on the corresponding element analysis mode, or different element analysis modes may be invoked in parallel for different types of selectable elements to analyze different types of selectable elements.
Specifically, for a selectable element of the text type, a text analysis algorithm may be invoked based on the text recognition algorithm analysis mode, for example the OCR algorithm is invoked to perform text recognition on the selectable element, so as to recognize all the characters contained in the selectable element and the position coordinates of each character on the current display interface. When the text information in the analysis result is determined, all recognized characters may be used directly as the text information, and/or semantic recognition may be performed on all recognized characters and the semantic recognition result used as the text information. When the display position in the analysis result is determined, the position coordinates corresponding to each character may be used as the display position of the selectable element, or the position coordinates corresponding to the characters may be fused, for example averaged, and the fusion result used as the display position of the selectable element.
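The sketch below illustrates this text path with pytesseract as a stand-in OCR engine (the disclosure names OCR only generically) and with the averaging variant of position fusion.

```python
import pytesseract
from PIL import Image

def analyze_text_element(region: Image.Image):
    """Recognize the characters in a text-type element region and fuse the
    per-word boxes into one display position by averaging their centers."""
    data = pytesseract.image_to_data(region, output_type=pytesseract.Output.DICT)
    words, centers = [], []
    for word, left, top, width, height in zip(
            data["text"], data["left"], data["top"], data["width"], data["height"]):
        if word.strip():
            words.append(word)
            centers.append((left + width / 2, top + height / 2))
    if not words:
        return None
    fused_x = sum(x for x, _ in centers) / len(centers)  # fused (averaged) display position
    fused_y = sum(y for _, y in centers) / len(centers)
    return {"text": " ".join(words), "display_position": (fused_x, fused_y)}
```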
For a selectable element of the image type, such as a selectable element of the resource description image type, based on the deep learning analysis mode, the whole image corresponding to the current display interface is input into a pre-trained object detection model, and the model performs object detection on each image-type selectable element in the current display interface. The description information of a detected object is used as the text information of the selectable element, and the position information of each detected object is used as the display position of the selectable element, or the position information of the detected objects is fused, for example averaged, and the fusion result is used as the display position of the selectable element.
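Correspondingly, the image path can be sketched as below; detector is any callable that returns (label, box) pairs, an interface assumed here rather than one fixed by the disclosure.

```python
from typing import Callable, Dict, List, Tuple

Detection = Tuple[str, Tuple[float, float, float, float]]  # (label, (x1, y1, x2, y2))

def analyze_image_elements(screenshot, detector: Callable[[object], List[Detection]]) -> List[Dict]:
    """Run a detection model over the interface image; keep each predicted label
    as text information and the box center as the display position."""
    elements = []
    for label, (x1, y1, x2, y2) in detector(screenshot):
        elements.append({
            "text": label,                                   # e.g. "machine cat"
            "display_position": ((x1 + x2) / 2, (y1 + y2) / 2),
        })
    return elements

# Usage with a dummy detector standing in for a trained model:
elements = analyze_image_elements(None, lambda img: [("machine cat", (40, 120, 200, 280))])
```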
S203, determining a target element corresponding to the voice information according to the received voice information and the analysis result of the optional element.
And S204, processing the interactive resources corresponding to the target elements and outputting a processing result.
According to the scheme of this embodiment, when a human-computer interaction requirement exists, the text information and/or display positions of the selectable elements contained in the current display interface are analyzed in real time, and different analysis modes are adopted for different types of selectable elements during the analysis. This greatly improves the accuracy of the analysis result of the selectable elements and thereby guarantees that the subsequent steps, which determine the target element matching the voice information based on the real-time analysis result and process the interactive resource corresponding to the target element, output an accurate interaction result.
Optionally, in this embodiment, another possible implementation of analyzing the selectable elements contained in the current display interface and determining their analysis result may further include: obtaining an original screenshot image of the current display interface; compressing the original screenshot image to obtain a compressed screenshot image; and analyzing the selectable elements contained in the compressed screenshot image to determine the analysis result of the selectable elements. Specifically, when the selectable elements contained in the current display interface are analyzed, the current display interface may first be captured with a screenshot tool to obtain a screenshot image, and then the screenshot image may be compressed, for example by quality compression, size compression and image cropping according to a preset image compression mode, to obtain the compressed screenshot image. Analyzing the selectable elements of the compressed image greatly reduces the amount of data involved in element analysis, improves element analysis efficiency, and thereby improves the efficiency of human-computer interaction between the interaction object and the interaction device.
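A minimal sketch of the compression step, using Pillow, is shown below; the maximum width and JPEG quality are illustrative values, not figures from the disclosure.

```python
from io import BytesIO
from PIL import Image

def compress_screenshot(original: Image.Image,
                        max_width: int = 1280,
                        jpeg_quality: int = 70) -> Image.Image:
    """Apply size compression (downscale) and quality compression (JPEG re-encode)."""
    image = original.convert("RGB")
    if image.width > max_width:  # size compression
        new_height = round(image.height * max_width / image.width)
        image = image.resize((max_width, new_height), Image.LANCZOS)
    buffer = BytesIO()
    image.save(buffer, format="JPEG", quality=jpeg_quality)  # quality compression
    buffer.seek(0)
    return Image.open(buffer)
```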
Further, the present embodiment may also combine the two possible implementation manners, that is, obtain the original screenshot image of the current display interface; compressing the original screenshot image to obtain a compressed screenshot image; determining the types of the selectable elements contained in the compressed screenshot image, and analyzing the different types of the selectable elements in different types of analyzing modes to obtain analyzing results of the different types of the selectable elements. Therefore, the accuracy of the analytic result is ensured while the element analytic efficiency is improved.
Fig. 3 is a schematic diagram of an interaction method according to an embodiment of the present disclosure, and on the basis of the above embodiment, the embodiment of the present disclosure further explains in detail how to determine a target element corresponding to voice information according to the received voice information and an analysis result of an optional element, as shown in fig. 3, the interaction method provided in this embodiment may include:
s301, analyzing the selectable elements contained in the current display interface, and determining the analysis result of the selectable elements.
And the analysis result comprises text information and/or the display position of the optional element on the current display interface.
S302, matching the received voice information with the text information of the optional elements to obtain at least two matching elements associated with the voice information.
Optionally, in this embodiment, semantic analysis may be performed on the received voice information based on a natural language recognition technology, such as a natural language processing algorithm or a pre-trained semantic recognition model, to determine the interaction requirement corresponding to the voice information. The interaction requirement is then matched against the text information and/or display position of each selectable element analyzed in S301, and the selectable elements that match the interaction requirement are determined; if multiple matching selectable elements are determined, they may all be taken as matching elements of the voice information. It should be noted that the multiple matching elements obtained in this embodiment may correspond to the same resource or to different resources.
For example, as shown in fig. 1B, if the received voice information is "I want to see the B animation with the machine cat", the voice information is matched with the text information of the selectable elements, the resulting matching elements are box 16 and box 19, and box 16 and box 19 correspond to the same animation resource. If the received voice information is "I want to see the animation that premieres on 1.1", the voice information is matched with the text information of the selectable elements, the resulting matching elements are box 15 and box 16, and box 15 and box 16 correspond to two different animation resources.
S303, determining a target element associated with the voice information from the at least two matching elements according to the display positions of the at least two matching elements on the current display interface.
Optionally, in this embodiment, there are many ways to determine, from the at least two matching elements, the target element associated with the voice information according to their display positions on the current display interface. One implementation is as follows: determine the display area range of each matching element's display position on the current display interface, compare the display area ranges of the matching elements, and select the matching element with the largest display area range, that is, the most prominent matching element, as the target element associated with the voice information. For example, if S302 determines that the matching elements are box 15 and box 16 in fig. 1B and the matched text information is "Premieres 1.1", the display position of box 15 may be the area occupied by "Premieres 1.1" in box 15 and the display position of box 16 may be the area occupied by "Premieres 1.1" in box 16; since the area occupied by "Premieres 1.1" in box 15 is larger than that in box 16, box 15 may be taken as the target element.
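This largest-area rule can be sketched as follows; each matching element is assumed to carry the bounding box of its matched text.

```python
from typing import Dict, List

def pick_most_prominent(matches: List[Dict]) -> Dict:
    """Keep the matching element whose matched text occupies the largest on-screen area."""
    def area(match: Dict) -> float:
        x1, y1, x2, y2 = match["matched_text_box"]
        return abs(x2 - x1) * abs(y2 - y1)
    return max(matches, key=area)

# Usage: the larger banner in box 15 wins over the smaller label in box 16.
target = pick_most_prominent([
    {"box_id": 15, "matched_text_box": (0, 0, 400, 120)},
    {"box_id": 16, "matched_text_box": (0, 300, 160, 340)},
])
```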
Another implementation is as follows: generate an element query statement according to the display position of each matching element and its matched content, output the element query statement to the interaction object, receive the voice information fed back by the interaction object based on the element query statement, and determine the target element from the at least two matching elements according to that feedback. For example, if S302 determines that the matching elements are box 15 and box 16 in fig. 1B and the matched text information is "Premieres 1.1", then based on the position information of box 15 and box 16 and the matched content "Premieres 1.1", an element query statement such as "Do you want the first or the second animation that premieres on 1.1 in the current interface?" may be generated; if the voice information fed back by the interaction object is "the second one", the target element can be determined to be box 16.
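A rough sketch of this clarifying-question alternative is given below; the ordering rule, the wording of the query and the reply keywords are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

def build_element_query(matches: List[Dict], matched_text: str) -> Tuple[List[Dict], str]:
    """Order the candidates on screen (top-to-bottom, left-to-right) and ask the user to pick one."""
    ordered = sorted(matches, key=lambda m: (m["display_position"][1], m["display_position"][0]))
    options = " or ".join(f"the {name} one" for name in ("first", "second", "third")[:len(ordered)])
    return ordered, f'Several items match "{matched_text}". Do you want {options}?'

def resolve_feedback(ordered: List[Dict], reply: str) -> Optional[Dict]:
    """Map a spoken reply such as 'the second one' back to a candidate (toy keyword matching)."""
    for keyword, index in (("first", 0), ("second", 1), ("third", 2)):
        if keyword in reply.lower() and index < len(ordered):
            return ordered[index]
    return None
```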
It should be noted that, the target element may also be determined according to the display position of the matching element in other manners, which is not limited herein.
S304, processing the interactive resources corresponding to the target elements and outputting the processing result.
According to the scheme of this embodiment, when a human-computer interaction requirement exists, the text information and/or display positions of the selectable elements contained in the current display interface are analyzed in real time, the elements matching the received voice information are determined based on the real-time analysis result, and if multiple matching selectable elements (that is, matching elements) are obtained, the target element is located from among them based on the display positions of the matching elements; the interactive resource corresponding to the target element is then processed and the processing result is output. When the voice information has multiple matching elements, further locating the target element precisely according to the display positions of the matching elements improves the accuracy of target element localization, provides a guarantee for the subsequent human-computer interaction based on the correct target element, and further improves the accuracy and intelligence of human-computer interaction.
Fig. 4 is a schematic diagram of an interaction method according to an embodiment of the present disclosure, and the embodiment of the present disclosure further explains in detail how to process an interaction resource corresponding to a target element on the basis of the above embodiment, as shown in fig. 4, the interaction method provided in this embodiment may include:
s401, analyzing the selectable elements contained in the current display interface, and determining the analysis result of the selectable elements.
And the analysis result comprises text information and/or the display position of the optional element on the current display interface.
S402, determining a target element corresponding to the voice information according to the received voice information and the analysis result of the optional element.
And S403, generating an interactive control instruction corresponding to the voice information according to the display position of the target element.
Optionally, in this embodiment, the display position of the target element may be obtained first: if the display position of the target element contains only one piece of position information, that position information is obtained directly; if the display position of the target element contains multiple display positions corresponding to sub information of the text information, the display position corresponding to the currently matched sub information may be obtained, or the average of the multiple display positions may be used as the finally obtained display position, and so on. After the display position of the target element is obtained, the trigger operation type corresponding to the display position, such as a single click, double click or slide, may be analyzed, and then the interaction control instruction corresponding to the voice information is generated based on the instruction generation logic corresponding to the trigger operation type in combination with the obtained display position. The interaction control instruction is used to instruct the interaction device or the cloud server to simulate the corresponding trigger operation (such as a single click, double click or slide) at that display position of the current display interface.
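The sketch below illustrates this step with the averaging option for multiple sub positions; the action names and the command structure are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class InteractionCommand:
    action: str                     # e.g. "tap", "double_tap" or "swipe" (assumed names)
    position: Tuple[float, float]   # screen coordinates at which to simulate the trigger

def build_interaction_command(display_positions: List[Tuple[float, float]],
                              trigger_type: str = "tap") -> InteractionCommand:
    """Average the parsed sub positions of the target element and bind the trigger type to it."""
    xs = [p[0] for p in display_positions]
    ys = [p[1] for p in display_positions]
    center = (sum(xs) / len(xs), sum(ys) / len(ys))
    return InteractionCommand(action=trigger_type, position=center)

# Usage: a single simulated tap at the element's (averaged) display position.
command = build_interaction_command([(120.0, 260.0), (180.0, 260.0)], trigger_type="tap")
```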
S404, responding to the interactive control instruction, processing the interactive resources corresponding to the target elements, and outputting a processing result.
Optionally, in this embodiment, the process of processing the interactive resource corresponding to the target element in response to the interaction control instruction may be as follows: simulate, at the display position of the target element on the current display interface, the trigger operation corresponding to the interaction control instruction, for example simulate a single-click operation at the "Premieres 1.1" position of box 15 in fig. 1B; then, according to the resource loading logic corresponding to the trigger operation, load the interactive resource corresponding to the target element from a local resource library or an external resource library, process the obtained interactive resource, generate the resource display interface corresponding to the trigger operation, and output it through the display screen of the interaction device.
Specifically, there may be many ways of processing the interactive resource corresponding to the target element in this embodiment. One processing mode is: if the resource loading logic corresponding to the trigger operation is playing a resource, the resource player may be started first, and the interactive resource is then loaded into the resource player. Another processing mode is: if the resource loading logic corresponding to the trigger operation is displaying a resource viewing list, all interactive resources to be displayed may be obtained first, and the selectable elements corresponding to these interactive resources are then rendered into the resource display interface. Other processing methods may also be used, which is not limited here.
According to the scheme of this embodiment, when a human-computer interaction requirement exists, the text information and/or display positions of the selectable elements contained in the current display interface are analyzed in real time, the target element matching the received voice information is determined based on the real-time analysis result, an interaction control instruction is generated according to the display position of the target element, and in response to the interaction control instruction the interactive resource corresponding to the target element is processed and the processing result is output. When the interactive resource of the target element is processed, the logic of a user manually operating the terminal device is simulated to generate the interaction control instruction corresponding to the display position of the target element, and the interactive resource is processed by responding to that instruction.
Optionally, in order to reduce the interaction cost, when interaction positions are configured for a resource on the display interface, an interaction position need not be configured for every element of the resource; for example, an interaction position may be configured for only one or a small number of elements of a resource. An interaction position is a position at which operations such as clicking or sliding by the interaction object can produce an interaction response. For this situation, generating the interaction control instruction corresponding to the voice information according to the display position of the target element in this embodiment includes: determining the interaction position corresponding to the display position of the target element; and generating the interaction control instruction corresponding to the voice information based on that interaction position. Specifically, the display position of the target element may be converted into the interaction position corresponding to the resource to which the target element belongs, and then the interaction control instruction corresponding to the voice information is generated based on the converted interaction position in combination with the instruction generation logic corresponding to the trigger operation type. The control instruction generated in this way can more accurately simulate the interaction object's manual operation on the current display interface, improving the accuracy and intelligence of the interaction process; in addition, a corresponding interaction position does not need to be configured in advance on the display interface for every selectable element of the interactive resource, which greatly reduces the interaction cost.
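One simple way to realize the conversion from an element's display position to its resource's configured interaction position is a nearest-point lookup, sketched below; the nearest-point rule is an assumption, since the disclosure only states that the conversion is performed.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def to_interaction_position(element_center: Point, configured_points: List[Point]) -> Point:
    """Map the element's display position to the closest configured interaction position."""
    def squared_distance(point: Point) -> float:
        return (point[0] - element_center[0]) ** 2 + (point[1] - element_center[1]) ** 2
    return min(configured_points, key=squared_distance)

# Usage: the element's position is snapped to the resource's single configured tap point.
tap_point = to_interaction_position((150.0, 260.0), [(160.0, 250.0)])
```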
Fig. 5 is a schematic structural diagram of an interaction apparatus provided according to an embodiment of the present disclosure. The apparatus is suitable for human-computer interaction based on an intelligent screen-equipped interaction device, and is particularly suitable for the situation in which an interaction object (such as a user) performs human-computer interaction with the interaction device based on the content displayed on the screen of the interaction device. The apparatus can be configured in an intelligent screen-equipped interaction device or a cloud server device, is implemented in software and/or hardware, and can implement the interaction method of any embodiment of the present disclosure. As shown in fig. 5, the interaction apparatus 500 includes:
the element analysis module 501 is configured to, if an interaction requirement is detected, analyze an optional element included in the current display interface, and determine an analysis result of the optional element, where the analysis result includes text information and/or a display position of the optional element on the current display interface;
an element matching module 502, configured to determine, according to the received voice information and an analysis result of the optional element, a target element corresponding to the voice information;
the resource processing module 503 is configured to process the interactive resource corresponding to the target element, and output a processing result.
According to the scheme of this embodiment, when a human-computer interaction requirement exists, the text information and/or display positions of the selectable elements contained in the current display interface are analyzed in real time, the target element matching the received voice information is determined based on the real-time analysis result, and then the interactive resource corresponding to the target element is processed and the processing result is output. Because the interactive resource corresponding to the voice information is matched based on the analysis result of the selectable elements in the current display interface at interaction time, the interactive resource can be located accurately even if the voice information sent by the interaction object is not the exact name of the interactive resource but a description or the position of one of its elements. For fig. 1B, for example, the user does not need to say "B animation" exactly; describing the machine cat is enough to locate the resource corresponding to the B animation accurately, achieving a "say what you see" interaction effect. This greatly improves the intelligence of the intelligent screen-equipped interaction device. In addition, it should be noted that in this embodiment the analysis result of the selectable elements contained in the current display page is not configured in advance but is obtained by automatically analyzing the current display interface in real time when an interaction requirement exists. Since no analysis result has to be manually configured in advance for the selectable elements contained in the current display page, the interaction cost is greatly reduced; moreover, this overcomes the defect that human-computer interaction based on a third-party application cannot be performed because the display interface of the third-party application cannot be obtained and its display elements therefore cannot be analyzed in advance.
Further, the element parsing module 501 is specifically configured to:
determining the type of the selectable element contained in the current display interface;
and analyzing the different types of optional elements by adopting different types of analysis modes to obtain analysis results of the different types of optional elements.
Further, the element parsing module 501 is further specifically configured to:
acquiring an original screen capture image of a current display interface;
compressing the original screenshot image to obtain a compressed screenshot image;
and analyzing the selectable elements contained in the compressed screenshot image, and determining the analysis result of the selectable elements.
Further, the element matching module 502 is specifically configured to:
matching the received voice information with the text information of the optional elements to obtain at least two matching elements associated with the voice information;
and determining a target element associated with the voice information from the at least two matching elements according to the display positions of the at least two matching elements on the current display interface.
Further, the resource processing module 503 includes:
the instruction generating unit is used for generating an interactive control instruction corresponding to the voice information according to the display position of the target element;
and the instruction response unit is used for responding the interaction control instruction and processing the interaction resources corresponding to the target elements.
Further, the instruction generating unit is specifically configured to:
determining an interaction position corresponding to the display position of the target element;
and generating an interaction control instruction corresponding to the voice information based on the interaction position.
Further, the interaction apparatus 500 further includes:
and the strategy interaction module is used for outputting an interaction result corresponding to the voice information according to a preset interaction strategy if the corresponding target element does not exist in the voice information.
Further, the interaction apparatus 500 further includes: an interaction requirement detection module for performing at least one of:
detecting that the interactive equipment enters a working state;
receiving voice information;
and detecting that the distance between the interaction object and the detection object is less than the set distance.
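The three trigger conditions listed above can be combined as a simple disjunction; the 1-meter default threshold in the sketch below is an assumed value chosen only for illustration.

from typing import Optional

def interaction_required(device_active: bool,
                         voice_received: bool,
                         distance_m: Optional[float],
                         threshold_m: float = 1.0) -> bool:
    # Any one of the conditions is sufficient to detect an interaction requirement.
    near_enough = distance_m is not None and distance_m < threshold_m
    return device_active or voice_received or near_enough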
Further, optional elements of this embodiment include: text elements and/or image elements.
The above product can execute the method provided by any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the executed method.
In the technical solution of the present disclosure, the acquisition, storage, and application of the display interface, the voice information, and the text and position information of the elements comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the interaction method. For example, in some embodiments, the interaction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the interaction method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the interaction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
Cloud computing refers to a technology system that accesses an elastically scalable shared pool of physical or virtual resources through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for technical applications and model training in fields such as artificial intelligence and blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

1. An interaction method, comprising:
if the interaction requirement is detected, analyzing optional elements contained in the current display interface, and determining the analysis result of the optional elements, wherein the analysis result comprises text information and/or the display positions of the optional elements on the current display interface;
determining a target element corresponding to the voice information according to the received voice information and the analysis result of the optional element;
and processing the interactive resources corresponding to the target elements and outputting a processing result.
2. The method of claim 1, wherein the parsing the selectable element included in the current presentation interface and determining the parsing result of the selectable element comprises:
determining the type of the selectable elements contained in the current display interface;
and analyzing the different types of optional elements by adopting different types of analysis modes to obtain analysis results of the different types of optional elements.
3. The method of claim 1, wherein the parsing the selectable element included in the current presentation interface and determining the parsing result of the selectable element comprises:
acquiring an original screen capture image of a current display interface;
compressing the original screenshot image to obtain a compressed screenshot image;
and analyzing optional elements contained in the compressed screen capture image, and determining an analysis result of the optional elements.
4. The method of claim 1, wherein the determining, according to the received voice information and the parsing result of the optional element, a target element corresponding to the voice information comprises:
matching the received voice information with the text information of the optional elements to obtain at least two matching elements associated with the voice information;
and determining a target element associated with the voice information from the at least two matching elements according to the display positions of the at least two matching elements on the current display interface.
5. The method of claim 1, wherein the processing the interaction resource corresponding to the target element comprises:
generating an interactive control instruction corresponding to the voice information according to the display position of the target element;
and responding to the interaction control instruction, and processing the interaction resources corresponding to the target elements.
6. The method of claim 5, wherein the generating of the interactive control instruction corresponding to the voice information according to the display position of the target element comprises:
determining an interaction position corresponding to the display position of the target element;
and generating an interaction control instruction corresponding to the voice information based on the interaction position.
7. The method of claim 1, further comprising:
and if no target element corresponding to the voice information exists, outputting an interaction result corresponding to the voice information according to a preset interaction strategy.
8. The method of any of claims 1-7, wherein detecting an interaction need comprises at least one of:
detecting that the interactive equipment enters a working state;
receiving voice information;
and detecting that the distance between the interaction object and the detection object is less than the set distance.
9. The method of any of claims 1-7, wherein the selectable elements include: text elements and/or image elements.
10. An interaction device, comprising:
the element analysis module is used for analyzing optional elements contained in the current display interface and determining the analysis result of the optional elements if the interaction requirement is detected, wherein the analysis result comprises text information and/or the display positions of the optional elements on the current display interface;
the element matching module is used for determining a target element corresponding to the voice information according to the received voice information and the analysis result of the optional element;
and the resource processing module is used for processing the interactive resources corresponding to the target elements and outputting processing results.
11. The apparatus of claim 10, wherein the element parsing module is specifically configured to:
determining the type of the selectable element contained in the current display interface;
and analyzing the different types of optional elements by adopting different types of analysis modes to obtain analysis results of the different types of optional elements.
12. The apparatus of claim 10, wherein the element parsing module is further specifically configured to:
acquiring an original screen capture image of a current display interface;
compressing the original screenshot image to obtain a compressed screenshot image;
and analyzing optional elements contained in the compressed screen capture image, and determining an analysis result of the optional elements.
13. The apparatus of claim 10, wherein the element matching module is specifically configured to:
matching the received voice information with the text information of the optional elements to obtain at least two matching elements associated with the voice information;
and determining a target element associated with the voice information from the at least two matching elements according to the display positions of the at least two matching elements on the current display interface.
14. The apparatus of claim 10, wherein the resource processing module comprises:
the instruction generating unit is used for generating an interactive control instruction corresponding to the voice information according to the display position of the target element;
and the instruction response unit is used for responding to the interaction control instruction and processing the interaction resource corresponding to the target element.
15. The apparatus according to claim 14, wherein the instruction generation unit is specifically configured to:
determining an interaction position corresponding to the display position of the target element;
and generating an interaction control instruction corresponding to the voice information based on the interaction position.
16. The apparatus of claim 10, further comprising:
and the strategy interaction module is used for, if no target element corresponding to the voice information exists, outputting an interaction result corresponding to the voice information according to a preset interaction strategy.
17. The apparatus of any of claims 10-16, further comprising: an interaction requirement detection module for performing at least one of:
detecting that the interactive equipment enters a working state;
receiving voice information;
and detecting that the distance between the interaction object and the detection object is less than the set distance.
18. The apparatus of any of claims 10-16, wherein the selectable elements include: text elements and/or image elements.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the interaction method of any of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the interaction method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the interaction method according to any one of claims 1-9.
CN202210254315.3A 2022-03-15 2022-03-15 Interaction method, device, equipment and storage medium Pending CN114625297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210254315.3A CN114625297A (en) 2022-03-15 2022-03-15 Interaction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114625297A true CN114625297A (en) 2022-06-14

Family

ID=81902157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210254315.3A Pending CN114625297A (en) 2022-03-15 2022-03-15 Interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114625297A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030156130A1 (en) * 2002-02-15 2003-08-21 Frankie James Voice-controlled user interfaces
JP2013061858A (en) * 2011-09-14 2013-04-04 Denso Corp Interface device
JP2015049365A (en) * 2013-09-02 2015-03-16 智恵 山本 Interactive video system
JP2016014897A (en) * 2015-10-14 2016-01-28 株式会社東芝 Voice interaction support device, method and program
CN108364645A (en) * 2018-02-08 2018-08-03 北京奇安信科技有限公司 A kind of method and device for realizing page interaction based on phonetic order
CN108877796A (en) * 2018-06-14 2018-11-23 合肥品冠慧享家智能家居科技有限责任公司 The method and apparatus of voice control smart machine terminal operation
CN110085224A (en) * 2019-04-10 2019-08-02 深圳康佳电子科技有限公司 Intelligent terminal whole process speech control processing method, intelligent terminal and storage medium
CN112685535A (en) * 2020-12-25 2021-04-20 广州橙行智动汽车科技有限公司 Voice interaction method, server, voice interaction system and storage medium
CN113900620A (en) * 2021-11-09 2022-01-07 杭州逗酷软件科技有限公司 Interaction method, interaction device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG WEI; LIU YAOHE; TAN BAOHUA: "Implementation of speech synthesis based on OCR-recognized information on the Web", Journal of Hubei University of Technology, no. 02, pages 43 - 46 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820656A (en) * 2023-07-07 2023-09-29 上海联适导航技术股份有限公司 Large screen interface generation method, device, equipment and storage medium
CN118098227A (en) * 2024-02-02 2024-05-28 太极计算机股份有限公司 Visual processing method for voice interaction data based on BI intelligent analysis

Similar Documents

Publication Publication Date Title
CN109240576B (en) Image processing method and device in game, electronic device and storage medium
CN111259751B (en) Human behavior recognition method, device, equipment and storage medium based on video
CN109300179B (en) Animation production method, device, terminal and medium
JP6986187B2 (en) Person identification methods, devices, electronic devices, storage media, and programs
US20230153387A1 (en) Training method for human body attribute detection model, electronic device and medium
CN108965981B (en) Video playing method and device, storage medium and electronic equipment
CN114625297A (en) Interaction method, device, equipment and storage medium
EP3905122B1 (en) Video type detection method, apparatus, electronic device and storage medium
CN113325954B (en) Method, apparatus, device and medium for processing virtual object
US20220036068A1 (en) Method and apparatus for recognizing image, electronic device and storage medium
CN113361572B (en) Training method and device for image processing model, electronic equipment and storage medium
CN114332977A (en) Key point detection method and device, electronic equipment and storage medium
CN112712498A (en) Vehicle damage assessment method and device executed by mobile terminal, mobile terminal and medium
CN113204665A (en) Image retrieval method, image retrieval device, electronic equipment and computer-readable storage medium
CN112669837A (en) Awakening method and device of intelligent terminal and electronic equipment
CN114356275B (en) Interactive control method and device, intelligent voice equipment and storage medium
CN114092608B (en) Expression processing method and device, computer readable storage medium and electronic equipment
CN111242455A (en) Method and device for evaluating voice function of electronic map, electronic equipment and storage medium
US20220210508A1 (en) Method and electronic device for displaying video
CN112508163B (en) Method and device for displaying subgraph in neural network model and storage medium
CN114842476A (en) Watermark detection method and device and model training method and device
CN114740975A (en) Target content acquisition method and related equipment
CN111858855A (en) Information query method, device, system, electronic equipment and storage medium
CN114237479B (en) Control method and device of application program and electronic equipment
CN112988011B (en) Word-taking translation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination