CN112542163B - Intelligent voice interaction method, device and storage medium - Google Patents

Intelligent voice interaction method, device and storage medium Download PDF

Info

Publication number
CN112542163B
CN112542163B CN201910833270.3A CN201910833270A
Authority
CN
China
Prior art keywords
image
user
content
voice recognition
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910833270.3A
Other languages
Chinese (zh)
Other versions
CN112542163A (en)
Inventor
罗荣刚
陆永帅
揭东辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Shanghai Xiaodu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd, Shanghai Xiaodu Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910833270.3A priority Critical patent/CN112542163B/en
Publication of CN112542163A publication Critical patent/CN112542163A/en
Application granted granted Critical
Publication of CN112542163B publication Critical patent/CN112542163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Abstract

The application discloses an intelligent voice interaction method, device, and storage medium, relating to voice technology. The method may include the following steps: performing voice recognition on a voice request input by a user to obtain a voice recognition result; performing semantic understanding on the voice recognition result to recognize the intention of the user; if the intention of the user depends on image input, extracting the content of interest of the user from an acquired first image, the first image being obtained by photographing an object placed by the user in a specified shooting area; and generating response content according to the voice recognition result and the content of interest of the user, and returning the response content to the user. By applying the scheme of the application, the accuracy of the intelligent voice interaction result can be improved.

Description

Intelligent voice interaction method, device and storage medium
[ field of technology ]
The present application relates to computer application technologies, and in particular, to an intelligent voice interaction method, apparatus, and storage medium in voice technology.
[ background Art ]
Education is an important attribute of existing intelligent voice interaction devices. However, interaction around educational content is generally single-modal: only input in the voice dimension is supported, and after the voice request input by the user is obtained, corresponding response content is given, for example, the answer to a question or a corresponding video resource.
However, the amount of information that voice alone can express is limited, and it is difficult to convey the user's complete intention through voice only. For example, a user who wants the answer to a word problem containing a figure can hardly describe it clearly through voice input alone; accordingly, the response content given by the intelligent voice interaction device is likely to be inaccurate, which reduces the accuracy of the intelligent voice interaction result.
[ summary of the application ]
In view of the above, the application provides an intelligent voice interaction method, device and storage medium.
The specific technical scheme is as follows:
an intelligent voice interaction method, comprising:
performing voice recognition on a voice request input by a user to obtain a voice recognition result;
performing semantic understanding on the voice recognition result to recognize the intention of the user;
if the intention depends on image input, extracting the content of interest of the user from an acquired first image, wherein the first image is obtained by photographing an object placed by the user in a specified shooting area;
and generating response content according to the voice recognition result and the content of interest of the user, and returning the response content to the user.
According to a preferred embodiment of the present application, the extracting the content of interest of the user in the acquired first image includes:
comparing the acquired first image with the second image to determine a user region of interest in the first image;
extracting the content of interest of the user from the region of interest of the user;
the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
According to a preferred embodiment of the present application, the determining the region of interest of the user in the first image by comparing the acquired first image with the second image includes:
acquiring a difference image of the first image and the second image;
acquiring a binary image corresponding to the difference image;
determining the pointing position of a user in the binary image;
and determining a user interest area in the first image according to the user pointing position.
According to a preferred embodiment of the present application, the determining the user pointing position in the binary image includes:
determining, in the binary image, a foreground pixel point closest to the central pixel point of the binary image, and taking the position of the foreground pixel point as the pointing position of the user.
According to a preferred embodiment of the present application, the determining the user interest area in the first image according to the user pointing position includes:
performing target segmentation on the first image through a predetermined algorithm based on the user pointing position, to obtain the user region of interest containing the user pointing position.
According to a preferred embodiment of the present application, before acquiring the difference image of the first image and the second image, the method further includes: performing image registration on the first image and the second image.
According to a preferred embodiment of the present application, the content of interest to the user includes: text content, and/or image content.
According to a preferred embodiment of the application, the method further comprises: if the intention does not depend on image input, generating response content according to the voice recognition result and returning the response content to the user.
An intelligent voice interaction device, comprising: a voice processing unit, an image analyzing unit, and a response generating unit;
the voice processing unit is used for carrying out voice recognition on a voice request input by a user to obtain a voice recognition result, carrying out semantic understanding on the voice recognition result and recognizing the intention of the user;
the image analysis unit is used for extracting the content of interest of the user from an acquired first image when the intention depends on image input, wherein the first image is obtained by photographing an object placed by the user in a specified shooting area;
and the response generating unit is used for generating response content according to the voice recognition result and the content of interest of the user and returning the response content to the user.
According to a preferred embodiment of the present application, the image analysis unit determines a user interest area in the first image by comparing the acquired first image and second image, and extracts the user interest content from the user interest area;
the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
According to a preferred embodiment of the present application, the image analysis unit obtains a difference image between the first image and the second image, obtains a binary image corresponding to the difference image, determines a user pointing position in the binary image, and determines a user region of interest in the first image according to the user pointing position.
According to a preferred embodiment of the present application, the image analysis unit determines a foreground pixel point closest to a center pixel point of the binary image in the binary image, and uses a position of the foreground pixel point as the pointing position of the user.
According to a preferred embodiment of the present application, the image analysis unit performs object segmentation on the first image by a predetermined algorithm based on the user pointing position, to obtain the user region of interest including the user pointing position.
According to a preferred embodiment of the application, the image analysis unit is further adapted to perform image registration on the first image and the second image before acquiring the difference image of the first image and the second image.
According to a preferred embodiment of the present application, the content of interest to the user includes: text content, and/or image content.
According to a preferred embodiment of the present application, the response generating unit is further configured to generate response content according to the voice recognition result and return it to the user if the intention does not depend on image input.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of the above.
A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
As can be seen from the above description, the scheme of the application can acquire user input in two dimensions, voice and image, thereby obtaining a more complete user intention, generating more accurate response content, and improving the accuracy of the intelligent voice interaction result.
[ description of the drawings ]
Fig. 1 is a flowchart of a first embodiment of an intelligent voice interaction method according to the present application.
Fig. 2 is a flowchart of a second embodiment of the intelligent voice interaction method according to the present application.
Fig. 3 is a schematic view of a first image according to the present application.
Fig. 4 is a schematic diagram of a second image according to the present application.
Fig. 5 is a schematic diagram of a difference image according to the present application.
Fig. 6 is a schematic diagram of a binary image according to the present application.
Fig. 7 is a schematic diagram of a region of interest of a user according to the present application.
Fig. 8 is a schematic diagram of a composition structure of an embodiment of the intelligent voice interaction device according to the present application.
Fig. 9 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application.
[ detailed description ]
In order to make the technical solution of the present application more clear and obvious, the solution of the present application will be further described below by referring to the accompanying drawings and examples.
It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
In addition, it should be understood that the term "and/or" herein is merely one association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Fig. 1 is a flowchart of a first embodiment of an intelligent voice interaction method according to the present application. As shown in fig. 1, the following detailed implementation is included.
In 101, voice recognition is performed on a voice request input by a user to obtain a voice recognition result.
At 102, semantic understanding is performed on the speech recognition results to recognize the intent of the user.
In 103, if the intention of the user depends on image input, the content of interest of the user is extracted from the acquired first image, the first image being obtained by photographing an object placed by the user in a specified shooting area.
In 104, response content is generated according to the voice recognition result and the content of interest of the user, and is returned to the user.
The user can perform voice interaction with the intelligent voice interaction device. After a voice request input by the user is acquired, voice recognition can be performed in the existing manner to obtain a voice recognition result. Further, semantic understanding can be performed on the voice recognition result to recognize the intention of the user, which may or may not depend on image input.
How to determine whether the intention depends on image input is not limited. For example, regular expression matching may be adopted, or a machine learning model trained in advance may be used for the determination; such a model can be trained with constructed training samples.
For example, if the voice request is "how is this question done" or "what does this English word mean," it can be determined that image input is needed; if the voice request is "how is the English word for apple pronounced" or "why is the sky blue," it can be determined that image input is not needed.
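As an illustration of the rule-based option mentioned above, the following minimal sketch checks the recognized text against a few trigger patterns. The patterns, function name, and example requests are assumptions for illustration only; the application does not prescribe any particular rule set, and a trained classifier could be used instead.

```python
# Minimal sketch of a rule-based "does this intent depend on image input?" check.
# The trigger patterns below are hypothetical, not taken from the application.
import re

IMAGE_DEPENDENT_PATTERNS = [
    r"\bthis (question|problem|word|sentence|figure)\b",  # refers to something on the desk
    r"\bhow (is|do i do) this\b",
]

def intent_needs_image(recognition_result: str) -> bool:
    """Return True if the recognized request likely refers to content the user points at."""
    text = recognition_result.lower()
    return any(re.search(p, text) for p in IMAGE_DEPENDENT_PATTERNS)

print(intent_needs_image("how is this question done"))   # True
print(intent_needs_image("why is the sky blue"))          # False
```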
If image input is not needed, response content can be generated according to the voice recognition result in the existing manner and returned to the user. For example, a search can be performed in a designated database and response content generated according to the search results; the response content can be a simple text broadcast, video content, or the like.
If the intention of the user depends on image input, the content of interest of the user can be extracted from the acquired first image, the first image being obtained by photographing an object placed by the user in a specified shooting area; response content can then be generated according to the voice recognition result and the content of interest of the user and returned to the user.
Preferably, extracting the content of interest of the user from the acquired first image may include: comparing the acquired first image with an acquired second image to determine a region of interest (ROI) of the user in the first image, and extracting the content of interest of the user from the user region of interest. The first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
For example, if the user wants to ask how to read an English sentence on a certain page of a textbook, the textbook can be opened to that page and placed in the designated shooting area so that the first image can be captured. The user can then issue the voice request "how is this sentence read" while pointing a finger at the position of the English sentence on the page; accordingly, it is determined that the intention of the user depends on image input, and the second image is captured. The state of the textbook must remain unchanged when the second image is captured, that is, the position of the textbook and the page it is opened to must not change between the two shots.
It can be seen that the first image and the second image differ only in whether or not the user's finger-pointing information is contained. In practical applications, the finger may be replaced by another tool, such as a pen, that is, the pen may point at the position of the English word on the page; the present application is not limited in this respect.
Preferably, image registration may first be performed on the acquired first and second images. Since there is a certain time difference between the first image and the second image, the photographed object (for example, the textbook) may move slightly, so a registration operation on the first image and the second image can be performed to ensure the accuracy of subsequent processing. The specific manner is not limited; features such as SIFT (Scale-Invariant Feature Transform) operators or corner points may be used for registration.
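As one possible implementation of this registration step (a sketch under the assumption that OpenCV's SIFT implementation is available as cv2.SIFT_create, i.e. opencv-python 4.4 or later), the second image can be warped onto the first using matched SIFT features. The ratio-test value and RANSAC threshold are illustrative choices, not values given by the application.

```python
# Sketch: feature-based registration of the second image onto the first using SIFT.
import cv2
import numpy as np

def register(first_img, second_img):
    """Return the second image warped into the coordinate frame of the first image."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(first_img, None)
    kp2, des2 = sift.detectAndCompute(second_img, None)

    # Match descriptors of the second image against the first, with Lowe's ratio test.
    matches = cv2.BFMatcher().knnMatch(des2, des1, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    homography, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = first_img.shape[:2]
    return cv2.warpPerspective(second_img, homography, (w, h))
```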
Then, a difference image of the first image and the second image is obtained, that is, the values of corresponding pixel points in the first image and the second image are subtracted to obtain the difference image; the difference is introduced by the user's pointing.
Further, a binary image corresponding to the difference image may be obtained; for example, erosion, dilation, and binarization operations may be applied to the difference image to obtain the corresponding binary image. The binary image contains only foreground pixel points with a value of 1 and background pixel points with a value of 0.
The acquired binary image is then analyzed to determine the user pointing position. Preferably, the foreground pixel point in the binary image closest to the central pixel point of the binary image can be determined, and the position of that foreground pixel point taken as the pointing position of the user.
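A sketch of these three steps, assuming grayscale images of equal size, is given below; the binarization threshold and the 5x5 morphological kernel are placeholder values rather than parameters specified by the application.

```python
# Sketch: difference image -> erosion/dilation + binarization -> pointing position.
import cv2
import numpy as np

def locate_pointing_position(first_gray, second_gray):
    """Return the (x, y) foreground pixel closest to the image center, or None."""
    diff = cv2.absdiff(second_gray, first_gray)          # difference caused by the finger/pen

    _, binary = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.dilate(cv2.erode(binary, kernel), kernel)  # suppress small noise

    ys, xs = np.nonzero(binary)                          # foreground pixel coordinates
    if len(xs) == 0:
        return None
    cy, cx = binary.shape[0] / 2.0, binary.shape[1] / 2.0
    nearest = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
    return int(xs[nearest]), int(ys[nearest])
```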
Based on the user pointing position, the user region of interest in the first image may be determined. Preferably, target segmentation may be performed on the first image by a predetermined algorithm based on the user pointing position, resulting in a user region of interest containing the user pointing position. The specific target segmentation method is not limited; for example, segmentation may be achieved through region growing with length and width constraints starting from the user pointing position, or by using a machine learning model trained in advance.
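As a sketch of the region-growing option (assuming dark text or figures on a light page; the growth step and maximum width/height are illustrative limits, not values from the application), a window around the pointing position can be expanded until no new ink-like pixels are gained or a size limit is reached:

```python
# Sketch: grow a rectangular region of interest around the pointing position.
import cv2
import numpy as np

def grow_roi(first_gray, point, max_w=400, max_h=200, step=10):
    """Expand a box around `point` until its ink-pixel count stops growing or limits are hit."""
    _, ink = cv2.threshold(first_gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)  # dark-on-light mask
    x, y = point
    x0, y0, x1, y1 = x - step, y - step, x + step, y + step
    prev_count = -1
    while True:
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, first_gray.shape[1]), min(y1, first_gray.shape[0])
        count = int(np.count_nonzero(ink[y0:y1, x0:x1]))
        too_big = (x1 - x0) >= max_w or (y1 - y0) >= max_h
        if count == prev_count or too_big:
            break
        prev_count = count
        x0, y0, x1, y1 = x0 - step, y0 - step, x1 + step, y1 + step
    return first_gray[y0:y1, x0:x1]   # the user region of interest as an image crop
```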
After the region of interest of the user is obtained, the content of interest of the user can be extracted therefrom, and the content of interest of the user can comprise: text content, and/or image content, etc.
If the user region of interest contains only text content, the text content can be used as the content of interest of the user, and response content can then be generated according to the voice recognition result and the content of interest and returned to the user. For example, if the voice recognition result is "what does this English word mean" and the content of interest is "twelve," the definition of that English word can be displayed and broadcast to the user.
If the user region of interest contains only image content, the image content can be used as the content of interest of the user, and response content can likewise be generated according to the voice recognition result and the content of interest and returned to the user. For example, if the voice recognition result is "how is this figure expressed in English" and the content of interest is a trapezoid, the English word corresponding to the trapezoid can be displayed and broadcast to the user.
If the user region of interest contains both text content and image content, both can be used as the content of interest of the user, and response content can again be generated according to the voice recognition result and the content of interest and returned to the user. For example, if the voice recognition result is "how is this question done" and the content of interest is a word problem containing both image content and text content, the answer to that word problem can be displayed and broadcast to the user.
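The application does not name an OCR engine or an answer source; as a hedged sketch, text content could be pulled from the region of interest with pytesseract (an assumed dependency) and combined with the voice recognition result into a retrieval query:

```python
# Sketch: extract text content of interest from the ROI and compose a lookup query.
import pytesseract  # assumed OCR dependency; any OCR engine could be substituted

def extract_content_of_interest(roi_image):
    """Return the text (possibly empty) and the image crop found in the region of interest."""
    text = pytesseract.image_to_string(roi_image).strip()
    return {"text": text or None, "image": roi_image}

def compose_query(recognition_result, content):
    """Join the spoken request with the recognized text, if any, for downstream retrieval."""
    if content["text"]:
        return f"{recognition_result} | {content['text']}"   # e.g. "...mean | twelve"
    return recognition_result
```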
To facilitate interaction with the user, existing intelligent voice interaction devices are usually provided with a screen. To implement the scheme of the present application, a camera is also required on the device, and the camera must be able to capture an image of the designated shooting area. For example, when the device is placed on a table, the designated shooting area can be the desktop area in front of and below the device; if the camera cannot capture the corresponding area, its orientation can be adjusted by means of a camera steering tool or the like.
Based on the above description, fig. 2 is a flowchart of a second embodiment of the intelligent voice interaction method according to the present application. As shown in fig. 2, the following detailed implementation is included.
In 201, voice recognition is performed on a voice request input by a user to obtain a voice recognition result.
In 202, the speech recognition result is semantically understood, and the user's intention is recognized.
The identified intention of the user may or may not depend on image input.
In 203, it is determined whether the intention of the user depends on image input; if not, 204 is performed, and if so, 205 is performed.
At 204, response content is generated based on the speech recognition result, returned to the user, and the flow is ended.
If the intention of the user does not depend on image input, response content can be generated according to the voice recognition result in the existing manner and returned to the user.
It can be seen that the scheme of the application is compatible with, and does not affect, the existing implementation.
In 205, image registration is performed on the acquired first image and second image, where the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the specified shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
Fig. 3 is a schematic view of a first image according to the present application. Fig. 4 is a schematic diagram of a second image according to the present application. As shown in fig. 3 and 4, assume the user wants to ask how to read an English sentence on a certain page of a textbook. The textbook can be opened to that page and placed in the designated shooting area so that the first image can be captured; the user can then issue the voice request "how is this sentence read" while pointing a finger at the position of the English sentence on the page, and the second image can accordingly be captured. The state of the textbook must remain unchanged between the two shots.
Since there is a certain time difference between the first image and the second image, the photographed object (for example, the textbook) may move slightly, so a registration operation on the first image and the second image can be performed to ensure the accuracy of subsequent processing.
At 206, a region of interest of the user in the first image is determined by comparing the first image with the second image.
For the first image and the second image after registration, a difference image of the first image and the second image may be obtained first, as shown in fig. 5, and fig. 5 is a schematic diagram of the difference image according to the present application.
Further, a binary image corresponding to the difference image may be obtained; for example, erosion, dilation, and binarization operations may be applied to the difference image to obtain the corresponding binary image, as shown in fig. 6, which is a schematic diagram of the binary image according to the present application.
The acquired binary image is then analyzed to determine the user pointing position. Preferably, the foreground pixel point in the binary image closest to the central pixel point of the binary image can be determined, and the position of that foreground pixel point taken as the pointing position of the user.
Based on the user pointing position, the user region of interest in the first image may be determined. Preferably, target segmentation may be performed on the first image by a predetermined algorithm based on the user pointing position, resulting in a user region of interest containing the user pointing position. The specific target segmentation method is not limited; for example, segmentation may be achieved through region growing with length and width constraints starting from the user pointing position, or by using a machine learning model trained in advance. Fig. 7 is a schematic diagram of the resulting region of interest of the user according to the present application.
At 207, the user's content of interest is extracted from the user's region of interest.
After the user region of interest is obtained, the content of interest of the user can be extracted therefrom, and the content of interest of the user may comprise: text content, and/or image content, etc. As shown in fig. 7, the extracted content of interest may be the text content "what's this".
At 208, response content is generated according to the speech recognition result and the content of interest of the user, and returned to the user, and the process is ended.
How to generate the response content based on the speech recognition result and the content of interest to the user is not limited.
In the manner shown in fig. 2, a round of voice interaction is completed, and the process shown in fig. 2 may be repeated later when the user inputs a voice request again.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the above method embodiments, user input can be acquired in two dimensions, voice and image, so that a more complete user intention can be obtained, more accurate response content can be generated, and the accuracy of the intelligent voice interaction result is improved.
The above is a description of the method embodiments; the solution of the present application is further described below by means of device embodiments.
Fig. 8 is a schematic diagram of a composition structure of an embodiment of the intelligent voice interaction device according to the present application. As shown in fig. 8, includes: a voice processing unit 801, an image analysis unit 802, and a response generation unit 803.
The voice processing unit 801 is configured to perform voice recognition on a voice request input by a user, obtain a voice recognition result, perform semantic understanding on the voice recognition result, and recognize an intention of the user.
An image analysis unit 802 for extracting, when the intention of the user needs to depend on the image input, the content of interest of the user in the acquired first image, the first image being obtained by photographing an object placed in a specified photographing area by the user.
And the response generating unit 803 is used for generating response content according to the voice recognition result and the content of interest of the user and returning the response content to the user.
After the voice processing unit 801 obtains the voice request input by the user, it may first perform voice recognition in the existing manner to obtain a voice recognition result; it may then perform semantic understanding on the voice recognition result to recognize the intention of the user, which may or may not depend on image input.
If the intention does not depend on image input, the response generating unit 803 may generate response content according to the voice recognition result and return it to the user.
If the intention depends on image input, the image analysis unit 802 may extract the content of interest of the user from the acquired first image, where the first image is obtained by photographing the object placed by the user in the specified shooting area; the response generating unit 803 may then generate response content according to the voice recognition result and the content of interest of the user, and return the response content to the user.
Preferably, the image analysis unit 802 may determine the user region of interest in the first image by comparing the acquired first image and second image, and extract the content of interest of the user from the user region of interest. The first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
Preferably, the image analysis unit 802 may first perform image registration on the acquired first and second images. Since there is a certain time difference between the first image and the second image, the photographed object (for example, the textbook) may move slightly, so registration of the two images can be performed to ensure the accuracy of subsequent processing.
Then, the image analysis unit 802 may obtain a difference image of the first image and the second image, and may obtain a binary image corresponding to the difference image, so as to determine a user pointing position in the binary image, and determine a user region of interest in the first image according to the user pointing position.
Preferably, the image analysis unit 802 may determine a foreground pixel point closest to the central pixel point of the binary image in the binary image, and take the position of the foreground pixel point as the pointing position of the user.
In addition, the image analysis unit 802 may perform object segmentation on the first image by a predetermined algorithm based on the user pointing position, thereby obtaining a user region of interest including the user pointing position.
After acquiring the region of interest of the user, the image analysis unit 802 may extract the content of interest of the user therefrom, and the content of interest of the user may include: text content, and/or image content, etc. Accordingly, the response generation unit 803 may generate response contents according to the voice recognition result and the contents of interest of the user, and return the response contents to the user.
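To make the cooperation of the three units in fig. 8 concrete, the following structural sketch wires stub implementations together for one interaction round; every class name, method, and stubbed back end here is an illustrative placeholder rather than an interface defined by the application.

```python
# Structural sketch of the three-unit device; all internals are stubs for illustration.
from typing import Optional

class VoiceProcessingUnit:
    def process(self, audio: bytes):
        text = self._recognize(audio)                    # stub ASR
        needs_image = "this" in text.lower()             # crude stand-in for intent analysis
        return text, needs_image

    def _recognize(self, audio: bytes) -> str:
        return "how is this question done"               # placeholder recognition result

class ImageAnalysisUnit:
    def extract(self, first_image, second_image) -> str:
        # A real device would run the registration / difference / binarization /
        # segmentation / OCR steps described in the method embodiments above.
        return "stub content of interest"

class ResponseGenerationUnit:
    def respond(self, recognition_result: str, content: Optional[str]) -> str:
        query = f"{recognition_result} | {content}" if content else recognition_result
        return f"answer for: {query}"                    # placeholder retrieval/lookup

# One interaction round, mirroring the flow of fig. 8.
voice, vision, responder = VoiceProcessingUnit(), ImageAnalysisUnit(), ResponseGenerationUnit()
text, needs_image = voice.process(b"...")
content = vision.extract(None, None) if needs_image else None
print(responder.respond(text, content))
```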
For the specific workflow of the apparatus embodiment shown in fig. 8, reference is made to the related description in the foregoing method embodiments, which will not be repeated here.
In the device embodiment, user input can likewise be acquired in two dimensions, voice and image, so that a more complete user intention can be obtained, more accurate response content can be generated, and the accuracy of the intelligent voice interaction result is improved.
Fig. 9 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application. The computer system/server 12 shown in FIG. 9 is intended as an example, and should not be taken as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 9, the computer system/server 12 is in the form of a general purpose computing device. Components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, a bus 18 that connects the various system components, including the memory 28 and the processor 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer system/server 12 and includes both volatile and non-volatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, commonly referred to as a "hard disk drive"). Although not shown in fig. 9, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer system/server 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the computer system/server 12 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 20. As shown in fig. 9, the network adapter 20 communicates with other modules of the computer system/server 12 via the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer system/server 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 16 executes various functional applications and data processing, such as the implementation of the method in the embodiments shown in fig. 1 or 2, by running programs stored in the memory 28.
The application also discloses a computer readable storage medium having stored thereon a computer program which when executed by a processor will implement the method of the embodiments shown in fig. 1 or fig. 2.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method, etc. may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is merely a logical function division, and there may be other manners of division when actually implemented.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is merely a description of the preferred embodiments of the application and is not intended to limit the application; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the application shall fall within the scope of protection of the application.

Claims (15)

1. An intelligent voice interaction method is characterized by comprising the following steps:
performing voice recognition on a voice request input by a user to obtain a voice recognition result;
semantic understanding is carried out on the voice recognition result, and the intention of a user is recognized;
if the intention depends on image input, extracting the content of interest of the user from the acquired first image, which comprises: acquiring a difference image of the first image and a second image, acquiring a binary image corresponding to the difference image, determining a user pointing position in the binary image, determining a user region of interest in the first image according to the user pointing position, and extracting the content of interest of the user from the user region of interest; wherein the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in a designated shooting area, and the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object;
and generating response content according to the voice recognition result and the content of interest of the user, and returning the response content to the user.
2. The method according to claim 1, wherein determining the user pointing position in the binary image comprises:
determining, in the binary image, a foreground pixel point closest to the central pixel point of the binary image, and taking the position of the foreground pixel point as the user pointing position.
3. The method according to claim 1, wherein determining the user region of interest in the first image according to the user pointing position comprises:
performing target segmentation on the first image through a predetermined algorithm based on the user pointing position, to obtain the user region of interest containing the user pointing position.
4. The method according to claim 1, wherein before acquiring the difference image of the first image and the second image, the method further comprises: performing image registration on the first image and the second image.
5. The method according to claim 1, wherein the content of interest of the user comprises: text content, and/or image content.
6. The method according to claim 1, further comprising: if the intention does not depend on image input, generating response content according to the voice recognition result and returning the response content to the user.
7. An intelligent voice interaction device, comprising: a voice processing unit, an image analyzing unit, and a response generating unit;
the voice processing unit is used for carrying out voice recognition on a voice request input by a user to obtain a voice recognition result, carrying out semantic understanding on the voice recognition result and recognizing the intention of the user;
the image analysis unit is used for extracting the content of interest of the user from the acquired first image when the intention depends on image input, which comprises: acquiring a difference image of the first image and a second image, acquiring a binary image corresponding to the difference image, determining a user pointing position in the binary image, determining a user region of interest in the first image according to the user pointing position, and extracting the content of interest of the user from the user region of interest; wherein the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in a designated shooting area, and the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object;
and the response generating unit is used for generating response content according to the voice recognition result and the content of interest of the user and returning the response content to the user.
8. The apparatus according to claim 7, wherein
the image analysis unit determines, in the binary image, a foreground pixel point closest to the central pixel point of the binary image, and takes the position of the foreground pixel point as the user pointing position.
9. The apparatus according to claim 7, wherein
the image analysis unit performs target segmentation on the first image through a predetermined algorithm based on the user pointing position to obtain the user region of interest containing the user pointing position.
10. The apparatus according to claim 7, wherein
the image analysis unit is further configured to perform image registration on the first image and the second image before acquiring the difference image of the first image and the second image.
11. The apparatus according to claim 7, wherein
the content of interest of the user comprises: text content, and/or image content.
12. The apparatus according to claim 7, wherein
the response generating unit is further used for generating response content according to the voice recognition result and returning the response content to the user if the intention does not depend on image input.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-6.
CN201910833270.3A 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium Active CN112542163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910833270.3A CN112542163B (en) 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910833270.3A CN112542163B (en) 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112542163A CN112542163A (en) 2021-03-23
CN112542163B true CN112542163B (en) 2023-10-27

Family

ID=75012231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910833270.3A Active CN112542163B (en) 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112542163B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210070029A (en) * 2019-12-04 2021-06-14 삼성전자주식회사 Device, method, and program for enhancing output content through iterative generation

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101350387B1 (en) * 2012-10-08 2014-01-13 숭실대학교산학협력단 Method for detecting hand using depth information and apparatus thereof
JP2014131195A (en) * 2012-12-28 2014-07-10 Buffalo Inc Information providing system, image display device, information providing method and program
WO2014162740A1 (en) * 2013-04-05 2014-10-09 パナソニック株式会社 Image region correlating device, three-dimensional model generating device, image region correlating method, and image region correlating program
WO2015049233A1 (en) * 2013-10-01 2015-04-09 Ventana Medical Systems, Inc. Line-based image registration and cross-image annotation devices, systems and methods
CN104820987A (en) * 2015-04-30 2015-08-05 中国电子科技集团公司第四十一研究所 Method for detecting scattering performance defect of target based on optical image and microwave image
CN106484257A (en) * 2016-09-22 2017-03-08 广东欧珀移动通信有限公司 Camera control method, device and electronic equipment
CN108009522A (en) * 2017-12-21 2018-05-08 海信集团有限公司 A kind of Approach for road detection, device and terminal
CN108133707A (en) * 2017-11-30 2018-06-08 百度在线网络技术(北京)有限公司 A kind of content share method and system
CN109192204A (en) * 2018-08-31 2019-01-11 广东小天才科技有限公司 A kind of sound control method and smart machine based on smart machine camera
WO2019018061A1 (en) * 2017-07-18 2019-01-24 Microsoft Technology Licensing, Llc Automatic integration of image capture and recognition in a voice-based query to understand intent
CN109409214A (en) * 2018-09-14 2019-03-01 浙江大华技术股份有限公司 The method and apparatus that the target object of a kind of pair of movement is classified
CN109637519A (en) * 2018-11-13 2019-04-16 百度在线网络技术(北京)有限公司 Interactive voice implementation method, device, computer equipment and storage medium
CN109753583A (en) * 2019-01-16 2019-05-14 广东小天才科技有限公司 One kind searching topic method and electronic equipment
CN110008912A (en) * 2019-04-10 2019-07-12 东北大学 A kind of social platform matching process and system based on plants identification

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8433915B2 (en) * 2006-06-28 2013-04-30 Intellisist, Inc. Selective security masking within recorded speech
US8270669B2 (en) * 2008-02-06 2012-09-18 Denso Corporation Apparatus for extracting operating object and apparatus for projecting operating hand
WO2012066557A1 (en) * 2010-11-16 2012-05-24 Hewlett-Packard Development Company L.P. System and method for using information from intuitive multimodal interactions for media tagging
EP2587450B1 (en) * 2011-10-27 2016-08-31 Nordson Corporation Method and apparatus for generating a three-dimensional model of a region of interest using an imaging system
KR102372164B1 (en) * 2015-07-24 2022-03-08 삼성전자주식회사 Image sensing apparatus, object detecting method of thereof and non-transitory computer readable recoding medium
TWI598847B (en) * 2015-10-27 2017-09-11 東友科技股份有限公司 Image jointing method
CN113822094B (en) * 2020-06-02 2024-01-16 苏州科瓴精密机械科技有限公司 Method, system, robot and storage medium for identifying working position based on image

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101350387B1 (en) * 2012-10-08 2014-01-13 숭실대학교산학협력단 Method for detecting hand using depth information and apparatus thereof
JP2014131195A (en) * 2012-12-28 2014-07-10 Buffalo Inc Information providing system, image display device, information providing method and program
WO2014162740A1 (en) * 2013-04-05 2014-10-09 パナソニック株式会社 Image region correlating device, three-dimensional model generating device, image region correlating method, and image region correlating program
WO2015049233A1 (en) * 2013-10-01 2015-04-09 Ventana Medical Systems, Inc. Line-based image registration and cross-image annotation devices, systems and methods
CN104820987A (en) * 2015-04-30 2015-08-05 中国电子科技集团公司第四十一研究所 Method for detecting scattering performance defect of target based on optical image and microwave image
CN106484257A (en) * 2016-09-22 2017-03-08 广东欧珀移动通信有限公司 Camera control method, device and electronic equipment
WO2019018061A1 (en) * 2017-07-18 2019-01-24 Microsoft Technology Licensing, Llc Automatic integration of image capture and recognition in a voice-based query to understand intent
CN108133707A (en) * 2017-11-30 2018-06-08 百度在线网络技术(北京)有限公司 A kind of content share method and system
CN108009522A (en) * 2017-12-21 2018-05-08 海信集团有限公司 A kind of Approach for road detection, device and terminal
CN109192204A (en) * 2018-08-31 2019-01-11 广东小天才科技有限公司 A kind of sound control method and smart machine based on smart machine camera
CN109409214A (en) * 2018-09-14 2019-03-01 浙江大华技术股份有限公司 The method and apparatus that the target object of a kind of pair of movement is classified
CN109637519A (en) * 2018-11-13 2019-04-16 百度在线网络技术(北京)有限公司 Interactive voice implementation method, device, computer equipment and storage medium
CN109753583A (en) * 2019-01-16 2019-05-14 广东小天才科技有限公司 One kind searching topic method and electronic equipment
CN110008912A (en) * 2019-04-10 2019-07-12 东北大学 A kind of social platform matching process and system based on plants identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Multi-Exposure Images of Finger Veins; Wang Chen; China Master's Theses Full-text Database, Information Science and Technology; I138-99 *

Also Published As

Publication number Publication date
CN112542163A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
US11062090B2 (en) Method and apparatus for mining general text content, server, and storage medium
CN107656922B (en) Translation method, translation device, translation terminal and storage medium
CN109034069B (en) Method and apparatus for generating information
US9766868B2 (en) Dynamic source code generation
CN110232340B (en) Method and device for establishing video classification model and video classification
WO2021062990A1 (en) Video segmentation method and apparatus, device, and medium
US9619209B1 (en) Dynamic source code generation
CN109918513B (en) Image processing method, device, server and storage medium
US20190087780A1 (en) System and method to extract and enrich slide presentations from multimodal content through cognitive computing
CN112149663A (en) RPA and AI combined image character extraction method and device and electronic equipment
CN112507090A (en) Method, apparatus, device and storage medium for outputting information
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN109657127B (en) Answer obtaining method, device, server and storage medium
CN112542163B (en) Intelligent voice interaction method, device and storage medium
CN107239209B (en) Photographing search method, device, terminal and storage medium
CN111027533B (en) Click-to-read coordinate transformation method, system, terminal equipment and storage medium
CN112822506A (en) Method and apparatus for analyzing video stream
CN111881900A (en) Corpus generation, translation model training and translation method, apparatus, device and medium
US20200226208A1 (en) Electronic presentation reference marker insertion
CN113807416B (en) Model training method and device, electronic equipment and storage medium
CN115630643A (en) Language model training method and device, electronic equipment and storage medium
CN115017922A (en) Method and device for translating picture, electronic equipment and readable storage medium
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
CN111914850B (en) Picture feature extraction method, device, server and medium
CN110276001B (en) Checking page identification method and device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210511

Address after: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant after: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.

Applicant after: Shanghai Xiaodu Technology Co.,Ltd.

Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.

GR01 Patent grant
GR01 Patent grant