CN112542163B - Intelligent voice interaction method, device and storage medium - Google Patents

Intelligent voice interaction method, device and storage medium Download PDF

Info

Publication number
CN112542163B
CN112542163B CN201910833270.3A CN201910833270A
Authority
CN
China
Prior art keywords
image
user
content
voice recognition
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910833270.3A
Other languages
Chinese (zh)
Other versions
CN112542163A (en)
Inventor
罗荣刚
陆永帅
揭东辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Shanghai Xiaodu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd, Shanghai Xiaodu Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910833270.3A priority Critical patent/CN112542163B/en
Publication of CN112542163A publication Critical patent/CN112542163A/en
Application granted granted Critical
Publication of CN112542163B publication Critical patent/CN112542163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Abstract

The application discloses an intelligent voice interaction method, device, and storage medium, relating to voice technology. The method may include the following steps: performing voice recognition on a voice request input by a user to obtain a voice recognition result; performing semantic understanding on the voice recognition result to recognize the intention of the user; if the intention of the user depends on image input, extracting the content of interest of the user from an acquired first image, the first image being obtained by photographing an object placed by the user in a specified shooting area; and generating response content according to the voice recognition result and the content of interest of the user, and returning the response content to the user. By applying the scheme of the application, the accuracy of the intelligent voice interaction result can be improved.

Description

Intelligent voice interaction method, device and storage medium
[ field of technology ]
The present application relates to computer application technologies, and in particular, to an intelligent voice interaction method, apparatus, and storage medium in voice technology.
[ background Art ]
Education is an important attribute of existing intelligent voice interaction devices. However, interaction around educational content is generally single-modal: only input in the voice dimension is supported, and after the voice request input by the user is obtained, corresponding response content is given, for example, the answer to a question or a corresponding video resource.
However, the amount of information that voice alone can express is limited, and it is difficult to convey the user's complete intention through voice only. For example, a user who wants the answer to a word problem containing a figure can hardly describe it clearly through voice input alone; accordingly, the response content given by the intelligent voice interaction device is likely to be inaccurate, which reduces the accuracy of the intelligent voice interaction result.
[ summary of the application ]
In view of the above, the application provides an intelligent voice interaction method, device and storage medium.
The specific technical scheme is as follows:
an intelligent voice interaction method, comprising:
performing voice recognition on a voice request input by a user to obtain a voice recognition result;
performing semantic understanding on the voice recognition result to recognize the intention of the user;
if the intention depends on image input, extracting the content of interest of the user from an acquired first image, wherein the first image is obtained by photographing an object placed by the user in a specified shooting area;
and generating response content according to the voice recognition result and the content of interest of the user, and returning the response content to the user.
According to a preferred embodiment of the present application, the extracting the content of interest of the user in the acquired first image includes:
comparing the acquired first image with the second image to determine a user region of interest in the first image;
extracting the content of interest of the user from the region of interest of the user;
the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
According to a preferred embodiment of the present application, the determining the region of interest of the user in the first image by comparing the acquired first image with the second image includes:
acquiring a difference image of the first image and the second image;
acquiring a binary image corresponding to the difference image;
determining the pointing position of a user in the binary image;
and determining a user interest area in the first image according to the user pointing position.
According to a preferred embodiment of the present application, the determining the user pointing position in the binary image includes:
determining, in the binary image, a foreground pixel point closest to the central pixel point of the binary image, and taking the position of the foreground pixel point as the pointing position of the user.
According to a preferred embodiment of the present application, the determining the user interest area in the first image according to the user pointing position includes:
performing target segmentation on the first image through a predetermined algorithm based on the user pointing position, to obtain the user region of interest containing the user pointing position.
According to a preferred embodiment of the present application, before acquiring the difference image of the first image and the second image, the method further includes: performing image registration on the first image and the second image.
According to a preferred embodiment of the present application, the content of interest to the user includes: text content, and/or image content.
According to a preferred embodiment of the application, the method further comprises: if the intention does not depend on image input, generating response content according to the voice recognition result and returning the response content to the user.
An intelligent voice interaction device, comprising: a voice processing unit, an image analyzing unit, and a response generating unit;
the voice processing unit is used for carrying out voice recognition on a voice request input by a user to obtain a voice recognition result, carrying out semantic understanding on the voice recognition result and recognizing the intention of the user;
the image analysis unit is used for extracting the content of interest of the user from an acquired first image when the intention depends on image input, wherein the first image is obtained by photographing an object placed by the user in a specified shooting area;
and the response generating unit is used for generating response content according to the voice recognition result and the content of interest of the user and returning the response content to the user.
According to a preferred embodiment of the present application, the image analysis unit determines a user interest area in the first image by comparing the acquired first image and second image, and extracts the user interest content from the user interest area;
the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
According to a preferred embodiment of the present application, the image analysis unit obtains a difference image between the first image and the second image, obtains a binary image corresponding to the difference image, determines a user pointing position in the binary image, and determines a user region of interest in the first image according to the user pointing position.
According to a preferred embodiment of the present application, the image analysis unit determines a foreground pixel point closest to a center pixel point of the binary image in the binary image, and uses a position of the foreground pixel point as the pointing position of the user.
According to a preferred embodiment of the present application, the image analysis unit performs object segmentation on the first image by a predetermined algorithm based on the user pointing position, to obtain the user region of interest including the user pointing position.
According to a preferred embodiment of the application, the image analysis unit is further adapted to perform image registration on the first image and the second image before acquiring the difference image of the first image and the second image.
According to a preferred embodiment of the present application, the content of interest to the user includes: text content, and/or image content.
According to a preferred embodiment of the present application, the response generating unit is further configured to generate response content according to the voice recognition result and return it to the user if the intention does not depend on image input.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of the above.
A computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
As can be seen from the above description, the scheme of the application can acquire user input in two dimensions, voice and image, thereby obtaining a more complete user intention, generating more accurate response content, and improving the accuracy of the intelligent voice interaction result.
[ description of the drawings ]
Fig. 1 is a flowchart of a first embodiment of an intelligent voice interaction method according to the present application.
Fig. 2 is a flowchart of a second embodiment of the intelligent voice interaction method according to the present application.
Fig. 3 is a schematic view of a first image according to the present application.
Fig. 4 is a schematic diagram of a second image according to the present application.
Fig. 5 is a schematic diagram of a difference image according to the present application.
Fig. 6 is a schematic diagram of a binary image according to the present application.
Fig. 7 is a schematic diagram of a region of interest of a user according to the present application.
Fig. 8 is a schematic diagram of a composition structure of an embodiment of the intelligent voice interaction device according to the present application.
Fig. 9 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application.
[ detailed description ]
In order to make the technical solution of the present application more clear and obvious, the solution of the present application will be further described below by referring to the accompanying drawings and examples.
It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
In addition, it should be understood that the term "and/or" herein is merely one association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Fig. 1 is a flowchart of a first embodiment of an intelligent voice interaction method according to the present application. As shown in fig. 1, the following detailed implementation is included.
In 101, voice recognition is performed on a voice request input by a user to obtain a voice recognition result.
At 102, semantic understanding is performed on the speech recognition results to recognize the intent of the user.
In 103, if the intention of the user depends on image input, the content of interest of the user is extracted from the acquired first image, the first image being obtained by photographing an object placed by the user in a specified shooting area.
In 104, response content is generated according to the voice recognition result and the content of interest of the user, and is returned to the user.
The user can perform voice interaction with the intelligent voice interaction device. After a voice request input by the user is acquired, voice recognition can be performed in the existing manner to obtain a voice recognition result. Further, semantic understanding can be performed on the voice recognition result to recognize the intention of the user, which may or may not depend on image input.
How to determine whether the intention depends on image input is not limited. For example, regular expression matching may be adopted, or a machine learning model trained in advance may be used for the determination; such a model can be trained with constructed training samples.
For example, if the voice request is "how is this question done" or "what does this English word mean," it can be determined that image input is needed; if the voice request is "how is the English word for apple pronounced" or "why is the sky blue," it can be determined that image input is not needed.
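As an illustration of the rule-based option mentioned above, the following minimal sketch checks the recognized text against a few trigger patterns. The patterns, function name, and example requests are assumptions for illustration only; the application does not prescribe any particular rule set, and a trained classifier could be used instead.

```python
# Minimal sketch of a rule-based "does this intent depend on image input?" check.
# The trigger patterns below are hypothetical, not taken from the application.
import re

IMAGE_DEPENDENT_PATTERNS = [
    r"\bthis (question|problem|word|sentence|figure)\b",  # refers to something on the desk
    r"\bhow (is|do i do) this\b",
]

def intent_needs_image(recognition_result: str) -> bool:
    """Return True if the recognized request likely refers to content the user points at."""
    text = recognition_result.lower()
    return any(re.search(p, text) for p in IMAGE_DEPENDENT_PATTERNS)

print(intent_needs_image("how is this question done"))   # True
print(intent_needs_image("why is the sky blue"))          # False
```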
If image input is not needed, response content can be generated according to the voice recognition result in the existing manner and returned to the user. For example, a search can be performed in a designated database and response content generated according to the search results; the response content can be a simple text broadcast, video content, or the like.
If the intention of the user depends on image input, the content of interest of the user can be extracted from the acquired first image, the first image being obtained by photographing an object placed by the user in a specified shooting area; response content can then be generated according to the voice recognition result and the content of interest of the user and returned to the user.
Preferably, extracting the content of interest of the user from the acquired first image may include: comparing the acquired first image with an acquired second image to determine a region of interest (ROI) of the user in the first image, and extracting the content of interest of the user from the user region of interest. The first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
For example, if the user wants to ask how to read an English sentence on a certain page of a textbook, the textbook can be opened to that page and placed in the designated shooting area so that the first image can be captured. The user can then issue the voice request "how is this sentence read" while pointing a finger at the position of the English sentence on the page; accordingly, it is determined that the intention of the user depends on image input, and the second image is captured. The state of the textbook must remain unchanged when the second image is captured, that is, the position of the textbook and the page it is opened to must not change between the two shots.
It can be seen that the first image and the second image differ only in whether or not the user's finger-pointing information is contained. In practical applications, the finger may be replaced by another tool, such as a pen, that is, the pen may point at the position of the English word on the page; the present application is not limited in this respect.
Preferably, image registration may first be performed on the acquired first and second images. Since there is a certain time difference between the first image and the second image, the photographed object (for example, the textbook) may move slightly, so a registration operation on the first image and the second image can be performed to ensure the accuracy of subsequent processing. The specific manner is not limited; features such as SIFT (Scale-Invariant Feature Transform) operators or corner points may be used for registration.
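As one possible implementation of this registration step (a sketch under the assumption that OpenCV's SIFT implementation is available as cv2.SIFT_create, i.e. opencv-python 4.4 or later), the second image can be warped onto the first using matched SIFT features. The ratio-test value and RANSAC threshold are illustrative choices, not values given by the application.

```python
# Sketch: feature-based registration of the second image onto the first using SIFT.
import cv2
import numpy as np

def register(first_img, second_img):
    """Return the second image warped into the coordinate frame of the first image."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(first_img, None)
    kp2, des2 = sift.detectAndCompute(second_img, None)

    # Match descriptors of the second image against the first, with Lowe's ratio test.
    matches = cv2.BFMatcher().knnMatch(des2, des1, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    homography, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = first_img.shape[:2]
    return cv2.warpPerspective(second_img, homography, (w, h))
```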
Then, a difference image of the first image and the second image is obtained, that is, the values of corresponding pixel points in the first image and the second image are subtracted to obtain the difference image; the difference is introduced by the user's pointing.
Further, a binary image corresponding to the difference image may be obtained; for example, erosion, dilation, and binarization operations may be applied to the difference image to obtain the corresponding binary image. The binary image contains only foreground pixel points with a value of 1 and background pixel points with a value of 0.
The acquired binary image is then analyzed to determine the user pointing position. Preferably, the foreground pixel point in the binary image closest to the central pixel point of the binary image can be determined, and the position of that foreground pixel point taken as the pointing position of the user.
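A sketch of these three steps, assuming grayscale images of equal size, is given below; the binarization threshold and the 5x5 morphological kernel are placeholder values rather than parameters specified by the application.

```python
# Sketch: difference image -> erosion/dilation + binarization -> pointing position.
import cv2
import numpy as np

def locate_pointing_position(first_gray, second_gray):
    """Return the (x, y) foreground pixel closest to the image center, or None."""
    diff = cv2.absdiff(second_gray, first_gray)          # difference caused by the finger/pen

    _, binary = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.dilate(cv2.erode(binary, kernel), kernel)  # suppress small noise

    ys, xs = np.nonzero(binary)                          # foreground pixel coordinates
    if len(xs) == 0:
        return None
    cy, cx = binary.shape[0] / 2.0, binary.shape[1] / 2.0
    nearest = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
    return int(xs[nearest]), int(ys[nearest])
```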
Based on the user pointing position, the user region of interest in the first image may be determined. Preferably, target segmentation may be performed on the first image by a predetermined algorithm based on the user pointing position, resulting in a user region of interest containing the user pointing position. The specific target segmentation method is not limited; for example, segmentation may be achieved through region growing with length and width constraints starting from the user pointing position, or by using a machine learning model trained in advance.
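As a sketch of the region-growing option (assuming dark text or figures on a light page; the growth step and maximum width/height are illustrative limits, not values from the application), a window around the pointing position can be expanded until no new ink-like pixels are gained or a size limit is reached:

```python
# Sketch: grow a rectangular region of interest around the pointing position.
import cv2
import numpy as np

def grow_roi(first_gray, point, max_w=400, max_h=200, step=10):
    """Expand a box around `point` until its ink-pixel count stops growing or limits are hit."""
    _, ink = cv2.threshold(first_gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)  # dark-on-light mask
    x, y = point
    x0, y0, x1, y1 = x - step, y - step, x + step, y + step
    prev_count = -1
    while True:
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, first_gray.shape[1]), min(y1, first_gray.shape[0])
        count = int(np.count_nonzero(ink[y0:y1, x0:x1]))
        too_big = (x1 - x0) >= max_w or (y1 - y0) >= max_h
        if count == prev_count or too_big:
            break
        prev_count = count
        x0, y0, x1, y1 = x0 - step, y0 - step, x1 + step, y1 + step
    return first_gray[y0:y1, x0:x1]   # the user region of interest as an image crop
```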
After the region of interest of the user is obtained, the content of interest of the user can be extracted therefrom, and the content of interest of the user can comprise: text content, and/or image content, etc.
If the user region of interest contains only text content, the text content can be used as the content of interest of the user, and response content can then be generated according to the voice recognition result and the content of interest and returned to the user. For example, if the voice recognition result is "what does this English word mean" and the content of interest is "twelve," the definition of that English word can be displayed and broadcast to the user.
If the user region of interest contains only image content, the image content can be used as the content of interest of the user, and response content can likewise be generated according to the voice recognition result and the content of interest and returned to the user. For example, if the voice recognition result is "how is this figure expressed in English" and the content of interest is a trapezoid, the English word corresponding to the trapezoid can be displayed and broadcast to the user.
If the user region of interest contains both text content and image content, both can be used as the content of interest of the user, and response content can again be generated according to the voice recognition result and the content of interest and returned to the user. For example, if the voice recognition result is "how is this question done" and the content of interest is a word problem containing both image content and text content, the answer to that word problem can be displayed and broadcast to the user.
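The application does not name an OCR engine or an answer source; as a hedged sketch, text content could be pulled from the region of interest with pytesseract (an assumed dependency) and combined with the voice recognition result into a retrieval query:

```python
# Sketch: extract text content of interest from the ROI and compose a lookup query.
import pytesseract  # assumed OCR dependency; any OCR engine could be substituted

def extract_content_of_interest(roi_image):
    """Return the text (possibly empty) and the image crop found in the region of interest."""
    text = pytesseract.image_to_string(roi_image).strip()
    return {"text": text or None, "image": roi_image}

def compose_query(recognition_result, content):
    """Join the spoken request with the recognized text, if any, for downstream retrieval."""
    if content["text"]:
        return f"{recognition_result} | {content['text']}"   # e.g. "...mean | twelve"
    return recognition_result
```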
To facilitate interaction with the user, existing intelligent voice interaction devices are usually provided with a screen. To implement the scheme of the present application, a camera is also required on the device, and the camera must be able to capture an image of the designated shooting area. For example, when the device is placed on a table, the designated shooting area can be the desktop area in front of and below the device; if the camera cannot capture the corresponding area, its orientation can be adjusted by means of a camera steering tool or the like.
Based on the above description, fig. 2 is a flowchart of a second embodiment of the intelligent voice interaction method according to the present application. As shown in fig. 2, the following detailed implementation is included.
In 201, voice recognition is performed on a voice request input by a user to obtain a voice recognition result.
In 202, the speech recognition result is semantically understood, and the user's intention is recognized.
The identified intention of the user may or may not depend on image input.
In 203, it is determined whether the intention of the user depends on image input; if not, 204 is performed, and if so, 205 is performed.
At 204, response content is generated based on the speech recognition result, returned to the user, and the flow is ended.
If the intention of the user does not depend on image input, response content can be generated according to the voice recognition result in the existing manner and returned to the user.
It can be seen that the scheme of the application is compatible with, and does not affect, the existing implementation.
In 205, image registration is performed on the acquired first image and second image, where the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the specified shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
Fig. 3 is a schematic view of a first image according to the present application. Fig. 4 is a schematic diagram of a second image according to the present application. As shown in fig. 3 and 4, assume the user wants to ask how to read an English sentence on a certain page of a textbook. The textbook can be opened to that page and placed in the designated shooting area so that the first image can be captured; the user can then issue the voice request "how is this sentence read" while pointing a finger at the position of the English sentence on the page, and the second image can accordingly be captured. The state of the textbook must remain unchanged between the two shots.
Since there is a certain time difference between the first image and the second image, the photographed object (for example, the textbook) may move slightly, so a registration operation on the first image and the second image can be performed to ensure the accuracy of subsequent processing.
At 206, a region of interest of the user in the first image is determined by comparing the first image with the second image.
For the first image and the second image after registration, a difference image of the first image and the second image may be obtained first, as shown in fig. 5, and fig. 5 is a schematic diagram of the difference image according to the present application.
Further, a binary image corresponding to the difference image may be obtained; for example, erosion, dilation, and binarization operations may be applied to the difference image to obtain the corresponding binary image, as shown in fig. 6, which is a schematic diagram of the binary image according to the present application.
The acquired binary image is then analyzed to determine the user pointing position. Preferably, the foreground pixel point in the binary image closest to the central pixel point of the binary image can be determined, and the position of that foreground pixel point taken as the pointing position of the user.
Based on the user pointing position, the user region of interest in the first image may be determined. Preferably, target segmentation may be performed on the first image by a predetermined algorithm based on the user pointing position, resulting in a user region of interest containing the user pointing position. The specific target segmentation method is not limited; for example, segmentation may be achieved through region growing with length and width constraints starting from the user pointing position, or by using a machine learning model trained in advance. Fig. 7 is a schematic diagram of the resulting region of interest of the user according to the present application.
At 207, the user's content of interest is extracted from the user's region of interest.
After the user region of interest is obtained, the content of interest of the user can be extracted therefrom, and the content of interest of the user may comprise: text content, and/or image content, etc. As shown in fig. 7, the extracted content of interest may be the text content "what's this".
At 208, response content is generated according to the speech recognition result and the content of interest of the user, and returned to the user, and the process is ended.
How to generate the response content based on the speech recognition result and the content of interest to the user is not limited.
In the manner shown in fig. 2, a round of voice interaction is completed, and the process shown in fig. 2 may be repeated later when the user inputs a voice request again.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the above method embodiments, user input can be acquired in two dimensions, voice and image, so that a more complete user intention can be obtained, more accurate response content can be generated, and the accuracy of the intelligent voice interaction result is improved.
The above is a description of the method embodiments; the solution of the present application is further described below by means of device embodiments.
Fig. 8 is a schematic diagram of a composition structure of an embodiment of the intelligent voice interaction device according to the present application. As shown in fig. 8, includes: a voice processing unit 801, an image analysis unit 802, and a response generation unit 803.
The voice processing unit 801 is configured to perform voice recognition on a voice request input by a user, obtain a voice recognition result, perform semantic understanding on the voice recognition result, and recognize an intention of the user.
An image analysis unit 802 for extracting, when the intention of the user needs to depend on the image input, the content of interest of the user in the acquired first image, the first image being obtained by photographing an object placed in a specified photographing area by the user.
And the response generating unit 803 is used for generating response content according to the voice recognition result and the content of interest of the user and returning the response content to the user.
After the voice processing unit 801 obtains the voice request input by the user, it may first perform voice recognition in the existing manner to obtain a voice recognition result; it may then perform semantic understanding on the voice recognition result to recognize the intention of the user, which may or may not depend on image input.
If the intention does not depend on image input, the response generating unit 803 may generate response content according to the voice recognition result and return it to the user.
If the intention depends on image input, the image analysis unit 802 may extract the content of interest of the user from the acquired first image, where the first image is obtained by photographing the object placed by the user in the specified shooting area; the response generating unit 803 may then generate response content according to the voice recognition result and the content of interest of the user, and return the response content to the user.
Preferably, the image analysis unit 802 may determine the user region of interest in the first image by comparing the acquired first image and second image, and extract the content of interest of the user from the user region of interest. The first image and the second image are obtained by photographing the same object, in the same state, placed by the user in the shooting area; the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object.
Preferably, the image analysis unit 802 may first perform image registration on the acquired first and second images. Since there is a certain time difference between the first image and the second image, the photographed object (for example, the textbook) may move slightly, so registration of the two images can be performed to ensure the accuracy of subsequent processing.
Then, the image analysis unit 802 may obtain a difference image of the first image and the second image, and may obtain a binary image corresponding to the difference image, so as to determine a user pointing position in the binary image, and determine a user region of interest in the first image according to the user pointing position.
Preferably, the image analysis unit 802 may determine a foreground pixel point closest to the central pixel point of the binary image in the binary image, and take the position of the foreground pixel point as the pointing position of the user.
In addition, the image analysis unit 802 may perform object segmentation on the first image by a predetermined algorithm based on the user pointing position, thereby obtaining a user region of interest including the user pointing position.
After acquiring the region of interest of the user, the image analysis unit 802 may extract the content of interest of the user therefrom, and the content of interest of the user may include: text content, and/or image content, etc. Accordingly, the response generation unit 803 may generate response contents according to the voice recognition result and the contents of interest of the user, and return the response contents to the user.
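To make the cooperation of the three units in fig. 8 concrete, the following structural sketch wires stub implementations together for one interaction round; every class name, method, and stubbed back end here is an illustrative placeholder rather than an interface defined by the application.

```python
# Structural sketch of the three-unit device; all internals are stubs for illustration.
from typing import Optional

class VoiceProcessingUnit:
    def process(self, audio: bytes):
        text = self._recognize(audio)                    # stub ASR
        needs_image = "this" in text.lower()             # crude stand-in for intent analysis
        return text, needs_image

    def _recognize(self, audio: bytes) -> str:
        return "how is this question done"               # placeholder recognition result

class ImageAnalysisUnit:
    def extract(self, first_image, second_image) -> str:
        # A real device would run the registration / difference / binarization /
        # segmentation / OCR steps described in the method embodiments above.
        return "stub content of interest"

class ResponseGenerationUnit:
    def respond(self, recognition_result: str, content: Optional[str]) -> str:
        query = f"{recognition_result} | {content}" if content else recognition_result
        return f"answer for: {query}"                    # placeholder retrieval/lookup

# One interaction round, mirroring the flow of fig. 8.
voice, vision, responder = VoiceProcessingUnit(), ImageAnalysisUnit(), ResponseGenerationUnit()
text, needs_image = voice.process(b"...")
content = vision.extract(None, None) if needs_image else None
print(responder.respond(text, content))
```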
For the specific workflow of the apparatus embodiment shown in fig. 8, reference is made to the related description in the foregoing method embodiments, which will not be repeated here.
In the device embodiment, user input can likewise be acquired in two dimensions, voice and image, so that a more complete user intention can be obtained, more accurate response content can be generated, and the accuracy of the intelligent voice interaction result is improved.
Fig. 9 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present application. The computer system/server 12 shown in FIG. 9 is intended as an example, and should not be taken as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 9, the computer system/server 12 is in the form of a general purpose computing device. Components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, a bus 18 that connects the various system components, including the memory 28 and the processor 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer system/server 12 and includes both volatile and non-volatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, commonly referred to as a "hard disk drive"). Although not shown in fig. 9, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer system/server 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the computer system/server 12 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 20. As shown in fig. 9, the network adapter 20 communicates with other modules of the computer system/server 12 via the bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer system/server 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 16 executes various functional applications and data processing, such as the implementation of the method in the embodiments shown in fig. 1 or 2, by running programs stored in the memory 28.
The application also discloses a computer readable storage medium having stored thereon a computer program which when executed by a processor will implement the method of the embodiments shown in fig. 1 or fig. 2.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method, etc. may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is merely a logical function division, and there may be other manners of division when actually implemented.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is merely a description of the preferred embodiments of the application and is not intended to limit the application; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the application shall fall within the scope of protection of the application.

Claims (15)

1. An intelligent voice interaction method is characterized by comprising the following steps:
performing voice recognition on a voice request input by a user to obtain a voice recognition result;
semantic understanding is carried out on the voice recognition result, and the intention of a user is recognized;
if the intention depends on image input, extracting the content of interest of the user from the acquired first image, which comprises: acquiring a difference image of the first image and a second image, acquiring a binary image corresponding to the difference image, determining a user pointing position in the binary image, determining a user region of interest in the first image according to the user pointing position, and extracting the content of interest of the user from the user region of interest; wherein the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in a designated shooting area, and the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object;
and generating response content according to the voice recognition result and the content of interest of the user, and returning the response content to the user.
2. The method according to claim 1, wherein determining the user pointing position in the binary image comprises:
determining, in the binary image, a foreground pixel point closest to the central pixel point of the binary image, and taking the position of the foreground pixel point as the user pointing position.
3. The method according to claim 1, wherein determining the user region of interest in the first image according to the user pointing position comprises:
performing target segmentation on the first image through a predetermined algorithm based on the user pointing position, to obtain the user region of interest containing the user pointing position.
4. The method according to claim 1, wherein before acquiring the difference image of the first image and the second image, the method further comprises: performing image registration on the first image and the second image.
5. The method according to claim 1, wherein the content of interest of the user comprises: text content, and/or image content.
6. The method according to claim 1, further comprising: if the intention does not depend on image input, generating response content according to the voice recognition result and returning the response content to the user.
7. An intelligent voice interaction device, comprising: a voice processing unit, an image analyzing unit, and a response generating unit;
the voice processing unit is used for carrying out voice recognition on a voice request input by a user to obtain a voice recognition result, carrying out semantic understanding on the voice recognition result and recognizing the intention of the user;
the image analysis unit is used for extracting the content of interest of the user from the acquired first image when the intention depends on image input, which comprises: acquiring a difference image of the first image and a second image, acquiring a binary image corresponding to the difference image, determining a user pointing position in the binary image, determining a user region of interest in the first image according to the user pointing position, and extracting the content of interest of the user from the user region of interest; wherein the first image and the second image are obtained by photographing the same object, in the same state, placed by the user in a designated shooting area, and the difference between the first image and the second image is that the first image does not contain pointing information of the user, while the second image contains the user's pointing information for any area of the object;
and the response generating unit is used for generating response content according to the voice recognition result and the content of interest of the user and returning the response content to the user.
8. The apparatus according to claim 7, wherein
the image analysis unit determines, in the binary image, a foreground pixel point closest to the central pixel point of the binary image, and takes the position of the foreground pixel point as the user pointing position.
9. The apparatus according to claim 7, wherein
the image analysis unit performs target segmentation on the first image through a predetermined algorithm based on the user pointing position to obtain the user region of interest containing the user pointing position.
10. The apparatus according to claim 7, wherein
the image analysis unit is further configured to perform image registration on the first image and the second image before acquiring the difference image of the first image and the second image.
11. The apparatus according to claim 7, wherein
the content of interest of the user comprises: text content, and/or image content.
12. The apparatus according to claim 7, wherein
the response generating unit is further used for generating response content according to the voice recognition result and returning the response content to the user if the intention does not depend on image input.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-6.
CN201910833270.3A 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium Active CN112542163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910833270.3A CN112542163B (en) 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910833270.3A CN112542163B (en) 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112542163A CN112542163A (en) 2021-03-23
CN112542163B true CN112542163B (en) 2023-10-27

Family

ID=75012231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910833270.3A Active CN112542163B (en) 2019-09-04 2019-09-04 Intelligent voice interaction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112542163B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210070029A (en) * 2019-12-04 2021-06-14 삼성전자주식회사 Device, method, and program for enhancing output content through iterative generation

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101350387B1 (en) * 2012-10-08 2014-01-13 숭실대학교산학협력단 Method for detecting hand using depth information and apparatus thereof
JP2014131195A (en) * 2012-12-28 2014-07-10 Buffalo Inc Information providing system, image display device, information providing method and program
WO2014162740A1 (en) * 2013-04-05 2014-10-09 パナソニック株式会社 Image region correlating device, three-dimensional model generating device, image region correlating method, and image region correlating program
WO2015049233A1 (en) * 2013-10-01 2015-04-09 Ventana Medical Systems, Inc. Line-based image registration and cross-image annotation devices, systems and methods
CN104820987A (en) * 2015-04-30 2015-08-05 中国电子科技集团公司第四十一研究所 Method for detecting scattering performance defect of target based on optical image and microwave image
CN106484257A (en) * 2016-09-22 2017-03-08 广东欧珀移动通信有限公司 Camera control method, device and electronic equipment
CN108009522A (en) * 2017-12-21 2018-05-08 海信集团有限公司 A kind of Approach for road detection, device and terminal
CN108133707A (en) * 2017-11-30 2018-06-08 百度在线网络技术(北京)有限公司 A kind of content share method and system
CN109192204A (en) * 2018-08-31 2019-01-11 广东小天才科技有限公司 A kind of sound control method and smart machine based on smart machine camera
WO2019018061A1 (en) * 2017-07-18 2019-01-24 Microsoft Technology Licensing, Llc Automatic integration of image capture and recognition in a voice-based query to understand intent
CN109409214A (en) * 2018-09-14 2019-03-01 浙江大华技术股份有限公司 The method and apparatus that the target object of a kind of pair of movement is classified
CN109637519A (en) * 2018-11-13 2019-04-16 百度在线网络技术(北京)有限公司 Interactive voice implementation method, device, computer equipment and storage medium
CN109753583A (en) * 2019-01-16 2019-05-14 广东小天才科技有限公司 One kind searching topic method and electronic equipment
CN110008912A (en) * 2019-04-10 2019-07-12 东北大学 A kind of social platform matching process and system based on plants identification

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8433915B2 (en) * 2006-06-28 2013-04-30 Intellisist, Inc. Selective security masking within recorded speech
US8270669B2 (en) * 2008-02-06 2012-09-18 Denso Corporation Apparatus for extracting operating object and apparatus for projecting operating hand
WO2012066557A1 (en) * 2010-11-16 2012-05-24 Hewlett-Packard Development Company L.P. System and method for using information from intuitive multimodal interactions for media tagging
EP2587450B1 (en) * 2011-10-27 2016-08-31 Nordson Corporation Method and apparatus for generating a three-dimensional model of a region of interest using an imaging system
KR102372164B1 (en) * 2015-07-24 2022-03-08 삼성전자주식회사 Image sensing apparatus, object detecting method of thereof and non-transitory computer readable recoding medium
TWI598847B (en) * 2015-10-27 2017-09-11 東友科技股份有限公司 Image jointing method
CN113822094B (en) * 2020-06-02 2024-01-16 苏州科瓴精密机械科技有限公司 Method, system, robot and storage medium for identifying working position based on image

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101350387B1 (en) * 2012-10-08 2014-01-13 숭실대학교산학협력단 Method for detecting hand using depth information and apparatus thereof
JP2014131195A (en) * 2012-12-28 2014-07-10 Buffalo Inc Information providing system, image display device, information providing method and program
WO2014162740A1 (en) * 2013-04-05 2014-10-09 パナソニック株式会社 Image region correlating device, three-dimensional model generating device, image region correlating method, and image region correlating program
WO2015049233A1 (en) * 2013-10-01 2015-04-09 Ventana Medical Systems, Inc. Line-based image registration and cross-image annotation devices, systems and methods
CN104820987A (en) * 2015-04-30 2015-08-05 中国电子科技集团公司第四十一研究所 Method for detecting scattering performance defect of target based on optical image and microwave image
CN106484257A (en) * 2016-09-22 2017-03-08 广东欧珀移动通信有限公司 Camera control method, device and electronic equipment
WO2019018061A1 (en) * 2017-07-18 2019-01-24 Microsoft Technology Licensing, Llc Automatic integration of image capture and recognition in a voice-based query to understand intent
CN108133707A (en) * 2017-11-30 2018-06-08 百度在线网络技术(北京)有限公司 A kind of content share method and system
CN108009522A (en) * 2017-12-21 2018-05-08 海信集团有限公司 A kind of Approach for road detection, device and terminal
CN109192204A (en) * 2018-08-31 2019-01-11 广东小天才科技有限公司 A kind of sound control method and smart machine based on smart machine camera
CN109409214A (en) * 2018-09-14 2019-03-01 浙江大华技术股份有限公司 The method and apparatus that the target object of a kind of pair of movement is classified
CN109637519A (en) * 2018-11-13 2019-04-16 百度在线网络技术(北京)有限公司 Interactive voice implementation method, device, computer equipment and storage medium
CN109753583A (en) * 2019-01-16 2019-05-14 广东小天才科技有限公司 One kind searching topic method and electronic equipment
CN110008912A (en) * 2019-04-10 2019-07-12 东北大学 A kind of social platform matching process and system based on plants identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Multi-Exposure Images of Finger Veins; Wang Chen; China Master's Theses Full-text Database, Information Science and Technology; I138-99 *

Also Published As

Publication number Publication date
CN112542163A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
US11062090B2 (en) Method and apparatus for mining general text content, server, and storage medium
CN107656922B (en) Translation method, translation device, translation terminal and storage medium
CN109034069B (en) Method and apparatus for generating information
US9766868B2 (en) Dynamic source code generation
CN110232340B (en) Method and device for establishing video classification model and video classification
WO2021062990A1 (en) Video segmentation method and apparatus, device, and medium
US9619209B1 (en) Dynamic source code generation
CN109918513B (en) Image processing method, device, server and storage medium
US20190087780A1 (en) System and method to extract and enrich slide presentations from multimodal content through cognitive computing
CN112149663A (en) RPA and AI combined image character extraction method and device and electronic equipment
CN112507090A (en) Method, apparatus, device and storage medium for outputting information
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN109657127B (en) Answer obtaining method, device, server and storage medium
CN112542163B (en) Intelligent voice interaction method, device and storage medium
CN107239209B (en) Photographing search method, device, terminal and storage medium
CN111027533B (en) Click-to-read coordinate transformation method, system, terminal equipment and storage medium
CN112822506A (en) Method and apparatus for analyzing video stream
CN111881900A (en) Corpus generation, translation model training and translation method, apparatus, device and medium
US20200226208A1 (en) Electronic presentation reference marker insertion
CN113807416B (en) Model training method and device, electronic equipment and storage medium
CN115630643A (en) Language model training method and device, electronic equipment and storage medium
CN115017922A (en) Method and device for translating picture, electronic equipment and readable storage medium
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
CN111914850B (en) Picture feature extraction method, device, server and medium
CN110276001B (en) Checking page identification method and device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210511

Address after: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant after: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.

Applicant after: Shanghai Xiaodu Technology Co.,Ltd.

Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.

GR01 Patent grant
GR01 Patent grant