CN108334498A

CN108334498A - Method and apparatus for handling voice request

Info

Publication number: CN108334498A
Application number: CN201810124300.9A
Authority: CN
Inventors: 杨鹏; 范冰冰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd; Shanghai Xiaodu Technology Co Ltd
Priority date: 2018-02-07
Filing date: 2018-02-07
Publication date: 2018-07-27

Abstract

Embodiments of the present application disclose a method and apparatus for handling a voice request. One specific implementation of the method includes: in response to receiving a voice request, parsing the voice request; in response to the parsing result of the voice request including an intent to obtain related information of a specified object, obtaining an image containing the specified object; determining the related information of the specified object based on the image; generating voice response information based on the parsing result of the voice request and the related information of the specified object; and outputting the voice response information. By obtaining an image containing the specified object, the embodiments can determine the related information of the specified object and thereby output voice response information more accurately.

Description

Method and apparatus for handling voice request

Technical field

The present application relates to the field of computer technology, and more particularly to a method and apparatus for handling a voice request.

Background

Intelligent electronic devices such as smart speakers typically interact with users only by voice in order to meet the users' needs. For example, if a user asks "What color is the sky?", the device can answer "Blue". If the user's description of an object is not specific enough, the device cannot obtain enough information to judge the user's need and usually cannot provide an accurate response.

Summary of the invention

Embodiments of the present application propose a method and apparatus for handling a voice request.

In a first aspect, an embodiment of the present application provides a method for handling a voice request, including: in response to receiving a voice request, parsing the voice request; in response to the parsing result of the voice request including an intent to obtain related information of a specified object, obtaining an image containing the specified object; determining the related information of the specified object based on the image; generating voice response information based on the parsing result of the voice request and the related information of the specified object; and outputting the voice response information.

In some embodiments, obtaining the image containing the specified object in response to the intent parsing result of the voice request including the intent to obtain related information of the specified object includes: in response to the intent parsing result of the voice request including the intent to obtain related information of the specified object, sending an instruction to capture an image containing the specified object; and obtaining the captured image containing the specified object.

In some embodiments, obtaining the captured image containing the specified object includes: determining location information of the user who issued the voice request; and obtaining an image containing the specified object that was captured based on the user's location information.

In some embodiments, determining the location information of the user who issued the voice request includes: determining that location information by sound source localization.

In some embodiments, performing intent parsing on the voice request includes: converting the voice request into a text sentence; determining whether the text sentence includes a preset keyword; and in response to determining that the text sentence includes the preset keyword, determining that the intent parsing result includes the intent to obtain related information of the specified object.

In some embodiments, determining the related information of the specified object based on the image includes: in response to determining that the parsing result of the voice request includes a type keyword indicating the type of the specified object, determining whether an image matching the image of the specified object exists in the image database of the type indicated by the type keyword; if such a matching image exists in that database, determining the information corresponding to the matching image as the related information of the specified object; if no such matching image exists in that database, searching the total image database for an image matching the image of the specified object and determining the information corresponding to the matching image as the related information of the specified object; and in response to determining that the parsing result of the voice request does not include a type keyword indicating the type of the specified object, searching the total image database for an image matching the image of the specified object and determining the information corresponding to the matching image as the related information of the specified object.
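As a non-limiting illustration, the lookup order described above (type-specific image database first, total image database as the fallback) might be sketched as follows. The database contents, the string "signatures" standing in for images, and the function name `lookup_related_info` are all hypothetical; a real system would match on image features rather than exact keys.

```python
from typing import Dict, Optional

# Hypothetical data: each database maps an image signature to related information.
# Exact-key lookup stands in for real image matching to keep the sketch short.
TYPE_DATABASES: Dict[str, Dict[str, str]] = {
    "flower": {"sig-rose": "rose", "sig-tulip": "tulip"},
    "dog": {"sig-husky": "husky"},
}
TOTAL_DATABASE: Dict[str, str] = {
    "sig-rose": "rose",
    "sig-tulip": "tulip",
    "sig-husky": "husky",
    "sig-mug": "ceramic mug",
}


def lookup_related_info(image_signature: str, type_keyword: Optional[str]) -> Optional[str]:
    """Search the type-specific database first, then fall back to the total database."""
    if type_keyword is not None:
        typed_db = TYPE_DATABASES.get(type_keyword, {})
        if image_signature in typed_db:
            return typed_db[image_signature]
    # Either no type keyword was parsed, or the typed database had no match.
    return TOTAL_DATABASE.get(image_signature)
```

Searching the smaller typed database first narrows the candidates when the request names a type (for example "what flower is this"), while the fallback keeps the lookup from failing when the stated type is wrong or missing.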

In a second aspect, an embodiment of the present application provides an apparatus for handling a voice request, including: a parsing unit configured to parse a voice request in response to receiving the voice request; an acquiring unit configured to obtain an image containing a specified object in response to the parsing result of the voice request including an intent to obtain related information of the specified object; a determining unit configured to determine the related information of the specified object based on the image; a generating unit configured to generate voice response information based on the parsing result of the voice request and the related information of the specified object; and an output unit configured to output the voice response information.

In some embodiments, the acquiring unit is further configured to obtain the image containing the specified object as follows: in response to the intent parsing result of the voice request including the intent to obtain related information of the specified object, sending an instruction to capture an image containing the specified object; and obtaining the captured image containing the specified object.

In some embodiments, the acquiring unit is further configured to obtain the captured image containing the specified object as follows: determining location information of the user who issued the voice request; and obtaining an image containing the specified object that was captured based on the user's location information.

In some embodiments, the acquiring unit is further configured to determine the location information of the user who issued the voice request as follows: determining that location information by sound source localization.

In some embodiments, the parsing unit is further configured to perform intent parsing on the voice request as follows: converting the voice request into a text sentence; determining whether the text sentence includes a preset keyword; and in response to determining that the text sentence includes the preset keyword, determining that the intent parsing result includes the intent to obtain related information of the specified object.

In some embodiments, the determining unit is further configured to determine the related information of the specified object as follows: in response to determining that the parsing result of the voice request includes a type keyword indicating the type of the specified object, determining whether an image matching the image of the specified object exists in the image database of the type indicated by the type keyword; if such a matching image exists in that database, determining the information corresponding to the matching image as the related information of the specified object; if no such matching image exists in that database, searching the total image database for an image matching the image of the specified object and determining the information corresponding to the matching image as the related information of the specified object; and in response to determining that the parsing result of the voice request does not include a type keyword indicating the type of the specified object, searching the total image database for an image matching the image of the specified object and determining the information corresponding to the matching image as the related information of the specified object.

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device for storing one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the method for handling a voice request.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of any embodiment of the method for handling a voice request.

The method and apparatus for handling a voice request provided by the embodiments of the present application first parse a voice request in response to receiving it. Then, in response to the parsing result of the voice request including an intent to obtain related information of a specified object, an image containing the specified object is obtained. Next, the related information of the specified object is determined based on the image. Voice response information is then generated based on the parsing result of the voice request and the related information of the specified object. Finally, the voice response information is output. By obtaining an image containing the specified object, the method can determine the related information of the specified object, judge the user's intent more accurately, and provide voice response information that better meets the user's needs.

Description of the drawings

Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

Fig. 1 is an exemplary system architecture diagram to which the present application can be applied;

Fig. 2 is a flowchart of one embodiment of the method for handling a voice request according to the present application;

Fig. 3 is a schematic diagram of an application scenario of the method for handling a voice request according to the present application;

Fig. 4 is a flowchart of another embodiment of the method for handling a voice request according to the present application;

Fig. 5 is a flowchart of another embodiment of the method for handling a voice request according to the present application;

Fig. 6 is a flowchart of another embodiment of the method for handling a voice request according to the present application;

Fig. 7 is a structural schematic diagram of one embodiment of the apparatus for handling a voice request according to the present application;

Fig. 8 is a structural schematic diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.

Detailed description of the embodiments

The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the relevant invention.

It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method or apparatus for handling a voice request of the present application can be applied.

As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 in order to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as speech processing applications, image processing applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.

The terminal devices 101, 102, 103 may be various electronic devices that have a camera and support voice interaction, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, and desktop computers.

The server 105 may be a server that provides various services, for example a background server that supports the determination of related information of specified objects for the terminal devices 101, 102, 103. The background server may analyze and otherwise process received data such as voice requests, and feed the processing result (for example, voice response information) back to the terminal device.

It should be noted that the method for handling a voice request provided by the embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103; correspondingly, the apparatus for handling a voice request may be provided in the server 105 or in the terminal devices 101, 102, 103.

It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. Any number of terminal devices, networks, and servers may be provided as the implementation requires.

With continued reference to Fig. 2, a flow 200 of one embodiment of the method for handling a voice request according to the present application is shown. The method for handling a voice request includes the following steps:

Step 201, in response to receiving a voice request, parsing the voice request.

In this embodiment, the electronic device on which the method for handling a voice request runs (for example, the server shown in Fig. 1) can respond after receiving a voice request by parsing the voice request. A voice request is a request sent in the form of voice.

In practice, the electronic device may be a terminal device, in which case it can receive a voice request that the user issues by speaking or in a similar form. The electronic device may also be a server, in which case it can receive the voice request that a terminal device forwards after obtaining it from the user.

In a specific application scenario, a user may point at a flower nearby and ask "What flower is this?". The voice request may then include an audio signal whose content is "What flower is this?".

Techniques such as speech recognition may be used to parse the voice request and obtain the specific content it indicates, that is, the information the user intends to express. Speech recognition can convert the voice request into text, which is then recognized.

In some optional implementations of this embodiment, the text sentence is recognized using natural language recognition technology. Specifically, the voice request is first converted into a text sentence, and the text sentence is then recognized using natural language recognition technology.

In this embodiment, in addition to the recognized text, the parsing result of the voice request may include an intent parsing result for the user who issued the voice request. The user's intent can be parsed from the text recognition result, for example by an intent parsing model trained with a machine learning method. Here, the intent parsing model may be trained on a set of sample text sentences labeled with user intents, and can characterize the correspondence between text sentences and user intents.

Step 202, in response to the parsing result of the voice request including an intent to obtain related information of a specified object, obtaining an image containing the specified object.

In this embodiment, when the parsing result of the voice request includes an intent to obtain related information of a specified object, the electronic device obtains an image containing the specified object. The specified object may be any designated living being or article. The related information of the specified object is any information related to it, such as its name, category, related knowledge, or related comments. A parsing result that includes the intent to obtain related information of the specified object indicates that the user issued the voice request so that the interacting terminal device obtains that information and can make a voice response based on it. The parsing result includes the user's intent, which may include the intent to obtain related information of the specified object and may also include other intents.

To determine the related information of the specified object, the electronic device obtains an image containing the specified object. If the electronic device is a server, it can obtain the image of the specified object from a terminal device. If the electronic device is a terminal device, the image may be one captured by the terminal device, one uploaded by the user, or one captured by another electronic device and received by the terminal device.

Step 203, determining the related information of the specified object based on the image.

In this embodiment, the electronic device determines the related information of the specified object based on the obtained image. Because the image contains the specified object, the related information of the specified object can be determined from it.

The related information of the specified object can be determined from the image in various ways. The image can be compared with images in an existing image database to find an image whose similarity to it exceeds a threshold. The specified object can then be determined from the image found; for example, the image found may correspond to an identifier of the specified object. Alternatively, the image of the specified object can be input into a predetermined image recognition model, which recognizes the specified object contained in the image and may output its identifier or name. The electronic device may then use the information output by the model as the related information, or use that information to obtain the related information of the specified object. The image recognition model here may be a neural network model, a specified algorithm, or the like.
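As a minimal sketch of the threshold-based comparison, assume each image has already been reduced to a numeric feature vector and that similarity is measured by cosine similarity. Both choices, the 0.9 threshold, and the names `cosine_similarity` and `find_matching_object` are illustrative assumptions; the application does not prescribe a particular similarity measure.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_matching_object(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumed value, chosen only for illustration
) -> Optional[str]:
    """Return the identifier of the most similar stored image above the threshold."""
    best_id: Optional[str] = None
    best_score = threshold
    for object_id, features in database.items():
        score = cosine_similarity(query, features)
        if score > best_score:
            best_id, best_score = object_id, score
    return best_id
```

The identifier returned here plays the role of the identifier of the specified object, which can then be used to look up the related information.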

In practice, the related information of the specified object can be looked up in a pre-established database of related information. The related information in that database may correspond to at least one of the image, the identifier, and the name of the specified object. In addition, the internet can be searched based on the identifier or name of the specified object to obtain the related information.

Step 204, generating voice response information based on the parsing result of the voice request and the related information of the specified object.

In this embodiment, the electronic device generates voice response information based on the parsing result of the voice request and the related information of the specified object. The voice response information is information that responds to the voice issued by the user, and may ultimately be output as audio by the terminal device that conducts the voice interaction with the user.

Voice response information can be generated in various ways. A correspondence table of parsing results, related information, and voice response information may be obtained in advance; once the parsing result and the related information are available, the voice response information corresponding to them is looked up in the table. Alternatively, the parsing result and the related information may be input into a previously obtained voice response model, and the voice response information output by the model is obtained. The voice response model can characterize the correspondence between parsing results, related information, and voice response information, and may be a neural network model, a specified algorithm, or the like.
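The correspondence-table approach can be illustrated with a simple keyed lookup. The intent label `"identify_object"`, the table entries, and the function name `lookup_response` are hypothetical and serve only to show the shape of such a table.

```python
from typing import Dict, Optional, Tuple

# Hypothetical correspondence table: (parsed intent, related information) -> response text.
RESPONSE_TABLE: Dict[Tuple[str, str], str] = {
    ("identify_object", "rose"): "This is a rose.",
    ("identify_object", "tulip"): "This is a tulip.",
}


def lookup_response(intent: str, related_info: str) -> Optional[str]:
    """Return the pre-stored response for this pair, or None if the table has no entry."""
    return RESPONSE_TABLE.get((intent, related_info))
```

A table like this only covers pairs enumerated in advance, which is one reason a model-based generator may be preferred when the set of objects is large.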

Step 205, outputting the voice response information.

In this embodiment, the electronic device outputs the voice response information after generating it. Specifically, if the electronic device is a terminal device, it can output the information by playing it as speech. If the electronic device is a server, it can send the voice response information to the terminal device to complete the output.

With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for handling a voice request according to this embodiment. In the application scenario of Fig. 3, an electronic device 301, in response to receiving a voice request 303 sent by a user or another electronic device 302, parses the voice request; in response to the parsing result 304 of the voice request including an intent to obtain related information of a specified object, obtains an image 305 containing the specified object; determines the related information 306 of the specified object based on the image; generates voice response information 307 based on the parsing result 304 of the voice request and the related information 306 of the specified object; and outputs the voice response information 307.

By obtaining an image containing the specified object, the method provided by the above embodiment of the present application can determine the related information of the specified object and thereby output voice response information more accurately when interacting with the user.
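The overall flow of steps 201 to 205 might be sketched as below. This is only an outline under stated assumptions: the `ParseResult` type, the `handle_voice_request` function, and the four callables passed in are hypothetical placeholders for speech parsing, image acquisition, object recognition, and response generation, none of which are specified at this level in the application.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ParseResult:
    text: str                 # recognized text of the voice request
    wants_object_info: bool   # True if the intent is to obtain related information


def handle_voice_request(
    audio: bytes,
    parse: Callable[[bytes], ParseResult],
    acquire_image: Callable[[], Any],
    identify: Callable[[Any], str],
    respond: Callable[[ParseResult, str], str],
) -> Optional[str]:
    """Run steps 201-204 and return the response for the caller to output (step 205)."""
    result = parse(audio)                 # step 201: parse the voice request
    if not result.wants_object_info:
        return None                       # other intents are outside this sketch
    image = acquire_image()               # step 202: obtain an image of the object
    related_info = identify(image)        # step 203: determine related information
    return respond(result, related_info)  # step 204: generate the response
```

Passing the stages in as callables keeps the sketch neutral about whether each stage runs on the terminal device or on the server.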

With further reference to Fig. 4, a flow 400 of another embodiment of the method for handling a voice request is shown. The flow 400 of the method for handling a voice request includes the following steps:

Step 401, in response to receiving a voice request, converting the voice request into a text sentence.

In this embodiment, the electronic device on which the method for handling a voice request runs (for example, the server shown in Fig. 1) can respond after receiving a voice request by converting the voice request into a text sentence.

Step 402, determining whether the text sentence includes a preset keyword.

In this embodiment, the electronic device determines whether the text sentence includes a preset keyword. The keyword is used to determine whether the user intends to obtain related information of a specified object, that is, whether the parsing result includes the intent to obtain related information of the specified object. For example, the preset keyword may be "identify", "identify by photo", and the like.

Keyword matching can be used to check whether the text sentence converted from the voice request contains a preset keyword. Optionally, fuzzy matching may be used: if a word in the text sentence is semantically close to a preset keyword, the match is deemed successful, that is, the text sentence is determined to include the preset keyword.
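A minimal sketch of exact and fuzzy keyword matching follows. Note the substitution: the text describes semantic closeness, whereas this sketch uses string similarity from Python's standard `difflib` module as a simple stand-in. The keyword list, the 0.8 cutoff, and the function name `contains_preset_keyword` are assumptions made for illustration.

```python
import difflib
from typing import List

# Hypothetical preset keywords; the application gives "identify" as an example.
PRESET_KEYWORDS: List[str] = ["identify", "recognize"]


def contains_preset_keyword(sentence: str, cutoff: float = 0.8) -> bool:
    """Return True if the sentence contains a preset keyword, exactly or approximately."""
    lowered = sentence.lower()
    # Exact match: the keyword appears verbatim in the sentence.
    if any(keyword in lowered for keyword in PRESET_KEYWORDS):
        return True
    # Fuzzy match: string similarity stands in for semantic closeness here.
    for word in lowered.split():
        cleaned = word.strip(".,?!")
        if difflib.get_close_matches(cleaned, PRESET_KEYWORDS, n=1, cutoff=cutoff):
            return True
    return False
```

For example, `contains_preset_keyword("please identfy this flower")` still succeeds despite the misrecognized word, which is the kind of tolerance fuzzy matching is meant to provide.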

Step 403, in response to determining that the text sentence includes the preset keyword, determining that the parsing result includes the intent to obtain related information of the specified object.

In this embodiment, in response to determining that the text sentence includes the preset keyword, the electronic device can determine that the parsing result includes the intent to obtain related information of the specified object.

Step 404, in response to the parsing result of the voice request including the intent to obtain related information of the specified object, sending an instruction to capture an image containing the specified object.

In this embodiment, when the parsing result of the voice request includes the intent to obtain related information of the specified object, the electronic device can send an instruction to capture an image containing the specified object. The instruction directs a device to capture an image of the specified object and thereby generate an image containing it. Specifically, the electronic device may be a server that sends the instruction to a terminal device having a camera, so that the terminal device captures the image after receiving the instruction. Alternatively, the electronic device may be a terminal device that sends the instruction to its camera, so that the camera captures the image after receiving the instruction.

Step 405, obtaining the captured image containing the specified object.

In this embodiment, the electronic device obtains the captured image containing the specified object. If the electronic device is a server, it receives the image captured by the terminal device. If the electronic device is a terminal device, it receives the image captured by the camera.

Specifically, the image can be obtained in various ways. For example, the user may place the specified object in front of the camera lens so that the camera can capture the image.

In some optional implementations of this embodiment, human body recognition may be used to identify, among the images captured by a rotating camera, the image that contains the user, and that image is obtained as the image containing the specified object.

When a user interacts with a device having a voice service function, the user usually stays close to the device. The camera can therefore capture images of multiple positions by rotating, locate the user by human body recognition, and then capture an image containing the user. The electronic device may then use the image containing the user as the image containing the specified object. Alternatively, after determining the image containing the user, it may send a shooting instruction to the camera so that the camera shoots the position where the user is located with an enlarged imaging range, and then obtain the image the camera returns.

Step 406, determining the related information of the specified object based on the image.

The related information of the specified object can be determined from the image in various ways. The image can be compared with images in a database to find an image whose similarity to it exceeds a threshold. Alternatively, the image of the specified object can be input into a predetermined image recognition model, which recognizes the specified object contained in the image and may output its identifier or name. The electronic device can then use the information output by the model to obtain the related information of the specified object. The image recognition model here may be a neural network model, a specified algorithm, or the like. In addition, the related information may be searched for in a database or on the internet.

Step 407, generating voice response information based on the parsing result of the voice request and the related information of the specified object.

Voice response information can be generated in various ways. A correspondence table of parsing results and voice response templates may be obtained in advance. Once the parsing result is available, the voice response template corresponding to it is looked up in the table, and the related information found for the specified object is combined with that template to generate the voice response information. Alternatively, the parsing result and the related information may be input into a previously obtained voice response model, and the voice response information output by the model is obtained. The voice response model can characterize the correspondence between parsing results, related information, and voice response information.
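The template-based approach might look like the following sketch, in which the parsing result selects a template and the related information fills its placeholders. The intent labels, the template wording, and the function name `build_response` are hypothetical.

```python
from typing import Dict

# Hypothetical correspondence table: parsed intent -> voice response template.
RESPONSE_TEMPLATES: Dict[str, str] = {
    "identify_object": "This is a {name}.",
    "describe_object": "The {name} belongs to the category {category}.",
}


def build_response(intent: str, related_info: Dict[str, str]) -> str:
    """Look up the template for the intent and fill it with the related information."""
    template = RESPONSE_TEMPLATES[intent]
    # Keys in related_info that the template does not use are simply ignored.
    return template.format(**related_info)
```

Compared with a table that stores complete responses, templates need one entry per intent rather than one per intent and object pair.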

Step 408, outputting the voice response information.

By using preset keywords, this embodiment of the present application parses the voice request quickly and accurately, which yields more accurate voice response information and improves the efficiency of handling voice requests.

With further reference to Fig. 5, a flow 500 of another embodiment of the method for handling a voice request is shown. The flow 500 of the method for handling a voice request includes the following steps:

Step 501, in response to receiving a voice request, parsing the voice request.

Step 502, in response to the parsing result of the voice request including an intent to obtain related information of a specified object, sending an instruction to capture an image containing the specified object.

Step 503, determining location information of the user who issued the voice request.

In this embodiment, the electronic device determines the location information of the user who issued the voice request. The location information may be at least one coordinate corresponding to the position, expressed for example as (x, y, z). It may also be the position of the user relative to the electronic device that captures the image; for example, the user may be in the 11 o'clock direction at the same horizontal level as the electronic device.
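The two forms of location information can be related to each other. The sketch below converts a clock direction into a bearing and then into a horizontal offset, assuming that 12 o'clock means straight ahead of the device, that x points to the device's right, and that y points straight ahead. These conventions and the function names are assumptions made for illustration.

```python
import math
from typing import Tuple


def clock_direction_to_bearing(hour: int) -> float:
    """Bearing in degrees, clockwise from straight ahead (12 o'clock)."""
    return (hour % 12) * 30.0


def bearing_to_offset(bearing_deg: float, distance: float = 1.0) -> Tuple[float, float]:
    """(x, y) offset in the device's horizontal plane: x to the right, y straight ahead."""
    radians = math.radians(bearing_deg)
    return (distance * math.sin(radians), distance * math.cos(radians))
```

Under these conventions the 11 o'clock direction corresponds to a bearing of 330 degrees, that is, slightly to the left of straight ahead.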

In practice, the user may issue a voice request to a terminal device, and after receiving the voice request issued by the user, the terminal device may also send the voice request to a server.

In some optional implementations of this embodiment, the location information of the user who issued the voice request is determined by sound source localization.

In this embodiment, the user utters speech while interacting with the terminal device by voice. Because the voice request issued by the user is speech, the terminal device can perform sound source localization on the voice request to determine the location information of the user.

Sound source localization is also referred to as sound localization. Techniques such as steerable beamforming based on maximum output power, high-resolution spectral estimation, or time difference of arrival (TDOA) may be used.
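As a non-limiting illustration, one common family of sound source localization methods estimates the difference in arrival time of the same sound at two microphones and converts it into a direction. The sketch below uses a brute-force cross-correlation over two sample lists; the function names, the two-microphone arrangement, and the far-field assumption are illustrative choices and are not taken from the application.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at about 20 degrees Celsius


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that maximises their cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(lag_samples, sample_rate, mic_distance):
    """Convert a delay between two microphones into an arrival angle in degrees (0 = broadside)."""
    delay = lag_samples / sample_rate
    # Clamp to [-1, 1] so that measurement noise cannot push asin out of its domain.
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay / mic_distance))
    return math.degrees(math.asin(ratio))
```

With more than two microphones, several such pairwise angles can be combined to estimate a position rather than only a direction.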

Step 504, obtain the image containing the specified object that is captured based on the location information of the user.

In this embodiment, the above electronic device obtains the captured image containing the specified object, where the image is captured based on the location information of the user. Specifically, a photographing device installed on the terminal device or independent of the terminal device may photograph the user position indicated by the location information of the user, so as to obtain the image containing the specified object. After the user position is determined, the user position may also be photographed with an enlarged imaging range. The user position and the positions around it may also be photographed.

Step 505, determine the relevant information of the specified object based on the image.

The relevant information of the specified object may be determined based on the image in various ways. The image may be compared with images in a database to find an image whose similarity to the image is higher than a threshold. The image of the specified object may also be input into a predetermined image recognition model, which identifies the specified object contained in the image. The image recognition model may output an identifier or a name of the specified object, or the like. The above electronic device may then obtain the relevant information of the specified object using the information output by the model. The image recognition model here may be a neural network model, a specified algorithm, or the like.
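As a non-limiting illustration, the threshold comparison described above might be sketched as follows. The sketch assumes that each image has already been reduced to a numeric feature vector by some other component, and it uses cosine similarity as the similarity measure; both choices, as well as the names cosine_similarity and find_match, are assumptions and are not specified by the application.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_match(query, database, threshold=0.9):
    """Return the label of the most similar stored image if its similarity exceeds the threshold, else None."""
    best_label, best_score = None, threshold
    for label, features in database.items():
        score = cosine_similarity(query, features)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The returned label plays the role of the identifier or name from which the relevant information is subsequently obtained.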

Step 506, generate voice-response information based on the analysis result of the voice request and the relevant information of the specified object.

The voice-response information may be generated in various ways. A mapping table among analysis results, relevant information, and voice-response information may be obtained in advance. After the analysis result and the relevant information are obtained, the voice-response information corresponding to the analysis result and the relevant information is looked up in the mapping table. Alternatively, the analysis result and the relevant information may be input into a pre-obtained voice-response information model to obtain the voice-response information output by the model. The voice-response information model may characterize the correspondence among the analysis result, the relevant information, and the voice-response information.

Step 507, output the voice-response information.

By determining the location information of the user, this embodiment can obtain an image of the specified object even when the specified object is not within the field of view of the photographing device, thereby ensuring that the voice-response information is generated and output.

With further reference to Fig. 6, a flow 600 of another embodiment of the method for handling a voice request is shown. The flow 600 of the method for handling a voice request includes the following steps:

Step 601, in response to receiving a voice request, parse the voice request.

In a specific application scenario, the user may point at a flower nearby and ask "What flower is this?".

Step 602, in response to the analysis result of the voice request including an intention to obtain relevant information of a specified object, obtain an image containing the specified object.

In this embodiment, when the analysis result of the voice request includes an intention to obtain relevant information of a specified object, the above electronic device obtains an image containing the specified object. The specified object may be any designated living being or article. The relevant information of the specified object is any of various kinds of information related to the specified object, for example, its name or category. The analysis result including the intention to obtain the relevant information of the specified object indicates that the user issued the voice request so that the interacting terminal device can obtain the relevant information of the specified object; the terminal device can then make a voice response according to the obtained relevant information. The analysis result includes the user's intentions, which may include the intention to obtain the relevant information of the specified object and may also include other intentions.

Step 603, in response to determining that the analysis result of the voice request includes a type keyword indicating the specified object, determine whether an image matching the image of the specified object exists in the image database of the type indicated by the type keyword.

In this embodiment, after determining that the analysis result of the voice request includes a type keyword indicating the specified object, the above electronic device responds by determining whether an image matching the image of the specified object exists in the image database of the type indicated by the type keyword.

Specifically, each type corresponds to one image database, and the images in the image database of a given type all belong to that type. Searching for an image in the image database corresponding to the type of the specified object reduces the search workload and improves the search speed.

In addition to the intention to obtain the relevant information of the specified object, the analysis result may also include a type keyword, such as "flower" or "animal". For example, when interacting with the terminal device, the user may ask "What flower is this?"; "flower" here is the type keyword.

Step 604, if it is determined that an image matching the image of the specified object exists in the image database of the type indicated by the type keyword, determine the information corresponding to the matched image as the relevant information of the specified object.

In this embodiment, if the above electronic device determines that an image matching the image of the specified object exists in the image database of the type indicated by the type keyword, it determines the information corresponding to the matched image as the relevant information of the specified object.

Two images matching means that the similarity between the two images is high, for example, higher than a threshold. Specifically, whether images match may be determined using image recognition techniques, OCR (Optical Character Recognition), or the like.

In practice, the relevant information of the specified object may be looked up in a pre-established database of relevant information. Each item of relevant information in the database corresponds to at least one of an image, an identifier, or a name of the specified object. Alternatively, after the image matching the image of the specified object is determined, the identifier or name of the specified object may be determined, and a search may be performed on the Internet based on the identifier or name.

Step 605, if it is determined that no image matching the image of the specified object exists in the image database of the type indicated by the type keyword, search the total image database for an image matching the image of the specified object, and determine the information corresponding to the matched image as the relevant information of the specified object.

In this embodiment, if the above electronic device determines that no image matching the image of the specified object exists in the image database of the type indicated by the type keyword, it searches the total image database for an image matching the image of the specified object. The information corresponding to the matched image is then determined as the relevant information of the specified object.

If no matching image can be found in the image database of the type indicated by the type keyword, the image is searched for in the total image database, which stores a large number of images. The total image database here includes images of objects of various types and is different from the image database of the type indicated by the type keyword.

Step 606, in response to determining that the analysis result of the voice request does not include a type keyword indicating the specified object, search the total image database for an image matching the image of the specified object, and determine the information corresponding to the matched image as the relevant information of the specified object.

In this embodiment, after determining that the analysis result of the voice request does not include a type keyword indicating the specified object, the above electronic device responds by searching the total image database for an image matching the image of the specified object and determining the information corresponding to the matched image as the relevant information of the specified object. That is, if the type of the specified object cannot be determined, an image matching the specified object is searched for directly in the total image database. The total image database here includes images of objects of various types and is different from the image database of the type indicated by the type keyword.

For example, the user may point at a flower nearby and ask "What flower is this?", and the terminal device interacting with the user may respond "This is not a flower; this is a husky."
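As a non-limiting illustration, the search order of steps 603 to 606 might be sketched as follows. Images are modelled as short feature tuples and the matching test is a simple per-component tolerance; these representations, the tolerance value, and the names lookup and determine_relevant_info are assumptions made for this sketch. The sketch also assumes that a type keyword with no corresponding typed database is handled like a missing keyword.

```python
def lookup(image, database, tolerance=0.1):
    """Return the information of the first stored image within tolerance of the query, else None."""
    for stored, info in database:
        if max(abs(a - b) for a, b in zip(image, stored)) <= tolerance:
            return info
    return None


def determine_relevant_info(image, type_keyword, typed_databases, total_database):
    """Search the typed database first when a type keyword is available, then the total database."""
    # Steps 603 and 604: search the database of the indicated type first.
    if type_keyword in typed_databases:
        info = lookup(image, typed_databases[type_keyword])
        if info is not None:
            return info
    # Steps 605 and 606: no type keyword, or no match in the typed database,
    # so fall back to the total image database.
    return lookup(image, total_database)
```

Because the typed database is a small subset of the total database, the common case is resolved without scanning the full collection.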

Step 607, generate voice-response information based on the analysis result of the voice request and the relevant information of the specified object.

Step 608, output the voice-response information.

In this embodiment, matching is performed within the smaller scope of the image database corresponding to the type of the specified object, which speeds up matching and thus shortens the time needed to process the voice request.

With further reference to Fig. 7, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a device for handling a voice request. The device embodiment corresponds to the method embodiment shown in Fig. 2, and the device may be applied in various electronic devices.

As shown in Fig. 7, the device 700 for handling a voice request of this embodiment includes a resolution unit 701, an acquiring unit 702, a determination unit 703, a generation unit 704, and an output unit 705. The resolution unit 701 is configured to parse a voice request in response to receiving the voice request. The acquiring unit 702 is configured to obtain an image containing a specified object in response to the analysis result of the voice request including an intention to obtain relevant information of the specified object. The determination unit 703 is configured to determine the relevant information of the specified object based on the image. The generation unit 704 is configured to generate voice-response information based on the analysis result of the voice request and the relevant information of the specified object. The output unit 705 is configured to output the voice-response information.
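As a non-limiting illustration, the cooperation of the five units might be sketched as a class whose fields stand in for the units. The class name VoiceRequestDevice, the callable signatures, and the "wants_object_info" key are hypothetical; the voice request is modelled here as already-transcribed text, which is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceRequestDevice:
    parse: Callable[[str], dict]           # stands in for resolution unit 701
    acquire: Callable[[], bytes]           # stands in for acquiring unit 702
    determine: Callable[[bytes], dict]     # stands in for determination unit 703
    generate: Callable[[dict, dict], str]  # stands in for generation unit 704
    output: Callable[[str], None]          # stands in for output unit 705

    def handle(self, voice_request: str) -> Optional[str]:
        analysis = self.parse(voice_request)
        if not analysis.get("wants_object_info"):
            return None  # other intentions are outside the scope of this sketch
        image = self.acquire()
        info = self.determine(image)
        response = self.generate(analysis, info)
        self.output(response)
        return response
```

Keeping each unit behind a plain callable reflects the point that a unit may run on a terminal device or on a server without changing the overall flow.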

In this embodiment, after receiving a voice request, the resolution unit 701 responds by parsing the voice request. A voice request is a request issued in the form of speech. In practice, the resolution unit 701 may be a terminal device, in which case it can receive a voice request that the user issues by speaking or the like. The resolution unit 701 may also be a server, in which case it can receive the voice request that the terminal device sends after obtaining the user's voice request.

In this embodiment, when the analysis result of the voice request includes an intention to obtain relevant information of a specified object, the acquiring unit 702 obtains an image containing the specified object. The specified object may be any designated living being or article. The relevant information of the specified object is any of various kinds of information related to the specified object, for example, its name, category, related knowledge, or related comments. The analysis result including the intention to obtain the relevant information of the specified object indicates that the user issued the voice request so that the interacting terminal device can obtain the relevant information of the specified object; the terminal device can then make a voice response according to the obtained relevant information. The analysis result includes the user's intentions, which may include the intention to obtain the relevant information of the specified object and may also include other intentions.

In this embodiment, the determination unit 703 determines the relevant information of the specified object based on the obtained image. The image is an image of the specified object, that is, the image contains the specified object, so the relevant information of the specified object can be determined based on the image.

In this embodiment, the generation unit 704 generates voice-response information based on the analysis result of the voice request and the relevant information of the specified object. Voice-response information is information that responds to the speech uttered by the user, and may ultimately be output in audio form by the terminal device that interacts with the user by voice.

In this embodiment, after the voice-response information is generated, the output unit 705 outputs it. Specifically, if the output unit 705 is a terminal device, it may output the voice-response information by playing speech. If the output unit 705 is a server, it may send the voice-response information to the terminal device to complete the output.

In some optional implementations of this embodiment, the acquiring unit is further configured to obtain the image containing the specified object as follows: in response to the intention analysis result of the voice request including an intention to obtain relevant information of a specified object, send an instruction to capture an image containing the specified object; and obtain the captured image containing the specified object.

In some optional implementations of this embodiment, the acquiring unit is further configured to obtain the captured image containing the specified object as follows: determine the location information of the user who issued the voice request; and obtain the image containing the specified object that is captured based on the location information of the user.

In some optional implementations of this embodiment, the acquiring unit is further configured to determine the location information of the user who issued the voice request as follows: determine the location information of the user who issued the voice request by sound source localization.

In some optional implementations of this embodiment, the resolution unit is further configured to perform intention analysis on the voice request as follows: convert the voice request into a text sentence; determine whether the text sentence includes a preset keyword; and in response to determining that the text sentence includes a preset keyword, determine that the intention analysis result includes an intention to obtain relevant information of a specified object.
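As a non-limiting illustration, the keyword check on the text sentence might be sketched as follows. The keyword list, the result keys, and the function name parse_intent are hypothetical, and the conversion of speech into a text sentence is assumed to have been performed by a separate speech recognition component.

```python
# Illustrative preset keywords only; a real deployment would define its own.
PRESET_KEYWORDS = ("what is this", "what flower", "what animal")


def parse_intent(text_sentence: str, keywords=PRESET_KEYWORDS) -> dict:
    """Report whether the text sentence contains any preset keyword."""
    normalized = text_sentence.lower()
    matched = [k for k in keywords if k in normalized]
    return {"wants_object_info": bool(matched), "matched_keywords": matched}
```

Substring matching is the simplest possible test; it is fast but does not handle paraphrases, which is one reason a keyword list is usually kept small and specific.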

In some optional implementations of this embodiment, the determination unit is further configured to determine the relevant information of the specified object as follows: in response to determining that the analysis result of the voice request includes a type keyword indicating the specified object, determine whether an image matching the image of the specified object exists in the image database of the type indicated by the type keyword; if it is determined that such a matching image exists in the image database of the type indicated by the type keyword, determine the information corresponding to the matched image as the relevant information of the specified object; if it is determined that no such matching image exists in the image database of the type indicated by the type keyword, search the total image database for an image matching the image of the specified object, and determine the information corresponding to the matched image as the relevant information of the specified object; and in response to determining that the analysis result of the voice request does not include a type keyword indicating the specified object, search the total image database for an image matching the image of the specified object, and determine the information corresponding to the matched image as the relevant information of the specified object.

Referring now to Fig. 8, a schematic structural diagram of a computer system 800 suitable for implementing the electronic device of the embodiments of the present application is shown. The electronic device shown in Fig. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.

As shown in Fig. 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the system 800. The CPU 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 809, and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the above functions defined in the method of the present application are performed. It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device. Also in the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and such a medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted over any suitable medium, including but not limited to: wireless, wire, optical cable, RF, or any suitable combination of the above.

The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, this may be described as: a processor including a resolution unit, an acquiring unit, a determination unit, a generation unit, and an output unit. The names of these units do not, in certain cases, limit the units themselves; for example, the resolution unit may also be described as "a unit that parses a voice request in response to receiving the voice request".

In another aspect, the present application also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: in response to receiving a voice request, parse the voice request; in response to the analysis result of the voice request including an intention to obtain relevant information of a specified object, obtain an image containing the specified object; determine the relevant information of the specified object based on the image; generate voice-response information based on the analysis result of the voice request and the relevant information of the specified object; and output the voice-response information.

The above description is only of the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims

1. A method for handling a voice request, comprising:

in response to receiving a voice request, parsing the voice request;

in response to the analysis result of the voice request including an intention to obtain relevant information of a specified object, obtaining an image containing the specified object;

determining the relevant information of the specified object based on the image;

generating voice-response information based on the analysis result of the voice request and the relevant information of the specified object; and

outputting the voice-response information.

2. The method according to claim 1, wherein said obtaining an image containing the specified object, in response to the intention analysis result of the voice request including an intention to obtain relevant information of a specified object, comprises:

in response to the intention analysis result of the voice request including an intention to obtain relevant information of a specified object, sending an instruction to capture an image containing the specified object; and

obtaining the captured image containing the specified object.

3. The method according to claim 2, wherein said obtaining the captured image containing the specified object comprises:

determining the location information of the user who issued the voice request; and

obtaining the image containing the specified object that is captured based on the location information of the user.

4. The method according to claim 3, wherein said determining the location information of the user who issued the voice request comprises:

determining the location information of the user who issued the voice request by sound source localization.

5. The method according to claim 1, wherein performing intention analysis on the voice request comprises:

converting the voice request into a text sentence;

determining whether the text sentence includes a preset keyword; and

in response to determining that the text sentence includes a preset keyword, determining that the intention analysis result includes an intention to obtain relevant information of a specified object.

6. The method according to claim 1, wherein said determining the relevant information of the specified object based on the image comprises:

in response to determining that the analysis result of the voice request includes a type keyword indicating the specified object, determining whether an image matching the image of the specified object exists in the image database of the type indicated by the type keyword; if it is determined that an image matching the image of the specified object exists in the image database of the type indicated by the type keyword, determining the information corresponding to the matched image as the relevant information of the specified object; if it is determined that no image matching the image of the specified object exists in the image database of the type indicated by the type keyword, searching the total image database for an image matching the image of the specified object, and determining the information corresponding to the matched image as the relevant information of the specified object; and

in response to determining that the analysis result of the voice request does not include a type keyword indicating the specified object, searching the total image database for an image matching the image of the specified object, and determining the information corresponding to the matched image as the relevant information of the specified object.

7. A device for handling a voice request, comprising:

a resolution unit, configured to parse a voice request in response to receiving the voice request;

an acquiring unit, configured to obtain an image containing a specified object in response to the analysis result of the voice request including an intention to obtain relevant information of the specified object;

a determination unit, configured to determine the relevant information of the specified object based on the image;

a generation unit, configured to generate voice-response information based on the analysis result of the voice request and the relevant information of the specified object; and

an output unit, configured to output the voice-response information.

8. The device according to claim 7, wherein the acquiring unit is further configured to obtain the image containing the specified object as follows:

obtaining the captured image containing the specified object.

9. The device according to claim 8, wherein the acquiring unit is further configured to obtain the captured image containing the specified object as follows:

10. The device according to claim 9, wherein the acquiring unit is further configured to determine the location information of the user who issued the voice request as follows:

11. The device according to claim 7, wherein the resolution unit is further configured to perform intention analysis on the voice request as follows:

converting the voice request into a text sentence;

determining whether the text sentence includes a preset keyword;

12. The device according to claim 7, wherein the determination unit is further configured to determine the relevant information of the specified object as follows:

13. An electronic device, comprising:

one or more processors; and

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.

14. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.