CN110955818A - Searching method, searching device, terminal equipment and storage medium

Searching method, searching device, terminal equipment and storage medium

Info

Publication number
CN110955818A
Authority
CN
China
Prior art keywords
user
search
target
voice information
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911229868.8A
Other languages
Chinese (zh)
Inventor
石真
王婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority to CN201911229868.8A
Publication of CN110955818A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application provides a searching method, a searching device, terminal equipment and a storage medium. The method includes: acquiring voice information of a user during interaction; judging, through a preset algorithm model, whether the voice information satisfies a target condition; if the voice information does not satisfy the target condition, acquiring image data matched with the voice information; acquiring a first search intention of the user corresponding to the image data; determining a target search intention of the user based on the first search intention and the voice information; and searching for and displaying a result matched with the target search intention. In this way, when the user's voice information does not satisfy the target condition, the user's target search intention is confirmed from both the voice information and the first search intention corresponding to the image data, so that the result matched with the target search intention is searched for, the search is more accurate, and user experience is improved.

Description

Searching method, searching device, terminal equipment and storage medium
Technical Field
The present application relates to the field of image search technologies, and in particular, to a search method, an apparatus, a terminal device, and a storage medium.
Background
With the continuous development of search engine technology, voice search has gradually been applied to various terminal devices. In one approach, the search speech input by the user is subjected to speech recognition to convert it into text, keywords in the text are analyzed, matching search results are retrieved according to the keywords (or corresponding question-answer results are queried in the database of a question-answering system), and the results are presented to the user in the form of speech, web pages, text, and the like. However, when searching by voice, search errors often occur because the spoken content is non-standard or incomplete, making accurate search difficult.
Disclosure of Invention
In view of the above problems, the present application provides a searching method, apparatus, terminal device and storage medium to solve the above problems.
In a first aspect, an embodiment of the present application provides a search method, where the method includes: acquiring voice information of a user in an interaction process; judging whether the voice information satisfies a target condition through a preset algorithm model, wherein the target condition represents that the search intention of the user can be completely identified from the voice information; if the voice information does not satisfy the target condition, acquiring image data matched with the voice information; acquiring a first search intention of the user corresponding to the image data; determining a target search intention of the user based on the first search intention and the voice information; and searching for and displaying a result matched with the target search intention.
Further, the acquiring the image data matched with the voice information includes: acquiring gesture information of the user based on an image recognition mode, wherein the gesture information comprises expression, gesture and/or action information of the user; and taking the image data comprising the posture information of the user as the image data matched with the voice information.
Further, the acquiring the first search intention of the user corresponding to the image data includes: and confirming a target object in a range corresponding to the posture information of the user in the image data as the first search intention.
Further, the determining the target search intention of the user based on the first search intention and the voice information includes: acquiring a distance parameter between a target object and each search target in the first search intention; and taking the search target with the minimum distance parameter as the target search intention of the user.
Further, the number of search targets with the minimum distance parameter may be multiple, and the taking the search target with the minimum distance parameter as the target search intention of the user includes: performing semantic recognition processing on the voice information to obtain a search keyword; and taking, from the plurality of search targets with the minimum distance parameter, the search target that satisfies the keyword as the target search intention of the user.
Further, the determining the target search intention of the user based on the first search intention and the voice information includes: supplementing the content of the voice information according to the first search intention to obtain target voice data; and determining the target search intention of the user according to the target voice data.
Further, the acquiring the image data matched with the voice information includes: acquiring a starting instruction of a camera; acquiring image data which is acquired by the camera and comprises the posture information of the user, and taking the image data as image data matched with the voice information; or acquiring image data selected by a user as image data matched with the voice information.
In a second aspect, an embodiment of the present application provides a search apparatus, including: the first acquisition module is used for acquiring voice information of a user in the interaction process; the judging module is used for judging whether the voice information meets the target condition through a preset algorithm model; the second acquisition module is used for acquiring image data matched with the voice information if the voice information does not meet the target condition; the third acquisition module is used for acquiring the first search intention of the user corresponding to the image data; a determination module for determining a target search intention of the user based on the first search intention and the voice information; and the search processing module is used for searching and displaying the result matched with the target search intention.
Further, the second obtaining module may be specifically configured to obtain gesture information of the user based on an image recognition manner, where the gesture information includes expression, gesture, and/or action information of the user; and taking the image data comprising the posture information of the user as the image data matched with the voice information.
Further, the third obtaining module may be specifically configured to determine, as the first search intention, a target object in a range corresponding to the posture information of the user in the image data.
Further, the determining module may be specifically configured to obtain a distance parameter between the target object and each search target in the first search intention, and to take the search target with the minimum distance parameter as the target search intention of the user.
Furthermore, the number of search targets with the minimum distance parameter may be multiple, and the determining module may include a processing unit configured to perform semantic recognition processing on the voice information to obtain a search keyword, and a target search intention determining unit configured to take, from the plurality of search targets with the minimum distance parameter, a search target that satisfies the keyword as the target search intention of the user.
Further, the determining module may be specifically configured to supplement the content of the voice information according to the first search intention to obtain target voice data; and determining the target search intention of the user according to the target voice data.
Further, the second obtaining module may be specifically configured to obtain a start instruction of the camera; acquiring image data which is acquired by the camera and comprises the posture information of the user, and taking the image data as image data matched with the voice information; or acquiring image data selected by a user as image data matched with the voice information.
In a third aspect, an embodiment of the present application provides a terminal device, which includes: a memory; one or more processors coupled with the memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of the first aspect as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which program code is stored, and the program code can be called by a processor to execute the method according to the first aspect.
In summary, the embodiments of the application provide a searching method, a searching device, terminal equipment and a storage medium. The method includes acquiring voice information of a user during interaction, judging through a preset algorithm model whether the voice information satisfies a target condition, acquiring image data matched with the voice information if it does not, acquiring a first search intention of the user corresponding to the image data, determining a target search intention of the user based on the first search intention and the voice information, and searching for and displaying a result matched with the target search intention. In this way, when the user's voice information does not satisfy the target condition, the user's target search intention is confirmed from both the voice information and the first search intention corresponding to the image data, so that the search is more accurate and user experience is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 shows a schematic diagram of an application environment suitable for the embodiment of the present application.
Fig. 2 shows a flowchart of a method of a search method according to an embodiment of the present application.
Fig. 3 shows a flowchart of a method of searching according to another embodiment of the present application.
Fig. 4 shows a method flowchart of step S270 in fig. 3.
Fig. 5 is a schematic diagram illustrating a positional relationship between the object corresponding to the user's posture information and the target objects in image data matched with voice information acquired in an embodiment of the present application.
Fig. 6 is a schematic diagram illustrating a further positional relationship between the object and the target objects in the image data matched with voice information in this embodiment.
Fig. 7 is a schematic diagram illustrating a positional relationship between the object corresponding to the user's posture information and the target objects in image data matched with voice information acquired in another embodiment of the present application.
Fig. 8 is a schematic diagram illustrating a positional relationship between the object and the target objects in image data matched with voice information acquired in yet another embodiment of the present application.
Fig. 9 is a flowchart illustrating a method of a search method according to another embodiment of the present application.
Fig. 10 shows a block diagram of a search apparatus according to an embodiment of the present application.
Fig. 11 shows a block diagram of a terminal device for executing a search method according to an embodiment of the present application.
Fig. 12 illustrates a storage unit for storing or carrying program codes for implementing a search method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In recent years, with accelerated breakthroughs and the wide application of technologies such as the mobile internet, big data, cloud computing and sensors, the development of artificial intelligence has entered a brand-new stage. Intelligent voice search technology, a key link in the AI (Artificial Intelligence) industry chain, is one of the most mature AI technologies and has developed rapidly in fields such as marketing and customer service, smart homes, intelligent vehicles, smart wearables and intelligent search, for example in mobile phone intelligent assistants.
As one mode, the mobile phone intelligent assistant can be used for recognizing the voice input by the user, searching the content matched with the recognized voice information and displaying the content to the user through a mobile phone interface. However, if the user fails to input complete voice data into the mobile phone, the mobile phone may not be able to accurately recognize the search intention of the user, which reduces the user experience.
The inventor found in research that search accuracy can be increased by recognizing the user's voice information in combination with gesture information. For example, when the user inputs voice, if the voice information is detected to be incomplete, image data matched with the voice information is obtained; the target search intention of the user is then determined by combining the voice information and the image data, and a result matched with the target search intention is searched for and returned to the user for display, thereby improving search accuracy. Hence, the searching method, searching device, terminal device and storage medium of the embodiments of the present application are proposed.
In order to better understand the searching method, the searching apparatus, the terminal device, and the storage medium provided in the embodiments of the present application, an application environment suitable for the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiment of the present application. The search method provided by the embodiment of the present application can be applied to the polymorphic interaction system 100 shown in fig. 1. The polymorphic interaction system 100 includes a terminal device 501 and a server 102, the server 102 being communicatively coupled to the terminal device 501. The server 102 may be a conventional server or a cloud server, and is not limited herein.
The terminal device 501 may be any of various electronic devices having a display screen and supporting data input, including but not limited to a smart phone, a tablet computer, a laptop computer, a desktop computer, a wearable electronic device, and the like. Specifically, the data input may be voice input based on a voice module provided on the terminal device 501, character input based on a character input module, or the like. The terminal device 501 is provided with a camera, which may be disposed on the side of the terminal device 501 bearing the display screen; optionally, the camera may instead be disposed on the side of the terminal device 501 facing away from the display screen. It should be noted that image data of the user, including the user's posture information, can be collected through the camera to assist in accurately identifying the user's search intention.
The terminal device 501 may have a client application installed thereon, and the user may communicate with the server 102 based on the client application (e.g., an APP or a WeChat applet). Specifically, the server 102 is installed with a corresponding server application. A user may register a user account with the server 102 based on the client application and communicate with the server 102 based on that account; for example, the user logs into the user account in the client application and inputs text information, voice information, image data, and the like through it. After receiving the information input by the user, the client application may send it to the server 102, so that the server 102 can receive, process and store it, and the server 102 may also return corresponding output information to the terminal device 501.
In some embodiments, the means for processing the information input by the user may also be disposed on the terminal device 501, so that the terminal device 501 does not need to rely on establishing communication with the server 102 to achieve interaction with the user, and in this case, the polymorphic interaction system 100 may only include the terminal device 501.
The above application environments are only examples for facilitating understanding, and it is to be understood that the embodiments of the present application are not limited to the above application environments.
The search method, apparatus, terminal device and storage medium provided by the embodiments of the present application will be described in detail below with specific embodiments.
As shown in fig. 2, a flowchart of a method of a search method provided in an embodiment of the present application is shown. The searching method provided by the embodiment can be applied to terminal equipment with a display screen or other image output devices, and the terminal equipment can be electronic equipment such as a smart phone, a tablet personal computer and a wearable intelligent terminal.
In a specific embodiment, the search method may be applied to the search apparatus 400 shown in fig. 10 and the terminal device 501 shown in fig. 11. The flow shown in fig. 2 will be described in detail below. The above search method may specifically include the steps of:
step S110: and acquiring voice information of the user in the interactive process.
In some cases, the voice information of a person in a video segment, or simulated voice information, may also be used as the voice information of the user in the interaction process, which is not limited herein. As one mode, the voice data may include the user's language (for example, whether the user speaks Chinese or not; if Chinese, whether the user speaks Mandarin or not; and if not Mandarin, which regional dialect the user speaks, such as Hunanese), tone, volume, and the like, which are not limited herein.
As a mode, the voice information of the user in the interaction process may be the voice information of the user who interacts with the terminal device through a human-computer interaction interface of the terminal device at present.
As one mode, the voice information of the user in the interactive process can be obtained by extracting the features of the voice data and then decoding the extracted voice features by using the acoustic model and the language model obtained by pre-training.
Optionally, the voice information of the user acquired by the terminal device may be stored locally, or may be sent by the terminal device to the server for storage. Storing on the server avoids the slowdown caused by redundant data accumulating on the terminal device.
Step S120: and judging whether the voice information meets a target condition through a preset algorithm model, wherein the target condition is used for representing the search intention of the user which can be completely identified according to the voice information.
As one mode, the preset algorithm model may be a neural network model obtained by training on a large amount of sample speech data, for example an RNN (Recurrent Neural Network) or an LSTM (Long Short-Term Memory) network. By inputting the voice information into the preset algorithm model, it can be judged whether the voice information satisfies the target condition, i.e., whether the user's search intention can be completely identified from the voice information. It can be understood that if the voice information satisfies the target condition, the user's search intention can be completely recognized from the voice information; if it does not, the search intention cannot be completely recognized.
The sample speech data may be speech data from a sample library. For example, a sample library may be formed by pre-storing a large amount of the user's voice information (including voice information currently entered by the user or pre-stored voice information such as the user's voice recordings), where the voice information in the sample library has been processed into voice information from which the user's search intention can be completely recognized.
In some embodiments, the voice data in the sample library may also be voice data of a person in a video segment, or may also be voice data of a simulated person, and the like. When a user needs to search, the user can also select to search by adopting the voice data of the person in a section of video or the voice data of the simulated person, in this case, the preset algorithm model can judge whether the voice data used for searching meets the target condition by judging whether the search intention of the user can be completely identified according to the voice data.
For example, in a specific application scenario, suppose the user says "help me search what this is". Since this voice information lacks a definite search target, the user's search intention cannot be identified from the speech alone; the voice information is incomplete and therefore does not satisfy the target condition. Conversely, if the user says "help me search for the blue raincoat beside me", the search target of this voice information is clear, namely "blue raincoat", so the user's search intention can be identified more accurately from the speech. However, if there are several blue raincoats beside the user, the search returns many results and places a certain selection burden on the user; in that case the accuracy of recognizing the search intention through voice information alone is still lacking.
As one way, then, after the voice information of the user in the interaction process is acquired, whether it satisfies the target condition can be judged through the preset algorithm model. If the target condition is satisfied, the search is performed directly; if not, image data matched with the voice information is further acquired to assist in determining the user's search intention, improving recognition accuracy and reliability.
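Purely by way of illustration, the following Python sketch shows what judging the target condition and branching on it could look like at the code level. It is not the trained model of the embodiment: a toy word-list heuristic stands in for the RNN/LSTM classifier, and every name and word list in it is an assumption made for the example.

```python
# A minimal, illustrative stand-in for the "preset algorithm model".
# The application describes a trained RNN/LSTM classifier; here a simple
# heuristic over the transcribed text is used instead, purely to show the
# control flow. The word lists and examples are invented for this sketch.
DEMONSTRATIVES = {"this", "that", "it", "these", "those"}
FUNCTION_WORDS = {"help", "me", "search", "for", "please", "what", "is", "a", "the"}

def satisfies_target_condition(transcript: str) -> bool:
    """True if the user's search intention can be completely identified
    from the voice information alone (the target condition)."""
    tokens = [t.strip(",.?!").lower() for t in transcript.split()]
    # A concrete target is any token that is neither a demonstrative
    # pronoun nor a function word of the query template.
    content = [t for t in tokens
               if t not in DEMONSTRATIVES and t not in FUNCTION_WORDS]
    return len(content) > 0

print(satisfies_target_condition("help me search what this is"))          # False
print(satisfies_target_condition("help me search for the blue raincoat")) # True
```

In a real system the heuristic body of satisfies_target_condition would be replaced by a forward pass of the trained network; only the branch on its result is taken from the embodiment.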
Step S130: and acquiring image data matched with the voice information.
Optionally, if the voice information acquired in the interaction process does not satisfy the target condition, image data matched with the voice information may be acquired. Image data matched with the voice information in the embodiments of the present application may be understood as image data that includes the user's action information and is acquired while the user inputs the voice information. Optionally, the action information may include a gesture, expression, posture, etc. of the user.
As an embodiment, the image data matched with the voice information may be acquired by the camera of the terminal device. For example, while inputting the voice information, the user may gesture toward the terminal device to indicate the content to be searched. Optionally, if the voice information input by the user is "hello, please help me search for that cake shop over there" and the user's finger indicates a direction corresponding to the "that" in the voice information, image data including the direction of the user's finger (or including both the user and the pointing direction) may be collected by the camera of the terminal device at that moment. Further, the image data including the user's action information, and thus the direction it indicates, can be recognized as the image data matching the user's voice information.
Alternatively, image data including the user's action information that the user selects from a local database (e.g., a gallery in a mobile phone) may be used as the image data matched with the voice information. For example, suppose a user took a picture somewhere 10 years ago with the right hand raised, pointing in one direction. Ten years later, the user returns to that place, sees that it has changed greatly, and wants to search for how the area in the pointed direction has changed (including housing construction, road construction, and the like). The user can say to the mobile phone, "please help me search how that place over there has developed", and at the same time select the picture taken ten years earlier, so that the picture can be used as image data matched with the user's voice information.
In some embodiments, the picture showing the pointing direction may also be a picture of that place taken by another user and obtained from the internet. In such a picture, the direction pointed to by the other user's finger may be the same as the direction pointed to by the user, which is not limited herein.
It should be noted that the image data matched with the voice information acquired in the embodiments of the present application may be dynamic image data, i.e., image data spanning the period from the start to the end of the voice information. Alternatively, a single frame from that period may be used; for example, the last image data acquired before the end of the voice information may be taken as the image data matched with the voice information. The image data may also be selected by positional frequency: for example, in order to clearly lock onto the area where the user's search target is located, the acquired image data may include several consecutive frames in which the user points in the same direction, or several pictures selected by the user that indicate a certain direction, and such image data may be used as the image data matched with the voice information.
As still another embodiment, in the process of acquiring the voice information, the tone of each word may be detected, and the image data acquired at the time corresponding to the loudest or highest-pitched period of the voice information may be used as the image data matched with the voice information. It can be understood that, while inputting a piece of voice information for searching, a user will often raise the volume to emphasize the key words most relevant to the search target; the image data acquired while the user utters this emphasized portion is likely the image data that best reveals the search intention, so the image data acquired at the corresponding time may be taken as the image data matching the voice information. Optionally, in a specific implementation, image data acquired at times corresponding to other features of the voice information may be used instead. For example, image data acquired at the time corresponding to a word spoken for a noticeably long duration may be used: if the user says "what is thaaat", drawing out the word "that", the image data acquired while the drawn-out word is spoken may be taken as the image data matched with the voice information.
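The frame-selection strategy above can be illustrated with a short, hedged Python sketch that picks the camera frame closest in time to the loudest recognized word. The Word structure, its timings and volumes, and the frame timestamps are assumed inputs (for example from an ASR engine and the camera pipeline), not an interface defined by this application.

```python
# Hedged sketch: choosing the image frame matched to the voice information
# by emphasis. Word timings/volumes and frame timestamps are assumed inputs.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float   # seconds
    end: float
    volume: float  # relative loudness

def frame_for_emphasis(words: list[Word], frame_times: list[float]) -> int:
    """Return the index of the frame captured during the loudest word,
    one of the matching strategies described above."""
    loudest = max(words, key=lambda w: w.volume)
    mid = (loudest.start + loudest.end) / 2
    return min(range(len(frame_times)), key=lambda i: abs(frame_times[i] - mid))

words = [Word("search", 0.0, 0.4, 0.5), Word("that", 0.5, 1.2, 0.9)]
print(frame_for_emphasis(words, [0.0, 0.5, 1.0, 1.5]))  # frame nearest 0.85s -> 2
```

The duration-based variant described above would only change the key function, e.g. `key=lambda w: w.end - w.start`.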
Step S140: a first search intention of a user corresponding to the image data is acquired.
Wherein the first search intent characterizes a general search intent of the user. Alternatively, the general direction, shape, number, and other characteristics of the search target of the user may be roughly identified by the first search intention. As one mode, by analyzing the motion information of the user in the image data matching the voice information, the first search intention of the user corresponding to the image data can be acquired.
For example, in a specific application scenario, if the user's finger points at a "SpongeBob doll" in the acquired image data, the first search intention of the user corresponding to that image data is the "SpongeBob doll". Specifically, as an embodiment, the user's facial expression may be recognized by a face recognition algorithm, the rough orientation of the user's search intention may be identified from that expression, and the rough orientation may be used as the first search intention corresponding to the image data. For example, suppose four foreigners with different appearances and languages are standing around the user, and the user wants to know personal information about the tallest one (e.g., which country he comes from, what language he speaks, and related features). The user can say "please help me search where that foreign friend over there comes from", and at the same time turn the head toward the foreigner to be searched and purse the lips toward him to indicate which one is meant. In this case, the user's facial expression information can be collected and the orientation of the search target extracted from it, so that the direction, relative to the user, of the foreigner to be searched is determined and the first search intention is confirmed.
In another embodiment, the user's motion information may be recognized by motion recognition, and features such as the approximate direction, shape, and number of the user's search targets may be determined. For example, the first search intention may be determined by extracting features from the user's motion information in the image data, calculating position information of the feature points (for example, their position coordinates and their relationship to the user's position), and thereby determining the approximate direction, form, or number of the user's search targets.
By acquiring the first search intention of the user corresponding to the image data, the inaccuracy caused by judging the user's search intention from voice data alone can be avoided, the user's real search intention can be better judged, and judgment accuracy is improved.
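As a concrete, hedged illustration of such feature-point processing, the following two-dimensional Python sketch builds a pointing ray from two assumed keypoints (wrist and fingertip) and keeps any object lying within a small angular cone around that ray as a candidate of the first search intention. The keypoint choice, coordinates, and angular threshold are all invented for the example.

```python
# Illustrative sketch of deriving the first search intention from posture
# feature points: a pointing ray is built from two keypoints, and objects
# within a small angular cone around the ray become candidate targets.
import math

def within_pointing_cone(wrist, tip, obj, max_deg=15.0) -> bool:
    """True if obj lies within max_deg degrees of the wrist->tip ray (2D)."""
    ray = (tip[0] - wrist[0], tip[1] - wrist[1])
    vec = (obj[0] - tip[0], obj[1] - tip[1])
    dot = ray[0] * vec[0] + ray[1] * vec[1]
    norm = math.hypot(*ray) * math.hypot(*vec)
    if norm == 0:
        return False
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return angle <= max_deg

objects = {"doll": (5.0, 5.2), "lamp": (5.0, -3.0)}
first_intention = [name for name, pos in objects.items()
                   if within_pointing_cone((0, 0), (1, 1), pos)]
print(first_intention)  # ['doll']
```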
Step S150: determining a target search intention of the user based on the first search intention and the voice information.
The target search intention can be understood as the user's real search intention (target); it refines the first search intention and represents the user's search target more accurately. It can be understood that if the search is performed directly on the user's voice data alone, or on image data containing the search target alone, the search intention cannot be accurately identified, so the search results are inaccurate and may be numerous, causing data redundancy and computational pressure. In particular, a search based only on voice data may fail because the voice data is incomplete: the voice may be distorted or truncated by environmental noise, or the search may go wrong because the user's pronunciation is non-standard or the speaking rate is too fast. To improve search accuracy and reliability, the embodiments of the present application therefore determine the user's target search intention from the first search intention and the voice information together, improving the accuracy of the search result.
Step S160: and searching and displaying results matched with the target search intention.
As one way, after the target search intention of the user is obtained, results matching it may be searched for and displayed. Optionally, if the matched result is an image, the final search result may be displayed visually (for example, as an animation pop-up, a video, or a static or dynamic picture); if it is text, the final result is displayed as text; and if it is audio (e.g., a song or a broadcast), the final result is presented in audio form.
By searching for a search result matched with the target search intention, accurate search can be achieved, and user experience is improved.
Step S170: and searching and displaying a result matched with the voice information.
Optionally, in the process of acquiring the voice information of the user in the interaction process, if it is determined that the acquired voice information satisfies the target condition through a preset algorithm model, a result matched with the voice information may be directly searched based on the voice information and displayed.
According to the searching method provided by the embodiment, the voice information of the user in the interactive process is acquired, whether the voice information meets the target condition is judged through the preset algorithm model, if the voice information does not meet the target condition, the image data matched with the voice information is acquired, the first searching intention of the user corresponding to the image data is acquired, the target searching intention of the user is determined based on the first searching intention and the voice information, and then the result matched with the target searching intention is searched and displayed. By means of the method, under the condition that the voice information of the user does not meet the target condition, the target search intention of the user is confirmed through the voice information and the first search intention of the user corresponding to the image data, so that a result matched with the target search intention is searched, searching is more accurate, and user experience is improved.
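Before turning to the more detailed embodiments, the overall flow of fig. 2 can be condensed into a short, non-authoritative Python sketch. Every helper below is a trivial stub standing in for a component described above (speech recognition, the preset algorithm model, camera capture, posture analysis, and the search backend); none of them is a real API.

```python
# A compact, non-authoritative sketch of the overall flow of fig. 2.
def recognize_speech(audio): return audio                      # step S110 stub
def is_complete(voice): return "this" not in voice             # step S120 stub
def capture_matched_image(voice): return {"pointed_at": "blue raincoat"}  # S130
def first_intention(image): return image["pointed_at"]         # step S140 stub
def combine(intention, voice): return voice.replace("this", intention)   # S150
def search(query): return f"results for: {query}"              # S160/S170 stub

def search_pipeline(audio):
    voice = recognize_speech(audio)
    if is_complete(voice):
        return search(voice)                 # voice alone suffices (S170)
    image = capture_matched_image(voice)     # otherwise use image data
    return search(combine(first_intention(image), voice))

print(search_pipeline("help me search for this"))
# results for: help me search for blue raincoat
```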
As shown in fig. 3, a flowchart of a method of searching provided in another embodiment of the present application is shown, where the method includes:
step S210: and acquiring voice information of the user in the interactive process.
Step S220: and judging whether the voice information meets a target condition through a preset algorithm model, wherein the target condition is used for representing the search intention of the user which can be completely identified according to the voice information.
Step S230: and acquiring the posture information of the user based on an image recognition mode.
Among the acquired image data, the data that includes the user's posture information is retained, and the data that does not is removed, saving storage space and increasing operation speed. As one approach, the posture information may include expression, gesture, and/or motion information of the user. Further, the posture information can be obtained based on image recognition; for the specific image recognition algorithm and recognition manner, reference may be made to the prior art. For example, the rough position or area of the user's search target may be determined by extracting key point information representing the user's motion or gesture from the image data, from which the rough target object, or the outline features of the target object to be searched, can be determined. The user's gesture information can also be acquired through machine recognition.
Step S240: and taking the image data comprising the posture information of the user as the image data matched with the voice information.
It is understood that, in order to better preliminarily determine the approximate range of the search target of the user, the image data including the posture information of the user may be used as the image data matched with the voice information to improve the accuracy of the search result.
Step S250: and confirming a target object in a range corresponding to the posture information of the user in the image data as the first search intention.
The target object is an object included in the approximate range of the final search target of the user. Alternatively, the target object may be one object or a plurality of objects.
As one implementation, if the image data is a single picture, i.e., the acquired image data contains only one picture including the user's posture information, the target objects in the range corresponding to the user's posture information in that picture (in this case, a number of objects within one approximate range or direction) may be confirmed as the first search intention.
As another mode, when the image data includes several pictures, each containing the user's posture information, the approximate range of the user's final search target can be narrowed through these pictures, yielding a more precise approximate search area corresponding to the final search target. In this case, the smaller number of target objects included in that area may be confirmed as the first search intention of the user, achieving a more accurate positioning of the user's general search area.
It should be noted that, in practical implementation, the image data may be other types of image data, such as a moving picture, a video segment, and the like, and is not limited herein. By determining the first search intention, the search error caused by directly searching by adopting voice data can be reduced, and the accuracy and reliability of the search result are improved.
Step S260: and acquiring a distance parameter between the target object and the first search intention.
It can be understood that when the first search intention includes a plurality of target objects, it is difficult to grasp the user's true search intention, and the distances between the individual target objects and the user are generally not equal. Therefore, as one mode, the embodiments of the present application acquire the distance parameter between the target object (the object embodying the user's posture information) and the first search intention, reducing the error in distinguishing the user's real search intention.
It should be noted that, in the embodiments of the present application, the target object here may be a hand, a foot, a facial organ, or the like of the user; alternatively, it may be any object held by the user, for example a stick, a fan, a pen, a bat, a banana, or any other object that can indicate a direction or position, which is not limited herein. The distance parameter may include the distance, angle, orientation, and other information between this object and the first search intention.
In one embodiment, the distance parameter between the target object and the first search intention may be calculated in a spatial coordinate system (for example, a three-dimensional spatial coordinate system or a Cartesian coordinate system) by acquiring spatial feature point information of the target object and the approximate azimuth or area information of the first search intention. Alternatively, a distance sensor (e.g., a laser ranging sensor or an infrared sensor) may be used to measure the distance parameter; for the specific implementation, reference may be made to the prior art, and details are not repeated here.
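A minimal sketch of this computation, together with the nearest-candidate selection of step S270 below, assuming the coordinates of the pointing object and of the candidate target objects are already expressed in one spatial frame (all coordinates and names are invented):

```python
# Hedged sketch of the distance parameter and nearest-candidate selection:
# Euclidean distances are computed between the object embodying the user's
# posture information (e.g. a fingertip at a known spatial coordinate) and
# each candidate target object, and the nearest candidate is returned.
import math

def nearest_target(pointer_xyz, candidates):
    """candidates: mapping of target-object name -> (x, y, z) position."""
    return min(candidates,
               key=lambda name: math.dist(pointer_xyz, candidates[name]))

candidates = {"raincoat_a": (1.0, 0.2, 2.0),
              "raincoat_b": (0.4, 0.1, 0.9),
              "umbrella":   (3.0, 1.0, 4.0)}
print(nearest_target((0.0, 0.0, 0.0), candidates))  # raincoat_b
```

Taking the arg-min over Euclidean distance mirrors "the search target with the minimum distance parameter"; the angle and orientation terms of the distance parameter could be folded into the same key function.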
Step S270: and taking the search target with the minimum distance parameter as the target search intention of the user.
As one way, when the first search intention includes a plurality of target objects, in order to obtain a more accurate search result, the search target with the smallest distance parameter may be used as the user's target search intention; that is, the candidate closest to the object embodying the user's posture information is taken as the target search intention. This is described as follows:
as one way, as shown in fig. 4, step S270 may include:
step S271: and carrying out semantic recognition processing on the voice information to obtain a search keyword.
Optionally, the search keyword in the embodiment of the present application is a keyword recognized from the voice information when the user searches through the voice information. For example, as one mode, semantic recognition processing may be performed on the acquired voice information through a voice algorithm model, and then the search keyword is acquired.
In one embodiment, nouns, adjectives, and adverbs in the speech information may be recognized as search keywords, while pronouns are not, because a pronoun is indefinite and likely to cause errors. Optionally, to increase keyword recognition speed and thus enable fast search, a recognition order over the kinds of search keywords may be defined. For example, the speech information may be semantically recognized in the order of nouns, adjectives, and then adverbs. It should be noted that the recognition order of each kind may be set differently for different scenes, which is not limited herein.
For example, in a specific application scenario, a user strolling in a suburban forest finds a certain leaf very beautiful and wants to search for information about it, and says: "please help me search for specific information about the yellow leaf right in front of my right foot". If semantic recognition is preset in the order of nouns, adjectives, and adverbs, the search keywords are recognized in turn as "leaf", "yellow", and "right in front of my right foot".
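The ordered extraction can be sketched as follows. A real implementation would rely on a trained part-of-speech tagger; the tiny lexicon below is invented solely to replay the leaf example.

```python
# Illustrative sketch of step S271: extracting search keywords from the
# recognized text in the order nouns -> adjectives -> adverbials.
POS = {"leaf": "noun", "yellow": "adjective",
       "right in front of my right foot": "adverbial"}
ORDER = ["noun", "adjective", "adverbial"]

def extract_keywords(text: str) -> list[str]:
    found = [(phrase, pos) for phrase, pos in POS.items() if phrase in text]
    # emit phrases grouped by the configured part-of-speech order
    return [phrase for rank in ORDER
            for phrase, pos in found if pos == rank]

text = ("please help me search for specific information about the yellow "
        "leaf right in front of my right foot")
print(extract_keywords(text))
# ['leaf', 'yellow', 'right in front of my right foot']
```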
Step S272: and taking the search target meeting the keyword from the search targets with the minimum distance parameter as the target search intention of the user.
It will be appreciated that in some cases there may be multiple search targets whose distance parameter is the smallest. For example, suppose the user stands at the center of a circle on which several search targets are distributed (i.e., each search target on the circle is at the same distance, angle, and orientation from the target object). In this case it is difficult to determine the user's true search intention; if every search target on the circle were confirmed as the target search intention, much time would be consumed, power consumption could be large, and user experience would suffer.
In this case, among the plurality of search targets having the smallest distance parameter, a search target satisfying the search keyword may be used as the target search intention of the user, that is, the final target search intention of the user may be confirmed in conjunction with the voice information. The following takes fig. 5 and fig. 6 as an example to explain the present embodiment:
as an embodiment, please refer to fig. 5, which is a schematic diagram illustrating a position relationship between a target object and a target object in the acquired image data matched with the voice information. As shown in fig. 5, 200 denotes a terminal device, and a display interface 201 of the terminal device 200 is displayed with a first search intention including posture information of a user and corresponding to the posture information of the user. Specifically, 203 in fig. 5 indicates an object corresponding to the posture information of the user, in this case, the object is a finger of the user; 202 denote first search intents, respectively including target objects such as 2021. It should be noted that, the number of the target objects shown in fig. 5 is 3, and more or fewer target objects may be actually implemented, which is not limited herein. By one approach, in the case shown in fig. 5, when it is recognized that the target object 203 is directed as shown in the drawing, the target object including 2021 or the like may be recognized as the first search intention 202.
Further, as shown in fig. 6, the distance between each target object in the first search intention 202 and the object 203 may be obtained, where the specific distance-obtaining method may adopt existing technology and is not repeated here. As shown in fig. 6, in this case the target object 2021 is closest to the object 203, its distance being denoted d2 in fig. 6; that is, the target object 2021 may be regarded as the user's target search intention.
In the case where several target objects equidistant from the object 203 exist, as suggested in fig. 6, in order to avoid the power consumption of repeated searches and a degraded user experience, the target object that also satisfies the search keyword, among those with the smallest distance parameter, may be taken as the user's target search intention. For example, if 3 target objects 2021 surround the object 203 at the same distance and the voice information is "please help me search for the yellow circle closest to me", the target object satisfying the search keyword "yellow" among the 3 may be determined as the target search intention. It can be understood that if two of the 3 target objects are yellow, further search keywords can be matched in turn until the user's final target search intention is determined.
As another embodiment, please refer to fig. 7, a schematic diagram illustrating the positional relationship between the object corresponding to the user's posture information and the target objects in the acquired image data matched with the voice information. As shown in fig. 7, the first search intention 202 and the object 203 are displayed on the display interface 201 of the terminal device 200. In this case, the object 203 is an object held by the user, whose specific type is not limited. The first search intention 202 includes a plurality of target objects such as 2022; 3 of them are shown in fig. 7 and, similarly, there may be more or fewer in practice, which is not limited herein. As one approach, in the case shown in fig. 7, when it is recognized that the object 203 points as illustrated, the target objects including 2022 and the like may be recognized as the first search intention 202.
Further, as shown in fig. 8, the distance of each target object in the first search intention 202 from the object 203 may be acquired. In this case the target object 2022 is the closest, its distance being denoted d3 in fig. 8; that is, the target object 2022 may be regarded as the user's target search intention. For the case where several target objects 2022 surround the object 203 at equal distances, reference may be made to the above description of selecting the user's target search intention, and details are not repeated here.
Step S280: and searching and displaying results matched with the target search intention.
Step S290: and searching and displaying a result matched with the voice information.
In the searching method provided by this embodiment, voice information of the user in the interaction process is acquired; whether it satisfies the target condition is judged through a preset algorithm model; if not, the user's posture information is obtained based on image recognition, image data including the posture information is taken as the image data matched with the voice information, and a target object in the range corresponding to the posture information in the image data is confirmed as the first search intention. A distance parameter between the target object and the first search intention is then acquired, the search target with the minimum distance parameter is taken as the user's target search intention, semantic recognition is performed on the voice information to obtain a search keyword, and, among multiple search targets with the minimum distance parameter, the one satisfying the keyword is taken as the target search intention; results matching it are searched for and displayed. In this way, when the user's voice information does not satisfy the target condition, the search target satisfying the keyword among those with the smallest distance parameter is used as the target search intention, so that the search is more accurate and user experience is improved.
As shown in fig. 9, a flowchart of a method of searching provided in another embodiment of the present application is shown, where the method includes:
step S310: and acquiring voice information of the user in the interactive process.
Step S320: and judging whether the voice information meets a target condition through a preset algorithm model, wherein the target condition is used for representing the search intention of the user which can be completely identified according to the voice information.
Step S330: and acquiring a starting instruction of the camera.
As one mode, the terminal device may trigger an opening instruction for the camera when acquiring the voice information (whether input by the user or selected by the user) in the interaction process; the camera is opened if the user confirms, and is temporarily not opened if the user refuses or makes no selection.
As another mode, the user can start some intelligent search software and trigger the start instruction of the camera at the same time, so as to avoid the problems of inaccurate search or untimely search caused by the start delay of the camera.
Optionally, the terminal device may be configured to automatically trigger opening of its camera at the moment the user picks up the device and makes a corresponding action. For example, when the terminal device senses a change in gravity, it may determine that the user has set the device upright, predict that the user may need to capture an image, and accordingly open the camera.
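A rough sketch of this gravity-sensing trigger follows; sensor access and the tilt threshold are platform-specific assumptions (`read_accelerometer` and `open_camera` are hypothetical hooks), not details fixed by the patent.

```python
import math

def is_upright(ax: float, ay: float, az: float, max_tilt_deg: float = 25.0) -> bool:
    """True when gravity lies mostly along the device's y-axis, i.e. the
    user has just set the device upright."""
    g = math.sqrt(ax * ax + ay * ay + az * az)
    if g == 0.0:
        return False
    tilt = math.degrees(math.acos(min(1.0, abs(ay) / g)))
    return tilt <= max_tilt_deg

def maybe_open_camera(read_accelerometer, open_camera) -> None:
    # Both callables are hypothetical platform hooks: the first returns an
    # (ax, ay, az) reading, the second asks the OS to start the camera.
    ax, ay, az = read_accelerometer()
    if is_upright(ax, ay, az):
        open_camera()
```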
Step S340: And acquiring image data which is acquired by the camera and comprises the posture information of the user, and taking the image data as image data matched with the voice information.
Optionally, after the camera is turned on, it may capture the posture information, such as indicative actions, that the user makes when currently wanting to search for a certain target, and the image data acquired by the camera that includes the posture information of the user may then be taken as the image data matched with the voice information. There are many ways to determine whether image data includes user posture information; for example, feature-point information of the image data may be recognized with related algorithms such as machine recognition and motion recognition, and the acquired image data analyzed for whether it includes the posture information of the user.
For example, for image data in which the user points in a certain direction with the arm raised forty-five degrees obliquely upward, feature points may be acquired on the arm and motion-trend analysis, posture estimation and the like performed, whereby the image data can be determined to be image data including the posture information of the user.
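As an illustrative sketch of such a posture check, suppose a generic pose-estimation model has already produced named arm keypoints; the keypoint names and the forearm-ray heuristic below are assumptions for illustration, not the patent's prescribed algorithm.

```python
import math

def pointing_ray(keypoints):
    """keypoints: e.g. {'right_elbow': (x, y), 'right_wrist': (x, y)} as
    produced by an off-the-shelf pose-estimation model. Returns the
    (origin, unit direction) of the pointing arm, or None when the image
    carries no usable posture information."""
    try:
        ex, ey = keypoints["right_elbow"]
        wx, wy = keypoints["right_wrist"]
    except (KeyError, TypeError):
        return None
    dx, dy = wx - ex, wy - ey
    norm = math.hypot(dx, dy)
    if norm < 1e-6:
        return None
    # The ray starts at the wrist and extends along the forearm; objects
    # lying near this ray are candidates for the first search intention.
    return (wx, wy), (dx / norm, dy / norm)
```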
Alternatively, image data selected by the user may be acquired as the image data matched with the voice information, for example image data the user selects from a local database or from a network.
Step S350: And acquiring a first search intention of the user corresponding to the image data.
Step S360: and supplementing the content of the voice information according to the first search intention to obtain target voice data.
After the first search intention is obtained, the content of the voice information can be supplemented based on the first search intention so that the voice data becomes complete, and the complete voice data is taken as the target voice data.
For example, in a specific application scenario, suppose the voice information spoken by the user preparing to search is "help me to search for this", and in the image data acquired at the same time the user's finger points to a green round hydroponic plant; after content supplementation is performed on the voice information, "help me to search for the green round hydroponic plant" is obtained, that is, the complete voice data.
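A minimal sketch of this supplementation step follows, assuming the pointed-at object has already been recognized and labeled; the naive substring matching is an illustrative simplification.

```python
DEMONSTRATIVES = ("this one", "that one", "this", "that", "it")

def supplement_voice(transcript: str, object_label: str) -> str:
    """Replace the first unresolved demonstrative with the label of the
    object the user points at, yielding the complete target voice data."""
    for word in DEMONSTRATIVES:
        if word in transcript:
            return transcript.replace(word, object_label, 1)
    return transcript

# supplement_voice("help me to search for this",
#                  "the green round hydroponic plant")
# -> "help me to search for the green round hydroponic plant"
```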
Step S370: and determining the target search intention of the user according to the target voice data.
It can be understood that after the target voice data is obtained, semantic parsing is performed on it, that is, the target voice is converted into text, so that the target search intention of the user can be obtained.
Step S380: and searching and displaying results matched with the target search intention.
Step S390: and searching and displaying a result matched with the voice information.
According to the searching method provided by this embodiment: the voice information of the user in the interaction process is obtained; whether the voice information meets the target condition is judged through the preset algorithm model; if the voice information does not meet the target condition, an opening instruction of the camera is acquired, and the image data collected by the camera that includes the posture information of the user is taken as the image data matched with the voice information; the first search intention of the user corresponding to the image data is acquired; the content of the voice information is supplemented according to the first search intention to obtain the target voice data; the target search intention of the user is determined according to the target voice data; and a result matched with the target search intention is searched for and displayed. In this way, when the voice information of the user does not meet the target condition, the voice information is supplemented based on the first search intention to obtain the target voice data, and the target search intention of the user is then determined from the target voice data, improving search accuracy.
As shown in fig. 10, a block diagram of a searching apparatus 400 provided in an embodiment of the present application is shown. The apparatus 400 runs on a terminal device having a display screen or another image output device; the terminal device may be an electronic device such as a smart phone, a tablet computer, or a wearable smart terminal. The apparatus 400 includes:
a first obtaining module 410, configured to obtain voice information of a user during an interaction process.
The determining module 420 is configured to determine whether the voice information meets a target condition through a preset algorithm model.
A second obtaining module 430, configured to obtain image data matched with the voice information if the voice information does not meet a target condition.
As one mode, the second obtaining module 430 may be specifically configured to obtain posture information of the user based on image recognition, where the posture information includes expression, gesture, and/or motion information of the user, and to then take the image data including the posture information of the user as the image data matched with the voice information.
Optionally, the second obtaining module 430 may be further configured to obtain a start instruction of the camera; acquiring image data which is acquired by the camera and comprises the posture information of the user, and taking the image data as image data matched with the voice information; or acquiring image data selected by a user as image data matched with the voice information.
A third obtaining module 440, configured to obtain the first search intention of the user corresponding to the image data.
As one mode, the third obtaining module 440 may be specifically configured to determine, as the first search intention, a target object in a range corresponding to the posture information of the user in the image data.
A determining module 450 for determining a target search intention of the user based on the first search intention and the voice information.
As one mode, the determining module 450 may be specifically configured to obtain a distance parameter between the target object and the first search intention; and then taking the search target with the minimum distance parameter as the target search intention of the user.
Optionally, when a plurality of search targets share the minimum distance parameter, the determining module 450 may include a processing unit, configured to perform semantic recognition processing on the voice information to obtain a search keyword, and a target-search-intention determining unit, configured to take, among the plurality of search targets with the minimum distance parameter, the search target satisfying the keyword as the target search intention of the user.
As another mode, the determining module 450 may be further configured to supplement the content of the voice information according to the first search intention to obtain target voice data; and determining the target search intention of the user according to the target voice data.
And a search processing module 460 for searching and displaying a result matched with the target search intention.
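To show how these modules could cooperate end to end, here is a minimal composition sketch; it is pure plumbing under the assumption that each module is supplied as a callable, since the patent does not fix the underlying models.

```python
class SearchApparatus:
    """Hypothetical wiring of apparatus 400; every callable is a
    caller-supplied stand-in for the corresponding module."""

    def __init__(self, judge, get_image, get_first_intent, decide, search):
        self.judge = judge                        # judging module 420
        self.get_image = get_image                # second obtaining module 430
        self.get_first_intent = get_first_intent  # third obtaining module 440
        self.decide = decide                      # determining module 450
        self.search = search                      # search processing module 460

    def run(self, voice_info):
        if self.judge(voice_info):
            # Target condition met: the voice alone identifies the intent.
            return self.search(voice_info)
        image = self.get_image(voice_info)
        first_intent = self.get_first_intent(image)
        target_intent = self.decide(first_intent, voice_info)
        return self.search(target_intent)
```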
The search device provided by this embodiment obtains the voice information of the user in the interaction process, judges through a preset algorithm model whether the voice information meets the target condition, obtains image data matched with the voice information if it does not, obtains the first search intention of the user corresponding to the image data, determines the target search intention of the user based on the first search intention and the voice information, and then searches for and displays the result matched with the target search intention. In this way, when the voice information of the user does not meet the target condition, the target search intention is confirmed from the voice information together with the first search intention corresponding to the image data, so that the search is more accurate and user experience is improved.
The searching device provided by the embodiment of the application is used for realizing the corresponding searching method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
It can be clearly understood by those skilled in the art that the searching apparatus provided in the embodiment of the present application can implement each process in the foregoing method embodiments; for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the embodiments provided in the present application, the coupling, direct coupling, or communication connection between the modules shown or discussed may be through certain interfaces, and the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or in other forms.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module.
Referring to fig. 11, a block diagram of a terminal device 501 according to an embodiment of the present application is shown. The terminal device 501 may be any terminal device capable of running an application, such as a smart phone, a tablet computer, or an e-book reader. The terminal device 501 in the present application may include one or more of the following components: a processor 502, a memory 504, and one or more applications, where the one or more applications may be stored in the memory 504 and configured to be executed by the one or more processors 502, and configured to perform the methods described in the foregoing method embodiments.
Processor 502 may include one or more processing cores. The processor 502 connects various parts within the terminal device 501 using various interfaces and lines, and performs the functions of the terminal device 501 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 504 and by calling data stored in the memory 504. Alternatively, the processor 502 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 502 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, user interface, and application programs; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is to be understood that the modem may also not be integrated into the processor 502 and instead be implemented by a separate communication chip.
The memory 504 may include Random Access Memory (RAM) or Read-Only Memory (ROM). The memory 504 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 504 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the method embodiments described above, and the like. The data storage area may store data created by the terminal device 501 in use (such as a phonebook, audio and video data, and chat logs).
Referring to fig. 12, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 600 has stored therein program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 600 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 600 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 600 has storage space for program code 610 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 610 may, for example, be compressed in a suitable form.
To sum up, according to the searching method, searching device, terminal device, and storage medium provided by the embodiments of the application: the voice information of the user in the interaction process is obtained; whether the voice information meets the target condition is judged through a preset algorithm model; if the voice information does not meet the target condition, the image data matched with the voice information is obtained; the first search intention of the user corresponding to the image data is obtained; the target search intention of the user is determined based on the first search intention and the voice information; and the result matched with the target search intention is then searched for and displayed. In this way, when the voice information of the user does not meet the target condition, the target search intention is confirmed from the voice information together with the first search intention corresponding to the image data, so that the search is more accurate and user experience is improved.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features equivalently replaced, and such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of searching, the method comprising:
acquiring voice information of a user in an interaction process;
judging whether the voice information meets a target condition through a preset algorithm model, wherein the target condition is used for representing whether the search intention of the user can be completely identified according to the voice information;
if the voice information does not meet the target condition, acquiring image data matched with the voice information;
acquiring a first search intention of a user corresponding to the image data;
determining a target search intention of the user based on the first search intention and the voice information;
and searching and displaying results matched with the target search intention.
2. The method of claim 1, wherein the step of obtaining image data matching the voice information comprises:
acquiring posture information of the user based on an image recognition mode, wherein the posture information comprises expression, gesture and/or motion information of the user;
and taking the image data comprising the posture information of the user as the image data matched with the voice information.
3. The method of claim 2, wherein the step of obtaining a first search intent of the user corresponding to the image data comprises:
and confirming a target object in a range corresponding to the posture information of the user in the image data as the first search intention.
4. The method of any of claims 2-3, wherein the step of determining the user's target search intent based on the first search intent and the speech information comprises:
acquiring a distance parameter between a target object and the first search intention;
and taking the search target with the minimum distance parameter as the target search intention of the user.
5. The method according to claim 4, wherein a plurality of search targets have the minimum distance parameter, and the step of taking the search target with the minimum distance parameter as the target search intention of the user comprises:
performing semantic recognition processing on the voice information to obtain search keywords;
and taking the search target meeting the keyword from the search targets with the minimum distance parameter as the target search intention of the user.
6. The method of any of claims 1-3, wherein the step of determining the user's target search intent based on the first search intent and the speech information comprises:
supplementing the content of the voice information according to the first search intention to obtain target voice data;
and determining the target search intention of the user according to the target voice data.
7. The method of claim 1, wherein the step of obtaining image data matching the voice information comprises:
acquiring a starting instruction of a camera;
acquiring image data which is acquired by the camera and comprises the posture information of the user, and taking the image data as image data matched with the voice information;
or acquiring image data selected by a user as image data matched with the voice information.
8. A search apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring voice information of a user in the interaction process;
the judging module is used for judging whether the voice information meets the target condition through a preset algorithm model;
the second acquisition module is used for acquiring image data matched with the voice information if the voice information does not meet the target condition;
the third acquisition module is used for acquiring the first search intention of the user corresponding to the image data;
a determination module for determining a target search intention of the user based on the first search intention and the voice information;
and the search processing module is used for searching and displaying the result matched with the target search intention.
9. A terminal device, comprising:
a memory;
one or more processors coupled with the memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 7.
CN201911229868.8A 2019-12-04 2019-12-04 Searching method, searching device, terminal equipment and storage medium Pending CN110955818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911229868.8A CN110955818A (en) 2019-12-04 2019-12-04 Searching method, searching device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110955818A true CN110955818A (en) 2020-04-03



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200403