WO2022193911A1 - Instruction information acquisition method and apparatus, readable storage medium, and electronic device - Google Patents


Info

Publication number
WO2022193911A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
target image
target
category
instruction information
Application number
PCT/CN2022/077138
Other languages
French (fr)
Chinese (zh)
Inventor
金越
郭彦东
李亚乾
侯志刚
Original Assignee
Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Application filed by Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Publication of WO2022193911A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/041Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
    • G06F3/0416Control or interface arrangements specially adapted for digitisers

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and in particular, to a method for acquiring instruction information, a device for acquiring instruction information, a computer-readable storage medium, and an electronic device.
  • the mobile terminal can realize functions such as voice control and information query through a voice assistant, and can realize functions such as image information acquisition through a visual assistant.
  • It is difficult for existing voice assistants or visual assistants to accurately generate user instruction information, resulting in poor user experience.
  • the purpose of the present disclosure is to provide a method for acquiring instruction information, a device for acquiring instruction information, a computer-readable storage medium and an electronic device, thereby solving the problem of difficulty in accurately generating instruction information in the related art at least to a certain extent.
  • A method for acquiring instruction information, comprising: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image; acquiring voice information and identifying the text information corresponding to the voice information, wherein the voice information is information associated with the target image; and generating instruction information according to the text information and the feature information of the target image.
  • An instruction information acquisition device, comprising: an image information extraction module configured to acquire a target image and perform information extraction on the target image to obtain feature information of the target image; a text information acquisition module configured to acquire voice information and identify the text information corresponding to the voice information, wherein the voice information is information associated with the target image; and an instruction information generation module configured to generate instruction information according to the text information and the feature information of the target image.
  • A computer-readable medium on which a computer program is stored; when the program is executed by a processor, it implements the method for acquiring instruction information described in the foregoing embodiments.
  • An electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for acquiring instruction information described in the foregoing embodiments.
  • FIG. 1 schematically shows the system architecture of the present exemplary embodiment;
  • FIG. 2 schematically shows the electronic device of the present exemplary embodiment;
  • FIG. 3 schematically shows a flowchart of a method for acquiring instruction information according to an embodiment of the present disclosure;
  • FIG. 4 schematically shows a flowchart of a method for acquiring object information according to an embodiment of the present disclosure;
  • FIG. 5 schematically shows a flowchart of a method for determining an object category according to an embodiment of the present disclosure;
  • FIG. 6 schematically shows a flowchart of a method for generating instruction information according to an embodiment of the present disclosure;
  • FIG. 7 schematically shows a flowchart of determining a target object according to a matching result according to an embodiment of the present disclosure;
  • FIG. 8 schematically shows a flowchart of determining a target object from candidate objects according to an embodiment of the present disclosure;
  • FIG. 9 schematically shows a flowchart of a method for determining a target object according to an embodiment of the present disclosure;
  • FIG. 10 schematically shows a flowchart of another method for generating instruction information according to an embodiment of the present disclosure;
  • FIG. 11 schematically shows a flowchart of yet another method for generating instruction information according to an embodiment of the present disclosure;
  • FIG. 12 schematically shows a flowchart of a method for acquiring instruction information according to a specific embodiment of the present disclosure;
  • FIG. 13 schematically shows a flowchart of a method for acquiring instruction information according to another specific embodiment of the present disclosure;
  • FIG. 14 schematically shows a flowchart of a method for acquiring instruction information according to yet another specific embodiment of the present disclosure;
  • FIG. 16 schematically shows a block diagram of an apparatus for acquiring instruction information according to an embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • a mobile terminal is installed with an application program with functions of a visual assistant and a voice assistant.
  • The visual assistant mainly captures the visual information of the user's environment, analyzes the visual information presented in the form of pictures or videos, understands the user's environment, the objects in it, and the relationships between objects, further understands the user's intention, and provides users with reasonable recommendations.
  • the voice assistant mainly captures the user's voice information, converts the voice information into text, further analyzes the user's intention, and realizes intelligent interaction with the user.
  • When the visual assistant analyzes visual information presented in the form of pictures or videos, it may judge the user's intention inaccurately, or, when there are multiple objects in the picture or video, judge inaccurately which object the user is most concerned about.
  • Due to noisy ambient background sounds or outdated equipment, resulting in unclear audio capture or unclear meaning of the user's speech, it is difficult for voice assistants to accurately analyze user intent.
  • an embodiment of the present disclosure first provides a method for acquiring instruction information, and the method for acquiring instruction information is applied to the system architecture of the exemplary embodiment of the present disclosure.
  • FIG. 1 shows a schematic diagram of a system architecture according to an exemplary embodiment of the present disclosure.
  • the system architecture 100 may include: a terminal 110 , a network 120 and a server 130 .
  • the terminal 110 may be various electronic devices with image capturing functions and audio capturing functions, including but not limited to mobile phones, tablet computers, digital cameras, personal computers, and the like.
  • the medium used by the network 120 to provide a communication link between the terminal 110 and the server 130 may include various connection types, such as wired, wireless communication links, or fiber optic cables.
  • the numbers of terminals, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminals, networks and servers according to implementation needs.
  • the server 130 may be a server cluster composed of multiple servers, or the like.
  • the method for acquiring instruction information provided by the embodiment of the present disclosure may be executed by the terminal 110, for example, after the terminal 110 acquires the voice information and the target image, the instruction information is generated.
  • The method for acquiring instruction information provided by the embodiments of the present disclosure may also be executed by the server 130. For example, after the terminal 110 obtains the voice information and the target image, the voice information and the target image are uploaded to the server 130, so that the server 130 can generate the instruction information, which is not limited in the present disclosure.
  • Exemplary embodiments of the present disclosure provide an electronic device for implementing a method for acquiring instruction information, which may be the terminal 110 or the server 130 in FIG. 1 .
  • the electronic device includes at least a processor and a memory, the memory is used for storing executable instructions of the processor, and the processor is configured to execute the instruction information acquisition method by executing the executable instructions.
  • Electronic devices can be implemented in various forms, including mobile devices such as mobile phones, tablet computers, notebook computers, personal digital assistants (PDAs), navigation devices, wearable devices and drones, as well as fixed devices such as desktop computers and smart TVs.
  • The following takes the mobile terminal 200 in FIG. 2 as an example to illustrate the structure of the electronic device. It will be understood by those skilled in the art that, apart from components specifically intended for mobile purposes, the configuration in FIG. 2 can also be applied to fixed-type devices.
  • the mobile terminal 200 may include more or fewer components than shown, or combine some components, or separate some components, or different component arrangements.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the interface connection relationship between the components is only schematically shown, and does not constitute a structural limitation of the mobile terminal 200 .
  • the mobile terminal 200 may also adopt an interface connection manner different from that in FIG. 2 , or a combination of multiple interface connection manners.
  • The mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone jack 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, buttons 294, a Subscriber Identification Module (SIM) card interface 295, and so on.
  • the sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyro sensor 2803, an air pressure sensor 2804, and the like.
  • The mobile terminal 200 implements a display function through a graphics processing unit (GPU), a display screen 290, an application processor, and the like.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering, and connects the display 290 and the application processor.
  • Processor 210 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the mobile terminal 200 may include one or more display screens 290 for displaying images, videos, and the like.
  • The mobile terminal 200 may implement a shooting function through an image signal processor (ISP), a camera module 291, an encoder, a decoder, a GPU, a display screen 290, an application processor, and the like.
  • the camera module 291 is used to capture still images or videos, collect light signals through photosensitive elements, and convert them into electrical signals.
  • the ISP is used to process the data fed back by the camera module 291 and convert the electrical signal into a digital image signal.
  • The mobile terminal 200 may implement audio functions, such as music playback and recording, through an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, an application processor, and the like.
  • The audio module 270 is used for converting digital audio information into an analog audio signal for output, and for converting an analog audio input into a digital audio signal. The audio module 270 may also be used to encode and decode audio signals.
  • the speaker 271 is used for converting audio electrical signals into sound signals.
  • the receiver 272 is used for converting audio electrical signals into sound signals.
  • the microphone 273 is used to convert the sound signal into an electrical signal.
  • the earphone interface 274 is used for connecting wired earphones.
  • the keys 294 include a power-on key, a volume key, and the like.
  • the keys 294 may be mechanical keys or touch keys.
  • The mobile terminal 200 may receive key inputs and generate key signal inputs related to user settings and function control of the mobile terminal 200.
  • FIG. 3 shows a schematic flowchart of a method for acquiring instruction information.
  • The method for acquiring instruction information includes at least the following steps: Step S310: acquire a target image and perform information extraction on the target image to obtain feature information of the target image; Step S320: acquire voice information and identify text information corresponding to the voice information, wherein the voice information is information associated with the target image; Step S330: generate instruction information according to the text information and the feature information of the target image.
  • The method for acquiring instruction information in the present disclosure can fuse the feature information of the target image and the voice information associated with the target image to generate instruction information, improving the accuracy of the instruction information and further enhancing the user's interaction experience with the mobile terminal.
  • the following describes a method for acquiring instruction information.
  • In step S310, a target image is acquired and information extraction is performed on the target image to obtain feature information of the target image.
  • the target image may be an image captured in real time by a camera function of the mobile terminal, or may be a local image stored in the mobile terminal.
  • the user can send an image acquisition request to the mobile terminal, and the mobile terminal determines the target image according to the image acquisition request.
  • the image acquisition request may be a shooting request, and the mobile terminal responds to the shooting request and activates the shooting function to collect the target image in real time.
  • The shooting request may be the user triggering a shooting button on the mobile terminal, for example, the user clicking a camera icon on the mobile terminal; the shooting request may also be the user waking up the shooting function of the mobile terminal through a preset voice.
  • the image acquisition request may also be an image selection request.
  • the mobile terminal responds to the image selection request and displays a local image, and responds to a user's trigger operation on the local image, and determines the target image in the local image according to the trigger operation.
  • There may be one or more target images.
  • the mobile terminal enables the shooting function, collects the target video through the camera module, acquires a video frame in the target video every preset time period, and uses the acquired multiple video frames as the target image.
  • the preset time period may be set according to the actual situation, for example, a video frame may be acquired every 30ms in the target video, which is not specifically limited in the present disclosure.
  • the feature information of the target image includes object information of each object in the target image and/or image parameter information of the target image.
  • the object information includes object category and object position
  • the image parameter information may include parameter information such as image brightness, chroma, contrast, saturation or sharpness.
  • object extraction is performed on the target image, and object information of each object in the target image is acquired.
  • the object information includes object category and object location.
  • object extraction is performed on the target image through a target detection model or an image segmentation model, and the object category and object position of each object in the target image are acquired.
  • the target detection model can be Faster R-CNN model, RetinaNet model or YOLO model, etc.
  • the image segmentation model can be DeepLab-V3 model, RefineNet model or PSPNet model, etc.
  • the saliency detection model can also be used to extract objects from the target image to obtain the object positions of each object in the target image.
  • the saliency detection model may be a saliency detection model based on a spectral residual method, or a saliency detection model based on global contrast, which is not specifically limited in the present disclosure.
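  • As a minimal sketch of this object extraction step (assuming PyTorch and torchvision are available; the model choice, score threshold, and helper names are illustrative, not the disclosure's implementation):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Any of the detectors named above could stand in for this pretrained model.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def extract_objects(image_path, score_threshold=0.5):
    """Return (category_id, box, confidence) for each detected object.
    Boxes follow the (x_start, y_start, x_end, y_end) convention."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]
    return [(int(label), box.tolist(), float(score))
            for label, box, score in zip(output["labels"],
                                         output["boxes"],
                                         output["scores"])
            if score >= score_threshold]
```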
  • the object category of each object is determined according to the object position of each object.
  • The detailed process of determining the object category of each object is as follows: first, the target image is cropped according to the object position of each object to obtain the sub-target image corresponding to each object; then, feature extraction is performed on the sub-target image corresponding to each object to obtain the feature vector of each object; finally, the second predicted category corresponding to the feature vector of each object is determined according to the second preset mapping relationship, and the second predicted category corresponding to the feature vector of each object is configured as the object category of each object.
  • The second preset mapping relationship includes the association relationship between the feature vector and the second predicted category.
  • a plurality of target image samples are acquired in advance, and a binary classification model, a target detection model, an image segmentation model, and a saliency detection model are respectively trained according to the plurality of target image samples.
  • the target image sample can be an image marked with a rectangular frame or a mask.
  • whether there is an object in the target image can be determined by the pixel value of each pixel in the target image. If the pixel value of each pixel in the target image is the same, it is determined that there is no object in the target image, and if the pixel value of each pixel in the target image is different, it is determined that one or more objects exist in the target image. In addition, whether there is an object in the target image can also be determined according to the binary classification model. When an object exists in the target image, the target image is subjected to object extraction, and the object information of each object is obtained. In addition, it can also be pre-determined whether there is an object in the picture captured by the camera module, and when there is an object in the picture, the target image or the target video can be acquired in real time.
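  • A sketch of the pixel-value check described above (NumPy is assumed; in practice the learned binary classification model would replace this heuristic):

```python
import numpy as np
from PIL import Image

def image_has_object(image_path):
    """If every pixel of the target image has the same value, treat the
    image as containing no object; any variation implies at least one object."""
    pixels = np.asarray(Image.open(image_path).convert("L"))
    return pixels.min() != pixels.max()
```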
  • FIG. 4 shows a schematic flowchart of a method for acquiring object information.
  • the process may include at least steps S410 to S430, and the details are as follows:
  • In step S410, the object position of each object, the first predicted category of each object, and the first confidence level corresponding to the first predicted category are acquired.
  • the target image is input into the target detection model or the image segmentation model to obtain the object position of each object in the target image, the first predicted category of each object, and the first predicted category corresponding to the first predicted category. a confidence level.
  • the object position of the object may include the position coordinates of the object in the target image.
  • The object position may be the position coordinate set of the detection frame where the object is located, where the position coordinate set includes the starting and ending coordinates in the horizontal direction and the starting and ending coordinates in the vertical direction; the object position may also be the starting coordinate point of the detection frame where the object is located together with the size of the detection frame, where the starting coordinate point includes the starting coordinates in the horizontal and vertical directions, and the size of the detection frame includes the sizes in the horizontal and vertical directions.
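  • The two position representations are interchangeable; a sketch of the conversion (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class DetectionFrame:
    x_start: float
    y_start: float
    x_end: float
    y_end: float

    @classmethod
    def from_start_and_size(cls, x_start, y_start, width, height):
        # (start point, size) form -> (start, end) coordinate-set form.
        return cls(x_start, y_start, x_start + width, y_start + height)
```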
  • the first confidence level represents the probability that the first predicted category of the object is the real object category of the object.
  • In step S420, the feature vector of each object is obtained according to the object position, and the second predicted category of each object and the second confidence level corresponding to the second predicted category are determined according to the second preset mapping relationship.
  • the target image is cropped according to the position of the object to obtain sub-target images corresponding to each object; and feature extraction is performed on the sub-target images to obtain feature vectors of each object.
  • the sub-target image is input into the feature extraction model to obtain the feature vector corresponding to the sub-target image.
  • The feature extraction model may be a color histogram model, through which the color features of the sub-target image are extracted; a local binary pattern (LBP) model or a gray-level co-occurrence matrix model, through which the local texture features of the sub-target image are extracted; or a Canny or Sobel operator edge detection model, through which the edge features of the sub-target image are extracted; and so on.
  • the feature extraction model may also be a combination of two or more of the color histogram model, the LBP model or the gray level co-occurrence matrix model, the Canny operator edge detection model or the Sobel operator edge detection model.
  • the feature vector of each object is constructed by one or more features of the color feature, local texture feature, and edge feature of the sub-target image.
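  • A minimal sketch of such a hand-crafted feature vector (assuming OpenCV and NumPy; the chosen features and bin counts are illustrative):

```python
import cv2
import numpy as np

def object_feature_vector(sub_image):
    """Concatenate a color histogram and an edge statistic into one feature vector."""
    # Color feature: a joint 8x8x8 histogram, normalized to be size-invariant.
    hist = cv2.calcHist([sub_image], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256]).flatten()
    hist /= hist.sum() + 1e-8
    # Edge feature: Canny edge density as a crude texture/shape descriptor.
    gray = cv2.cvtColor(sub_image, cv2.COLOR_BGR2GRAY)
    edge_density = cv2.Canny(gray, 100, 200).mean() / 255.0
    return np.concatenate([hist, [edge_density]])
```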
  • the sub-target image samples are acquired in advance, and the feature extraction model is trained by using the sub-target image samples.
  • the sub-target image sample is an image including only a single object.
  • the second preset mapping relationship includes an association relationship between the feature vector and the second prediction category.
  • the feature vector of the object is respectively matched with one or more feature vectors in the second preset mapping relationship, and the matching degree between the feature vector of the object and the feature vector in the second preset mapping relationship is obtained.
  • the second predicted category corresponding to the feature vector in the second preset mapping relationship with the largest matching degree is configured as the second predicted category of the object, and the matching degree is configured as the second confidence degree.
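  • The matching against the second preset mapping relationship can be sketched with cosine similarity (the disclosure does not fix a particular similarity measure; this choice is an assumption):

```python
import numpy as np

def match_second_category(feature, mapping):
    """mapping: list of (reference_vector, category) pairs forming the
    second preset mapping relationship. Returns the best category and its
    matching degree, which doubles as the second confidence level."""
    best_category, best_degree = None, -1.0
    for reference, category in mapping:
        degree = float(np.dot(feature, reference) /
                       (np.linalg.norm(feature) * np.linalg.norm(reference) + 1e-8))
        if degree > best_degree:
            best_category, best_degree = category, degree
    return best_category, best_degree
```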
  • In step S430, the object category of each object is determined according to the first predicted category and the second predicted category, as well as the first confidence level corresponding to the first predicted category and the second confidence level corresponding to the second predicted category.
  • FIG. 5 shows a schematic flowchart of a method for determining an object category. As shown in FIG. 5, the flow includes at least steps S510 to S530. The details are as follows: In step S510, it is determined whether the first predicted category and the second predicted category are the same.
  • the category identifiers corresponding to the first predicted category and the second predicted category may be compared; if the category identifier corresponding to the first predicted category is the same as the category identifier corresponding to the second predicted category, then It is determined that the first predicted category is the same as the second predicted category; if the category identifier corresponding to the first predicted category is different from the category identifier corresponding to the second predicted category, it is determined that the first predicted category is different from the second predicted category.
  • In step S520, when the first predicted category is the same as the second predicted category, the first predicted category or the second predicted category is configured as the object category of each object.
  • the first predicted category or the second predicted category may be configured as the object category of each object.
  • In step S530, when the first predicted category is different from the second predicted category, it is determined whether the first confidence level is greater than the second confidence level, and the object category of each object is determined according to the judgment result.
  • When the first confidence level is greater than the second confidence level, the first predicted category is configured as the object category of each object; when the first confidence level is less than or equal to the second confidence level, the second predicted category is configured as the object category of each object.
  • It may first be determined whether the first confidence level corresponding to the first predicted category and the second confidence level corresponding to the second predicted category are greater than or equal to a confidence threshold. If the first confidence level is greater than or equal to the confidence threshold and the second confidence level is less than the confidence threshold, the first predicted category is used as the object category of the object; if the second confidence level is greater than or equal to the confidence threshold and the first confidence level is less than the confidence threshold, the second predicted category is used as the object category of the object; if both the first confidence level and the second confidence level are greater than or equal to the confidence threshold, the object category of each object is determined according to the above embodiment; if both the first confidence level and the second confidence level are less than the confidence threshold, the object and the object information corresponding to the object can be discarded.
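  • Putting steps S510 to S530 and the thresholding together, a sketch of the fusion rule (the threshold value is illustrative):

```python
def fuse_categories(first_category, first_conf, second_category, second_conf,
                    threshold=0.5):
    """Combine the detector's prediction with the feature-matching prediction.
    Returns None when both confidences fall below the threshold (discard)."""
    first_ok, second_ok = first_conf >= threshold, second_conf >= threshold
    if not first_ok and not second_ok:
        return None                          # discard the object
    if first_ok and not second_ok:
        return first_category
    if second_ok and not first_ok:
        return second_category
    if first_category == second_category:    # step S520: predictions agree
        return first_category
    # Step S530: predictions differ, so the higher confidence wins.
    return first_category if first_conf > second_conf else second_category
```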
  • information extraction may be performed on the target image to obtain image parameter information of the target image.
  • the image parameter information may include parameter information such as image brightness, chroma, contrast, saturation or sharpness.
  • the image parameter information of the target image can be determined according to the shooting parameters by acquiring the shooting parameters of the camera module, or the image parameter information can be determined according to the EXIF information by acquiring the exchangeable image file information (EXIF information) of the target image.
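  • For example, with Pillow the EXIF record of the target image can be read directly (a sketch; which tags, such as "BrightnessValue" or "ExposureTime", are actually present depends on the camera):

```python
from PIL import Image, ExifTags

def read_exif(image_path):
    """Return the EXIF tags of the image as a {tag_name: value} dict."""
    exif = Image.open(image_path).getexif()
    return {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in exif.items()}
```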
  • In step S320, voice information is acquired and the text information corresponding to the voice information is recognized, wherein the voice information is information associated with the target image.
  • the voice information may be information associated with the target image. That is to say, in the time period when the mobile terminal acquires the target image or the time period when the mobile terminal displays the target image, the recording function can be enabled to collect the user's voice information in real time. For example, in response to a user's video shooting request, the mobile terminal enables the shooting function and the recording function at the same time, obtains the target video, and obtains multiple target images and voice information in the target video.
  • the video shooting request may be formed by a user's triggering operation on the camera, or may be formed by a user's triggering operation on the scanning function in the smart assistant.
  • The mobile terminal may obtain the function authority of the intelligent assistant in advance, so that when the user triggers the scanning function of the intelligent assistant, the camera function and the recording function are enabled.
  • the mobile terminal acquires the target image in advance, and stores the target image in the internal memory or the external memory.
  • the mobile terminal acquires the target image in the internal memory or external memory according to the user's request and displays it on the display screen.
  • the recording function is enabled through the user's recording request, and the user's voice information is collected in real time.
  • The video shooting request and the recording request may be formed by the user triggering the video shooting button or the recording button on the mobile terminal. For example, when the user triggers the camera icon or the recording icon on the mobile terminal, the mobile terminal starts the video shooting function or the recording function.
  • The video shooting request and the recording request may also be formed by the user waking up the video shooting function or the recording function of the mobile terminal through a preset voice, where the preset voice may be voice information set by the user or preset by the mobile terminal; the preset voice information is not specifically limited in this disclosure.
  • the voice information is preprocessed.
  • the preprocessing may include frame segmentation processing, windowing processing, pre-emphasis processing, and the like.
  • pre-emphasis is performed on the speech sequence corresponding to the speech information to increase the high-frequency resolution of the speech sequence; the pre-emphasized speech sequence is then framed to obtain multiple speech subsequences;
  • the speech subsequences are subjected to windowing processing, and the windowing processing may include multiplying each speech subsequence with a window function, wherein the window function may be a rectangular window, a Hamming window, or a Hanning window.
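  • A minimal NumPy sketch of this preprocessing chain (the coefficient, frame length, and hop are common illustrative values, e.g. 25 ms frames with a 10 ms hop at 16 kHz):

```python
import numpy as np

def preprocess_speech(speech, frame_length=400, hop=160, alpha=0.97):
    """Pre-emphasize, frame, and window a speech sequence
    (assumes len(speech) >= frame_length)."""
    # Pre-emphasis raises high-frequency resolution: y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(speech[0], speech[1:] - alpha * speech[:-1])
    # Frame segmentation into overlapping subsequences.
    n_frames = 1 + (len(emphasized) - frame_length) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_length]
                       for i in range(n_frames)])
    # Windowing: multiply each subsequence by a Hamming window.
    return frames * np.hamming(frame_length)
```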
  • voice feature extraction is performed on the preprocessed voice information to obtain voice features corresponding to the voice information.
  • The characteristic parameters of the speech information include the Mel frequency cepstrum coefficient (MFCC), the linear prediction cepstrum coefficient (LPCC), the line spectral frequency (LSF), the wavelet transform coefficient (WTC), and so on.
  • the voice feature extraction can extract one or more feature parameters of the voice information, and use the one or more feature parameters as the voice features corresponding to the voice information.
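  • As a sketch of this feature extraction step (assuming the librosa library; parameter choices are illustrative):

```python
import librosa

def voice_features(audio_path, n_mfcc=13):
    """Extract MFCCs as the voice feature matrix (n_mfcc x frames)."""
    signal, sample_rate = librosa.load(audio_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
```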
  • the voice feature can be matched with the voice feature template to obtain text information corresponding to the voice feature.
  • The voice features corresponding to the voice information can be matched with a plurality of voice feature samples in the voice feature template respectively; when the voice features corresponding to the voice information match a voice feature sample of the voice feature template, the text information sample corresponding to that voice feature sample is configured as the text information corresponding to the voice information.
  • the voice feature template includes multiple voice feature samples and text information samples corresponding to each voice feature sample.
  • The process of constructing the voice feature template includes: first, acquiring a plurality of text information samples, and acquiring the voice information corresponding to the text information samples; then, acquiring the voice feature samples corresponding to the voice information of the text information samples according to the above process; and finally, constructing the voice feature template from the mapping relationship between the text information samples and their corresponding voice feature samples.
  • When acquiring or displaying the target image or target video, the recording function is enabled, and the recording function is used to determine whether voice information exists and, when it does, to collect the voice information.
  • word segmentation processing is performed on the text information to obtain one or more keywords.
  • The word segmentation process can include the following two methods. The first is dictionary-based word segmentation: the text information is divided into multiple words according to the dictionary, and the multiple words are then combined. A dictionary can be pre-built, and the words in the dictionary can be marked with different parts of speech.
  • the part of speech of each keyword is also obtained according to the part of speech of each word in the dictionary.
  • word segmentation processing can also be performed on the text information by using a dictionary that does not mark parts of speech, and after the word segmentation processing, part-of-speech recognition is performed on each keyword.
  • the keywords corresponding to the text information may include entity participles, descriptive participles, verb participles, etc. according to different parts of speech.
  • the entity participle represents a real object or a word that refers to a real object, such as a noun participle, a pronoun participle, which can be "flower", “clothes", “you”, etc.; the description participle represents the relationship between objects or is used for Words that describe items, such as adjective participles and adverbial participles, can be "left side", “beautiful", "so dark” and so on.
  • The second is character-based segmentation: the text information is divided into individual characters, and the characters are then combined into words according to the dictionary.
  • a word segmentation algorithm based on statistics can also be used to perform word segmentation processing on the text information, and the present disclosure does not specifically limit the word segmentation processing algorithm.
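  • As a sketch of dictionary-based word segmentation with part-of-speech tags (assuming the jieba library as one such segmenter; the tag grouping is illustrative):

```python
import jieba.posseg as pseg

def extract_keywords(text):
    """Split text into entity, description, and verb participles by part of speech."""
    entity, description, verbs = [], [], []
    for word, flag in pseg.cut(text):
        if flag.startswith(("n", "r")):    # nouns and pronouns -> entity participles
            entity.append(word)
        elif flag.startswith(("a", "d")):  # adjectives and adverbs -> description participles
            description.append(word)
        elif flag.startswith("v"):         # verbs -> verb participles
            verbs.append(word)
    return entity, description, verbs
```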
  • In step S330, instruction information is generated according to the text information and the feature information of the target image.
  • the feature information of the target image is the object category and object position of each object in the target image.
  • FIG. 6 shows a schematic flowchart of a method for generating instruction information. As shown in FIG. 6, the process includes at least steps S610 to S620. The details are as follows: In step S610, the target object is determined according to the matching result of each piece of object information and the text information.
  • The object category and object position of each object are respectively matched with the entity participles and description participles in the text information; when the object category and object position of an object match the entity participles and description participles in the text information, the object is determined as the target object.
  • FIG. 7 shows a schematic flowchart of a method for determining a target object according to a matching result.
  • the process includes at least steps S710 to S740, and the details are as follows:
  • In step S710, the object topological relationship is determined according to the object category and the object position.
  • the object position includes the position coordinates of each object in the target image, and the position coordinates of any two objects are subtracted to obtain the relative position relationship between the two objects.
  • the object category of each object is used as a label, and the object topology relationship is generated according to the relative positional relationship between the objects.
  • the object topological relationship includes the object type of each object, the object position of each object, and the relative positional relationship between each object.
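  • A sketch of building such an object topological relationship (the relative position is expressed as a center-to-center offset; the representation is illustrative):

```python
def object_topology(objects):
    """objects: list of (category, (x_start, y_start, x_end, y_end)).
    Returns {(i, j): (dx, dy)} offsets between object centers; the indices
    keep each object's label available via objects[i][0]."""
    centers = [((b[0] + b[2]) / 2, (b[1] + b[3]) / 2) for _, b in objects]
    topology = {}
    for i, (ax, ay) in enumerate(centers):
        for j, (bx, by) in enumerate(centers):
            if i != j:
                # Negative dx: object i is left of object j; negative dy: above it.
                topology[(i, j)] = (ax - bx, ay - by)
    return topology
```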
  • In step S720, the object category matching the entity participles is determined as the target object category.
  • an entity word in the text information is matched with the object category of each object, and when the object category of the object matches the entity word, the object category of the object is determined as the target object category.
  • The entity participles in the voice information are used to screen the multiple object categories; if one or more object categories appear in the text information corresponding to the voice information, the one or more object categories are determined as the target object categories.
  • In step S730, candidate objects are determined among the objects according to the target object category.
  • one or more objects may correspond to the same object category.
  • The target object category obtained by the above screening can be used to further screen the multiple objects in the target image; when the object category corresponding to an object is the target object category, the object is determined as a candidate object.
  • In step S740, the target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description participles.
  • The object topological relationship corresponding to each candidate object is determined from the object topological relationship, and the description participles in the text information are matched with the object topological relationship corresponding to each candidate object; when the description participles match the object topological relationship corresponding to a candidate object, the candidate object is determined as the target object.
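  • A sketch of the screening in steps S720 to S740 (treating a category that appears among the entity participles as a match, and 'left' as one example description relation; both are assumptions):

```python
def screen_candidates(objects, entity_participles):
    """Steps S720/S730: keep objects whose category appears among the entity participles."""
    return [i for i, (category, _) in enumerate(objects)
            if category in entity_participles]

def match_description(candidates, topology, description_participles):
    """Step S740 (illustrative): 'left' selects the candidate lying to the
    left of every other candidate."""
    if "left" in description_participles and len(candidates) > 1:
        for i in candidates:
            if all(topology[(i, j)][0] < 0 for j in candidates if j != i):
                return i
    return candidates[0] if candidates else None
```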
  • step S710 may be performed before step S720, may be performed after step S730, or may be performed simultaneously with step S720 and step S730, which is not specifically limited in the present disclosure.
  • line-of-sight information can also be acquired and a gaze position corresponding to the line-of-sight information can be determined.
  • the gaze position may be a gaze point or a gaze area on a two-dimensional plane.
  • The line-of-sight information is generated for the target image and is acquired in real time in the process of shooting the target image. Since there may be multiple target images, there may also be multiple pieces of line-of-sight information, and the correspondence between the target images and the line-of-sight information is determined according to the shooting time of the target images and the acquisition time of the line-of-sight information.
  • the user's line-of-sight information for the target image can be obtained through the camera module or smart screen of the mobile terminal.
  • the user's line-of-sight information for the target image can be obtained in real time through the built-in camera module in the smart helmet or glasses.
  • the line of sight information may also include a left eye image, a right eye image, a face image, and a face position.
  • the face image may provide head posture information, and the face position may provide eye position information.
  • The line-of-sight information is used as input to a gaze point estimation algorithm, which determines the gaze point corresponding to the line-of-sight information; the face image and the face position may also be used as input to determine the corresponding gaze area, and so on.
  • the present disclosure does not specifically limit the acquisition of the gaze position.
  • the target object determined in the above embodiment can be more accurately screened by using the gaze area corresponding to the sight line information, so as to determine the target object that the user pays most attention to.
  • FIG. 8 shows a schematic flowchart of a method for determining a target object from candidate objects. As shown in FIG. 8, the flow includes at least steps S810 to S830. The details are as follows: In step S810, candidate target objects are determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description participles. In an exemplary embodiment of the present disclosure, if, in the topological relationship of the candidate objects, there are multiple candidate objects matching the description participles, the multiple candidate objects are used as candidate target objects.
  • In step S820, the object position of each candidate target object is matched with the gaze position.
  • The object position of each candidate target object is obtained and matched with the gaze position. If the gaze position is a gaze point, it is determined whether the gaze point is within the detection frame determined by the object position of each candidate target object; if the gaze position is a gaze area, the degree of coincidence between the gaze area and the detection frame determined by the object position of each candidate target object is calculated.
  • In step S830, when the object position of a candidate target object matches the gaze position, the candidate target object is determined as the target object.
  • If the gaze point is within the detection frame corresponding to a candidate target object, it is determined that the object position of the candidate target object matches the gaze position, and the candidate target object is determined as the target object.
  • The candidate target object corresponding to the detection frame with the greatest degree of coincidence with the gaze area may also be obtained, and that candidate target object may be determined as the target object.
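  • A sketch of both checks (boxes follow the (x_start, y_start, x_end, y_end) convention used above):

```python
def gaze_point_in_frame(point, frame):
    """True if the gaze point lies inside the detection frame."""
    x, y = point
    return frame[0] <= x <= frame[2] and frame[1] <= y <= frame[3]

def coincidence(gaze_area, frame):
    """Degree of coincidence between a gaze area and a detection frame,
    both given as rectangles."""
    width = min(gaze_area[2], frame[2]) - max(gaze_area[0], frame[0])
    height = min(gaze_area[3], frame[3]) - max(gaze_area[1], frame[1])
    return max(0.0, width) * max(0.0, height)

def pick_target(candidate_frames, gaze_area):
    """Step S830: the candidate whose detection frame coincides most with the gaze area."""
    return max(candidate_frames, key=lambda frame: coincidence(gaze_area, frame))
```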
  • After acquiring the object information of each object in the target image, the text information corresponding to the voice information, and the gaze position corresponding to the line-of-sight information, it is also possible to first determine candidate objects among the objects according to the object information of each object and the gaze position, and then determine the target object from the candidate objects according to the text information.
  • FIG. 9 shows a schematic flowchart of a method for determining a target object.
  • the process includes at least steps S910 to S930, and the details are as follows:
  • In step S910, the object position matching the gaze position is determined as the target object position.
  • The object positions of the objects in the target image are respectively matched with the gaze position, and when the object position of an object matches the gaze position, the object position of the object is determined as the target object position. Since the object positions of different objects may have overlapping areas, the gaze position may match multiple object positions, and there may be multiple determined target object positions.
  • In step S920, candidate objects are determined among the objects according to the target object position.
  • an object corresponding to the target object position is determined as a candidate object.
  • In step S930, the object information of each candidate object is matched with the text information, and the target object is determined according to the matching result.
  • the above-mentioned embodiments screen a plurality of objects according to the gaze position, and determine candidate objects. After the candidate object is determined, the object information of the candidate object is matched with the text information, and the target object is determined in the candidate object.
  • First, the object category of each candidate object that matches the entity participles in the text information is determined as the target object category; then, candidate objects are further screened among the candidate objects according to the target object category; finally, the target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description participles in the text information.
  • the object topological relationship may be determined according to the object category and object position of each object, and the object topological relationship corresponding to the candidate object is determined in the object topological relationship of each object.
  • the object topological relationship of the candidate object may also be determined according to the object category and object position of the candidate object, and the object topological relationship corresponding to the candidate object is determined in the object topological relationship of the candidate object.
  • the detailed process of determining the target object in the candidate object according to the object information and text information of the candidate object is as described in the method embodiment of FIG. 7 above, and will not be repeated here.
  • The information of three modalities, namely the line-of-sight information, the voice information, and the feature information of the target image, is fused to determine the instruction information, which further improves the accuracy of the instruction information determined from the voice information and the feature information of the target image.
  • In step S620, instruction information is generated according to the object information of the target object.
  • FIG. 10 shows a schematic flowchart of another method for generating instruction information.
  • the process includes at least steps S1010 to S1020, and the details are as follows:
  • In step S1010, the user intent information is determined according to the text information.
  • The user intent information may be identified through the text information: word segmentation processing is performed on the text information to obtain one or more keywords, and the user intent information corresponding to the keywords is determined according to the first preset mapping relationship.
  • One or more keywords corresponding to the text information are respectively matched with the keywords in the first preset mapping relationship, and the user intent information corresponding to the matching keywords in the first preset mapping relationship is obtained. The verb participles and/or adjective participles in the text information can also be matched against the first preset mapping relationship, so as to improve the efficiency of obtaining the user intent information.
  • the first preset mapping relationship includes an association relationship between keywords and user intent information.
  • One keyword may correspond to multiple user intent information, and one user intent information may also correspond to multiple keywords.
  • For example, for some keywords the corresponding user intent information may be "obtain a purchase link"; if the keyword is "what", the corresponding user intent information may be "inquire about detailed information, obtain a purchase link"; if the keyword is "too dark, hard to see", the corresponding user intent information may be "adjust the brightness of the image, adjust the contrast of the image", and so on.
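  • A sketch of such a first preset mapping relationship (entries mirror the examples above; a real mapping would be much larger):

```python
# Hypothetical keyword -> intent table; one keyword may map to several intents
# and one intent may be reached from several keywords.
FIRST_PRESET_MAPPING = {
    "what": ["inquire about detailed information", "obtain a purchase link"],
    "too dark": ["adjust the brightness of the image",
                 "adjust the contrast of the image"],
}

def user_intent(keywords):
    """Collect the intent information of every keyword found in the mapping."""
    intents = []
    for keyword in keywords:
        intents.extend(FIRST_PRESET_MAPPING.get(keyword, []))
    return intents
```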
  • In step S1020, instruction information is generated according to the object information of the target object and the user intent information.
  • a sub-target image corresponding to the target object is obtained according to object information of the target object, and instruction information is generated according to the object category of the target object, the sub-target image of the target object, and user intent information.
  • an object acquisition path related to the object information of the target object is acquired and displayed according to the user intent information in the instruction information.
  • the object acquisition path of the target object can be queried according to the object category of the target object and/or the sub-target image of the target object, and the object acquisition path can be displayed on the display screen of the mobile terminal.
  • the object category and/or sub-target image of the target object can be input into the purchase platform, and the purchase link returned by the purchase platform can be obtained.
  • The object detail information related to the object information of the target object can also be acquired and displayed according to the user intent information in the instruction information, or both the object acquisition path and the object detail information related to the object information of the target object can be acquired and displayed according to the user intent information in the instruction information.
  • This scheme integrates the feature information of the target image and the text information corresponding to the voice information to obtain the instruction information; then, according to the instruction information, information that the user is interested in is recommended to the user.
  • the present exemplary embodiment can more accurately determine user instruction information, thereby providing users with more accurate recommendation information, and improving the interaction experience between the user and the mobile terminal.
  • the user intent information is determined according to the text information
  • the parameter adjustment information is generated according to the user intent information and the image parameter information.
  • The instruction information at this time may be parameter adjustment information.
  • The mobile terminal may adjust the parameters of the target image according to the parameter adjustment information, and display the parameter-adjusted target image on the display screen.
  • the image parameter information corresponding to the target image is "brightness value of 65”
  • the text information corresponding to the user's voice information is "shooting is so dark”
  • the user intent information identified according to the text information is "improve the brightness of the image”.
  • the instruction information generated according to the instruction information acquisition method in the above-mentioned embodiment may be "adjust the brightness of the target image, and increase the brightness value of the target image to 65+N".
  • N is a positive integer, and the value of N can be set according to actual scenarios, which is not specifically limited in the present disclosure.
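  • A sketch of carrying out such parameter adjustment information (Pillow's ImageEnhance is one way to do it; mapping the brightness value 65 to an enhancement factor is an assumption):

```python
from PIL import Image, ImageEnhance

def increase_brightness(image_path, current_value=65, n=10):
    """Raise the image brightness from current_value to current_value + n
    by scaling with the ratio of the two values (factor > 1 brightens)."""
    image = Image.open(image_path)
    factor = (current_value + n) / current_value
    return ImageEnhance.Brightness(image).enhance(factor)
```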
  • the feature information of the target image includes object information of each object in the target image, and image parameter information of the target image.
  • the parameter adjustment information can be generated according to the text information and the image parameter information of the target image; the target object can be determined according to the object information and text information of each object in the target image; the instruction information can be generated according to the parameter adjustment information and the object information of the target object.
  • The methods of generating parameter adjustment information according to the text information and the image parameter information of the target image, and of determining the target object according to the object information and text information of each object in the target image, have been described in detail in the above embodiments and will not be repeated here.
  • The sub-target image corresponding to the target object can be obtained according to the object position of the target object, the parameters of the sub-target image of the target object can then be adjusted according to the parameter adjustment information, and the parameter-adjusted target image or the parameter-adjusted sub-target image of the target object can be displayed.
  • According to the parameter-adjusted sub-target image of the target object, the object acquisition path and the object detail information of the target object can be acquired and displayed.
  • The candidate instruction information corresponding to each target image may be determined according to the method in the above embodiments, and the instruction information may then be determined according to the candidate instruction information of each target image.
  • FIG. 11 shows a schematic flowchart of another method for generating instruction information. As shown in FIG. 11, the flow includes at least steps S1110 to S1130. The details are as follows: In step S1110, the candidate instruction information corresponding to each target image is determined according to the feature information and text information of each target image.
  • multiple target images are derived from collected target videos, and feature information of each target image is acquired respectively, voice information in the target video is acquired, and text information in the voice information is recognized.
  • The multiple target images may correspond to one piece of voice information or to multiple pieces of voice information, and multiple pieces of candidate instruction information are determined according to the feature information of each target image and the text information of the voice information corresponding to each target image, respectively.
  • the candidate instruction information is configured as the instruction information.
  • The candidate instruction information corresponding to each target image is matched against the others; if the candidate instruction information corresponding to each target image matches completely, or the degree of matching between the pieces of candidate instruction information is greater than a matching degree threshold, any piece of candidate instruction information can be configured as the instruction information.
  • the matching degree threshold may be set according to the actual situation, for example, the matching degree threshold may be set to 99%, or may be set to 99.5%, etc., which is not specifically limited in the present disclosure.
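• A sketch of this matching step, using Python's standard difflib as one plausible way to compute a matching degree between candidate instruction strings (the 0.99 threshold mirrors the 99% example above; all names are illustrative):

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.99  # e.g. 99%; set according to the actual scenario

def match_degree(a, b):
    # matching degree between two candidate instruction strings
    return SequenceMatcher(None, a, b).ratio()

def select_if_consistent(candidates):
    """If every candidate matches the first one above the threshold,
    any one of them can be configured as the instruction information."""
    first = candidates[0]
    if all(match_degree(first, other) >= MATCH_THRESHOLD
           for other in candidates[1:]):
        return first
    return None  # otherwise fall back to confidence-based selection
```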
• when the candidate instruction information corresponding to the target images differs, the instruction information is determined according to the confidence level corresponding to each piece of candidate instruction information.
• the confidence level corresponding to a piece of candidate instruction information may be the confidence level of the user intent information in that candidate, the confidence level of the target object, or, for example, the product of the confidence level of the user intent information and that of the target object, which is not specifically limited in the present disclosure.
• the confidence level corresponding to the user intent information may be the degree of matching between the keywords in the text information and the keywords in the first preset mapping relationship; the confidence level corresponding to the target object may be the confidence of the object category or the object position of the target object, or the degree of matching between the object information of the target object and the text information.
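• The confidence-based fallback can be sketched as follows; the dictionary fields are hypothetical stand-ins for however a candidate carries its intent and object confidences, and the product is just one of the combinations the disclosure allows:

```python
def select_by_confidence(candidates):
    """Rank candidate instruction information by the product of the
    user-intent confidence and the target-object confidence."""
    return max(candidates,
               key=lambda c: c["intent_confidence"] * c["object_confidence"])

# candidates = [
#     {"text": "obtain purchase path for flower pot 1",
#      "intent_confidence": 0.92, "object_confidence": 0.88},
#     {"text": "obtain purchase path for flower pot 2",
#      "intent_confidence": 0.90, "object_confidence": 0.61},
# ]
```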
• the above-mentioned voice information, target image or target video, and line-of-sight information can also be acquired through an intelligent assistant, which may be an application running on the mobile terminal.
• a shortcut for quickly starting the intelligent assistant can be preset.
• for example, when the mobile terminal is in the screen-off state, the intelligent assistant can be entered by pressing the power button three times.
  • other shortcuts can also be used to enter the smart assistant, which is not specifically limited in the present disclosure.
• the instruction information acquisition method of this embodiment can thus activate the intelligent assistant on the mobile terminal through a shortcut, which simplifies the otherwise tedious launch steps and makes activation of the intelligent assistant more intelligent, rapid, and convenient.
• FIG. 12 shows a schematic flowchart of a method for acquiring instruction information according to a specific embodiment of the present disclosure. As shown in FIG. 12: in step S1201, the target image is acquired and object extraction is performed on it to obtain the object information of each object, where the object information includes the object category and the object position; in step S1203, the object topological relationship is determined according to the object category and the object position of each object; in step S1205, the voice information associated with the target image is acquired and the text information corresponding to the voice information is determined; in step S1207, word segmentation is performed on the text information to obtain one or more keywords, where the keywords include entity word segments and description word segments; in step S1209, the object category that matches the entity word segment is determined as the target object category; in step S1211, candidate objects are determined among the objects according to the target object category; in step S1213, the target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description word segments.
• FIG. 13 shows a schematic flowchart of a method for acquiring instruction information according to another specific embodiment of the present disclosure. As shown in FIG. 13: in step S1301, the target image is acquired and object extraction is performed on it to obtain the object information of each object, where the object information includes the object category and the object position; in step S1303, the voice information associated with the target image is acquired and the text information corresponding to the voice information is determined; in step S1305, word segmentation is performed on the text information to obtain one or more keywords, where the keywords include entity word segments and description word segments; in step S1307, the line-of-sight information associated with the target image is acquired and the gaze position corresponding to the line-of-sight information is determined; in step S1309, the object position that matches the gaze position is determined as the target object position; in step S1311, candidate objects are determined among the objects according to the target object position; in step S1313, the object information of the candidate objects is matched with the text information, and the target object is determined according to the matching result.
• FIG. 14 shows a schematic flowchart of a method for acquiring instruction information according to yet another specific embodiment of the present disclosure. As shown in FIG. 14: in step S1401, the target image is acquired and object extraction is performed on it to obtain the object information of each object, where the object information includes the object category and the object position; in step S1403, the object topological relationship is determined according to the object category and the object position of each object; in step S1405, the voice information associated with the target image is acquired and the text information corresponding to the voice information is determined; in step S1407, word segmentation is performed on the text information to obtain one or more keywords, where the keywords include entity word segments and description word segments; in step S1409, the line-of-sight information associated with the target image is acquired and the gaze position corresponding to the line-of-sight information is determined; in step S1411, the object category that matches the entity word segment is determined as the target object category.
• the target image is shown in FIG. 15.
  • the target image 1500 is identified according to the target detection algorithm.
• the target image 1500 includes 4 objects, whose object categories are "flower", "flower pot 1", "flower pot 2", and "water dispenser"; the text information corresponding to the user's voice information is "I want to buy the flower pot on the left", and the user intent information identified from the text information is "obtain the purchase path".
  • the gaze position 1501 of the user on the target image 1500 is acquired.
• the object topological relationship is determined from the object categories and object positions as follows: "at the leftmost of the image is a potted flower", "below the flower is flower pot 1", "at the rightmost of the image is the water dispenser", "to the left of the water dispenser is flower pot 2".
• the noun segment "flower pot" and the adverb segments "left" and "that" are obtained from the text information; the noun segment is then matched with the object categories to determine the target object category "flower pot".
• the candidate objects are accordingly determined as "flower pot 1" and "flower pot 2"; the object topological relationships corresponding to the candidate objects are matched with the adverb segments, and the candidate target objects remain "flower pot 1" and "flower pot 2".
• the object positions of "flower pot 1" and "flower pot 2" are matched with the gaze position, and the target object the user pays most attention to is determined to be "flower pot 1".
• the mobile terminal then sends the sub-target image corresponding to "flower pot 1" to the corresponding shopping website, so as to obtain a shopping link for the same item as "flower pot 1".
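• The flower-pot scenario can be summarised in a short hypothetical sketch that filters objects by the noun segment, then by the adverb segments against the topological relation, and finally disambiguates by gaze position (all field names and the distance heuristic are assumptions for illustration, not the disclosure's own algorithm):

```python
def pick_target_object(objects, noun, adverbs, gaze):
    """objects: dicts with "category", "box" (l, t, r, b), "relation".
    1) keep objects whose category matches the noun segment;
    2) keep those whose topological relation matches an adverb segment;
    3) pick the remaining candidate closest to the gaze position."""
    candidates = [o for o in objects if noun in o["category"]]
    filtered = [o for o in candidates
                if any(a in o["relation"] for a in adverbs)]
    candidates = filtered or candidates  # keep all if the filter empties

    def gaze_distance(obj):
        l, t, r, b = obj["box"]
        cx, cy = (l + r) / 2, (t + b) / 2
        return (cx - gaze[0]) ** 2 + (cy - gaze[1]) ** 2

    return min(candidates, key=gaze_distance)
```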
  • FIG. 16 schematically shows a block diagram of an apparatus for acquiring instruction information according to an embodiment of the present disclosure.
  • the instruction information acquisition apparatus 1600 includes an image information extraction module 1601 , a text information acquisition module 1602 , and an instruction information generation module 1603 .
• the image information extraction module 1601 is used to acquire the target image and perform information extraction on it to obtain the feature information of the target image.
• the text information acquisition module 1602 is used to acquire the voice information and recognize the text information corresponding to it, where the voice information is information associated with the target image.
  • the instruction information generating module 1603 is configured to generate instruction information according to the text information and feature information of the target image.
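• Read together, the three modules can be pictured as a thin pipeline; the sketch below only mirrors the division of labour of apparatus 1600 and assumes the three callables are supplied elsewhere:

```python
class InstructionInfoApparatus:
    """Minimal mirror of apparatus 1600 with its three modules."""

    def __init__(self, extract_features, recognize_text, generate_instruction):
        self.extract_features = extract_features           # module 1601
        self.recognize_text = recognize_text               # module 1602
        self.generate_instruction = generate_instruction   # module 1603

    def acquire_instruction(self, target_image, voice_info):
        features = self.extract_features(target_image)
        text = self.recognize_text(voice_info)  # voice associated with the image
        return self.generate_instruction(text, features)
```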
  • the image information extraction module 1601 may also be configured to perform object extraction on the target image to obtain object information of each object in the target image.
  • the instruction information generation module 1603 may also be configured to match each object information with text information respectively, determine the target object according to the matching result, and generate instruction information according to the object information of the target object.
• the instruction information generation module 1603 can also be used to determine the object topological relationship according to the object category and the object position; determine the object category that matches the entity word segment as the target object category; determine candidate objects among the objects according to the target object category; and determine the target object from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description word segments.
• here the object information includes the object category and the object position, and the text information includes entity word segments and description word segments.
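• One plausible way to derive such a topological relationship from bounding boxes is to compare box centres, as in this hedged sketch (the relation vocabulary and field names are assumptions, not the disclosure's):

```python
def object_topology(objects):
    """Derive pairwise relations such as "left of" / "above" from
    the centres of the objects' bounding boxes (l, t, r, b)."""
    def centre(box):
        l, t, r, b = box
        return (l + r) / 2, (t + b) / 2

    relations = []
    for a in objects:
        for b in objects:
            if a is b:
                continue
            (ax, ay), (bx, by) = centre(a["box"]), centre(b["box"])
            if ax < bx:
                relations.append((a["category"], "left of", b["category"]))
            if ay < by:  # image coordinates grow downwards
                relations.append((a["category"], "above", b["category"]))
    return relations
```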
• the instruction information generation module 1603 can also be used to obtain line-of-sight information and determine the gaze position corresponding to the line-of-sight information; determine the candidate target object from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description word segments; match the object position of the candidate target object with the gaze position; and, when the object position of the candidate target object matches the gaze position, determine the candidate target object as the target object.
• the instruction information generation module 1603 can also be used to determine the object position that matches the gaze position as the target object position; determine candidate objects among the objects according to the target object position; and match the object information of the candidate objects with the text information, determining the target object according to the matching result.
  • the instruction information generation module 1603 may also be configured to determine user intent information according to text information; and generate instruction information according to the object information of the target object and the user intent information.
• the instruction information generation module 1603 can also be used to perform word segmentation on the text information to obtain one or more keywords, and to determine the user intent information corresponding to the keywords according to the first preset mapping relationship, where the first preset mapping relationship includes the association between keywords and user intent information.
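• The first preset mapping relationship can be as simple as a keyword-to-intent dictionary; the entries below are illustrative assumptions, not the mapping the disclosure actually defines:

```python
# first preset mapping relationship: keyword -> user intent information
FIRST_PRESET_MAPPING = {
    "buy": "obtain the purchase path",
    "brighter": "increase image brightness",
    "what is": "obtain detailed object information",
}

def intent_from_keywords(keywords):
    """Return the user intent information for the first keyword that
    matches an entry of the first preset mapping relationship."""
    for word in keywords:
        for key, intent in FIRST_PRESET_MAPPING.items():
            if key in word or word in key:
                return intent
    return None
```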
• the image information extraction module 1601 can also be used to obtain the object position of each object, the first predicted category of each object, and the first confidence level corresponding to the first predicted category; obtain the feature vector of each object according to the object position, and determine the second predicted category of each object and the second confidence level corresponding to the second predicted category according to the second preset mapping relationship; and determine the object category of each object according to the first predicted category and the second predicted category, together with their corresponding first and second confidence levels; wherein the second preset mapping relationship includes the association between feature vectors and second predicted categories.
• the image information extraction module 1601 can also be used to crop the target image according to the object positions to obtain the sub-target images corresponding to each object, and to perform feature extraction on the sub-target images to obtain the feature vectors of each object.
• the image information extraction module 1601 can also be used to determine whether the first predicted category is the same as the second predicted category; when they are the same, the first predicted category is configured as the object category of the object; when they differ, it is determined whether the first confidence level is greater than the second confidence level, and the object category is determined according to the judgment result.
• specifically, when the first confidence level is greater than the second confidence level, the first predicted category is configured as the object category of the object; when the first confidence level is less than or equal to the second confidence level, the second predicted category is configured as the object category of the object.
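• The decision rule for fusing the two predictions reduces to a few lines; this sketch simply restates the comparisons described above:

```python
def fuse_categories(first_cat, first_conf, second_cat, second_conf):
    """Agree -> take either category; disagree -> take the category with
    the higher confidence (ties go to the second predicted category)."""
    if first_cat == second_cat:
        return first_cat
    return first_cat if first_conf > second_conf else second_cat
```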
  • the image information extraction module 1601 may also be configured to perform information extraction on the target image to obtain image parameter information of the target image.
  • the instruction information generation module 1603 may also be configured to determine user intent information according to text information; and generate parameter adjustment information according to the user intent information and image parameter information.
• the instruction information generation module 1603 can also be configured to generate parameter adjustment information according to the text information and the image parameter information of the target image; determine the target object according to the object information of each object in the target image and the text information; and generate the instruction information according to the parameter adjustment information and the object information of the target object.
• the instruction information generation module 1603 may also be configured to determine the candidate instruction information corresponding to each target image according to the feature information and text information of each target image; when the candidate instruction information corresponding to each target image is the same, configure the candidate instruction information as the instruction information; and when the candidate instruction information differs, determine the instruction information according to the confidence level corresponding to each piece of candidate instruction information.
• in this case there may be multiple target images.
• the instruction information acquisition apparatus may further include an information display module (not shown in the figure), configured to acquire and display, according to the user intent information in the instruction information, the object acquisition path related to the object information of the target object; and/or to acquire and display, according to the user intent information in the instruction information, the detailed object information related to the object information of the target object.
  • the information display module may also be configured to perform parameter adjustment on the target image according to the parameter adjustment information, and display the parameter-adjusted target image.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-described method of the present specification is stored.
• various aspects of the present disclosure can also be implemented in the form of a program product that includes program code; when the program product runs on a mobile terminal, the program code causes the mobile terminal to execute the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of the present disclosure.
• for example, any one or more of the steps in FIGS. 3 to 14 may be performed.
• Exemplary embodiments of the present disclosure also provide a program product for implementing the above method, which may adopt a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer.
• however, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer readable signal medium may include a propagated data signal in baseband or as part of a carrier wave with readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium can also be any readable medium, other than a readable storage medium, that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
• Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar.
• the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
• the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

Provided are an instruction information acquisition method and apparatus, a readable storage medium, and an electronic device, relating to the technical field of artificial intelligence. The method comprises: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image (S310); acquiring voice information and identifying text information corresponding to the voice information, wherein the voice information is information associated with the target image (S320); and generating instruction information according to the text information and the feature information of the target image (S330). By fusing the feature information of the target image and the voice information associated with the target image to generate instruction information, the present technical solution improves the accuracy of instruction information.

Description

Instruction information acquisition method and device, readable storage medium, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 202110292701.7, titled "Instruction Information Acquisition Method and Device, Readable Storage Medium, Electronic Equipment", filed on March 18, 2021, the entire content of which is incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the technical field of artificial intelligence, and in particular to a method for acquiring instruction information, an apparatus for acquiring instruction information, a computer-readable storage medium, and an electronic device.
BACKGROUND
With the rapid development of artificial intelligence, more and more mobile terminals are installed with applications having voice assistant or visual assistant functions, so as to better interact with users.
In the related art, a mobile terminal can realize functions such as voice control and information query through a voice assistant, and can realize functions such as image information acquisition through a visual assistant. However, it is difficult for existing voice assistants or visual assistants to accurately generate user instruction information, resulting in a poor user experience.
SUMMARY OF THE INVENTION
The purpose of the present disclosure is to provide a method for acquiring instruction information, an apparatus for acquiring instruction information, a computer-readable storage medium, and an electronic device, so as to solve, at least to a certain extent, the problem in the related art that instruction information is difficult to generate accurately.
According to a first aspect of the present disclosure, there is provided a method for acquiring instruction information, the method comprising: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image; acquiring voice information and recognizing text information corresponding to the voice information, wherein the voice information is information associated with the target image; and generating instruction information according to the text information and the feature information of the target image.
According to a second aspect of the present disclosure, there is provided an apparatus for acquiring instruction information, the apparatus comprising: an image information extraction module, configured to acquire a target image and perform information extraction on the target image to obtain feature information of the target image; a text information acquisition module, configured to acquire voice information and recognize text information corresponding to the voice information, wherein the voice information is information associated with the target image; and an instruction information generation module, configured to generate instruction information according to the text information and the feature information of the target image.
According to a third aspect of the present disclosure, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method for acquiring instruction information described in the above embodiments.
According to a fourth aspect of the present disclosure, there is provided an electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for acquiring instruction information described in the above embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 schematically shows a system architecture of the present exemplary embodiment;
FIG. 2 schematically shows an electronic device of the present exemplary embodiment;
FIG. 3 schematically shows a flowchart of a method for acquiring instruction information according to an embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of a method for acquiring object information according to an embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of a method for determining an object category according to an embodiment of the present disclosure;
FIG. 6 schematically shows a flowchart of a method for generating instruction information according to an embodiment of the present disclosure;
FIG. 7 schematically shows a flowchart of determining a target object according to a matching result according to an embodiment of the present disclosure;
FIG. 8 schematically shows a flowchart of determining a target object from candidate objects according to an embodiment of the present disclosure;
FIG. 9 schematically shows a flowchart of a method for determining a target object according to an embodiment of the present disclosure;
FIG. 10 schematically shows a flowchart of another method for generating instruction information according to an embodiment of the present disclosure;
FIG. 11 schematically shows a flowchart of yet another method for generating instruction information according to an embodiment of the present disclosure;
FIG. 12 schematically shows a flowchart of a method for acquiring instruction information according to a specific embodiment of the present disclosure;
FIG. 13 schematically shows a flowchart of a method for acquiring instruction information according to another specific embodiment of the present disclosure;
FIG. 14 schematically shows a flowchart of a method for acquiring instruction information according to yet another specific embodiment of the present disclosure;
FIG. 15 schematically shows a target image in a specific application scenario of the present disclosure;
FIG. 16 schematically shows a block diagram of an apparatus for acquiring instruction information according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
The block diagrams shown in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the figures are only illustrative and do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps can be decomposed and some can be combined or partially combined, so the actual execution order may change according to the actual situation.
In the related art, a mobile terminal is installed with applications having visual assistant and voice assistant functions. The visual assistant mainly captures visual information of the user's environment, analyzes the visual information presented as pictures or video, understands the environment, the objects, and the relationships between objects, further infers the user's intent, and provides the user with reasonable recommendations. The voice assistant mainly captures the user's voice information, converts it into text, further analyzes the user's intent, and realizes intelligent interaction with the user. However, when the visual assistant analyzes visual information presented as pictures or video, the user's intent may be judged inaccurately, or, when there are multiple objects in the picture or video, the object the user is most interested in may be judged inaccurately. For the voice assistant, noisy background sound, aging equipment causing unclear recording, or unclear expression of the user's meaning make it difficult to accurately analyze the user's intent.
Based on the problems existing in the related art, the embodiments of the present disclosure first provide a method for acquiring instruction information, which is applied to the system architecture of the exemplary embodiments of the present disclosure.
FIG. 1 shows a schematic diagram of a system architecture according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the system architecture 100 may include a terminal 110, a network 120, and a server 130. The terminal 110 may be any of various electronic devices with image capturing and audio collection functions, including but not limited to mobile phones, tablet computers, digital cameras, and personal computers. The network 120 is the medium used to provide a communication link between the terminal 110 and the server 130 and may include various connection types, such as wired or wireless communication links or fiber-optic cables. It should be understood that the numbers of terminals, networks, and servers in FIG. 1 are merely illustrative; there can be any number of each according to implementation needs. For example, the server 130 may be a server cluster composed of multiple servers.
The method for acquiring instruction information provided by the embodiments of the present disclosure may be executed by the terminal 110; for example, after the terminal 110 acquires the voice information and the target image, it generates the instruction information.
In addition, the method for acquiring instruction information provided by the embodiments of the present disclosure may also be executed by the server 130; for example, after the terminal 110 acquires the voice information and the target image, it uploads them to the server 130 so that the server 130 generates the instruction information, which is not limited in the present disclosure.
Exemplary embodiments of the present disclosure provide an electronic device for implementing the method for acquiring instruction information, which may be the terminal 110 or the server 130 in FIG. 1. The electronic device includes at least a processor and a memory; the memory stores executable instructions of the processor, and the processor is configured to execute the method for acquiring instruction information by executing the executable instructions.
The electronic device can be implemented in various forms, for example, mobile devices such as mobile phones, tablet computers, notebook computers, personal digital assistants (PDAs), navigation devices, wearable devices, and drones, as well as fixed devices such as desktop computers and smart TVs.
The following takes the mobile terminal 200 in FIG. 2 as an example to illustrate the structure of the electronic device. It will be understood by those skilled in the art that, apart from components specifically intended for mobile purposes, the configuration in FIG. 2 can also be applied to fixed-type devices. In other embodiments, the mobile terminal 200 may include more or fewer components than shown, combine some components, split some components, or use a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interface connection relationships between the components are only shown schematically and do not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also adopt an interface connection manner different from that in FIG. 2, or a combination of multiple interface connection manners.
As shown in FIG. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, and a Subscriber Identification Module (SIM) card interface 295. The sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyro sensor 2803, an air pressure sensor 2804, and the like. The mobile terminal 200 implements its display function through a graphics processing unit (GPU), the display screen 290, an application processor, and the like. The GPU performs mathematical and geometric calculations for graphics rendering and connects the display screen 290 and the application processor. The processor 210 may include one or more GPUs that execute program instructions to generate or alter display information. The mobile terminal 200 may include one or more display screens 290 for displaying images, videos, and the like. The mobile terminal 200 may implement a shooting function through an image signal processor (ISP), the camera module 291, an encoder, a decoder, the GPU, the display screen 290, the application processor, and the like. The camera module 291 captures still images or videos, collecting light signals through photosensitive elements and converting them into electrical signals. The ISP processes the data fed back by the camera module 291 and converts the electrical signals into digital image signals. The mobile terminal 200 may implement audio functions, such as music playback and recording, through the audio module 270, the speaker 271, the receiver 272, the microphone 273, the earphone interface 274, the application processor, and the like. The audio module 270 converts digital audio information into analog audio signal output and converts analog audio input into digital audio signals; it may also be used to encode and decode audio signals. The speaker 271 and the receiver 272 convert audio electrical signals into sound signals, and the microphone 273 converts sound signals into electrical signals. The earphone interface 274 is used for connecting wired earphones. The keys 294 include a power key, volume keys, and the like, and may be mechanical keys or touch keys. The mobile terminal 200 may receive key inputs and generate key signal inputs related to user settings and function control of the mobile terminal 200.
The method and apparatus for acquiring instruction information according to exemplary embodiments of the present disclosure are described in detail below. FIG. 3 shows a schematic flowchart of the method for acquiring instruction information. As shown in FIG. 3, the method includes at least the following steps. Step S310: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image. Step S320: acquiring voice information and recognizing text information corresponding to the voice information, wherein the voice information is information associated with the target image. Step S330: generating instruction information according to the text information and the feature information of the target image.
The method for acquiring instruction information in the present disclosure can fuse the feature information of the target image with the voice information associated with the target image to generate the instruction information, which improves the accuracy of the instruction information and thus the interaction experience between the user and the mobile terminal. To make the technical solutions of the present disclosure clearer, the method is described next.
In step S310, a target image is acquired and information extraction is performed on the target image to obtain feature information of the target image.
In an exemplary embodiment of the present disclosure, the target image may be an image captured in real time by the camera function of the mobile terminal, or a local image stored in the mobile terminal. The user can send an image acquisition request to the mobile terminal, and the mobile terminal determines the target image according to the request. The image acquisition request may be a shooting request, in response to which the mobile terminal enables the shooting function and collects the target image in real time. The shooting request may be the user triggering a shooting button on the mobile terminal, for example tapping the camera icon, or the user waking up the shooting function with a preset voice. The image acquisition request may also be an image selection request, in response to which the mobile terminal displays local images and, in response to the user's trigger operation on a local image, determines the target image among the local images. In addition, there may be one or more target images. For example, the mobile terminal enables the shooting function, collects a target video through the camera module, acquires one video frame every preset time period in the target video, and uses the acquired video frames as target images. The preset time period may be set according to the actual situation; for example, a video frame may be acquired every 30 ms in the target video, which is not specifically limited in the present disclosure.
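As a hedged sketch of this frame-sampling step (assuming OpenCV is available; the 30 ms interval is the example given above, not a mandated value), one frame can be taken from the target video every preset time period:

```python
import cv2

def sample_target_images(video_path, interval_ms=30.0):
    """Grab one frame from the target video every `interval_ms`
    milliseconds and return the frames as target images."""
    capture = cv2.VideoCapture(video_path)
    frames, next_stamp = [], 0.0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if capture.get(cv2.CAP_PROP_POS_MSEC) >= next_stamp:
            frames.append(frame)        # this frame becomes a target image
            next_stamp += interval_ms   # frames in between are skipped
    capture.release()
    return frames
```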
In an exemplary embodiment of the present disclosure, the feature information of the target image includes object information of each object in the target image and/or image parameter information of the target image. The object information includes an object category and an object position, and the image parameter information may include parameters such as image brightness, chroma, contrast, saturation, or sharpness. In an exemplary embodiment of the present disclosure, object extraction is performed on the target image to acquire the object information of each object in the target image, where the object information includes the object category and the object position.
Specifically, object extraction is performed on the target image through a target detection model or an image segmentation model, and the object category and the object position of each object in the target image are acquired. The target detection model may be a Faster R-CNN model, a RetinaNet model, or a YOLO model, and the image segmentation model may be a DeepLab-V3 model, a RefineNet model, or a PSPNet model. In addition, a saliency detection model may also be used to extract objects from the target image and acquire the object positions of the objects. The saliency detection model may be based on the spectral residual method or on global contrast, which is not specifically limited in the present disclosure.
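For illustration only, the object-extraction step could be backed by an off-the-shelf detector such as torchvision's Faster R-CNN; this is a sketch of one possible choice among the models listed above, not the model the disclosure mandates:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def extract_objects(image):  # image: a PIL.Image target image
    with torch.no_grad():
        output = model([to_tensor(image)])[0]
    # boxes -> object positions, labels -> first predicted categories,
    # scores -> first confidence levels
    return output["boxes"], output["labels"], output["scores"]
```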
After the object positions of the objects in the target image are obtained through the saliency detection model, the object category of each object is determined according to its object position. The detailed process is as follows: first, the target image is cropped according to the object position of each object to obtain the sub-target image corresponding to each object; then, feature extraction is performed on each sub-target image to obtain the feature vector of each object; finally, the second predicted category corresponding to the feature vector of each object is determined according to the second preset mapping relationship, and this second predicted category is configured as the object category of the object. The second preset mapping relationship includes the association between feature vectors and second predicted categories. In addition, multiple target image samples are acquired in advance, and the binary classification model, the target detection model, the image segmentation model, and the saliency detection model are trained on them; a target image sample may be an image annotated with rectangular boxes or masks.
In an exemplary embodiment of the present disclosure, before object extraction is performed on the target image, whether an object exists in the target image can be judged from the pixel values of the pixels in the target image. If all pixels have the same value, it is determined that no object exists in the target image; if the pixel values differ, it is determined that one or more objects exist. Whether an object exists can also be judged by a binary classification model. When an object exists in the target image, object extraction is performed and the object information of each object is acquired. It is also possible to judge in advance whether an object exists in the picture collected by the camera module and, when one does, to acquire the target image or target video in real time.
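The flat-image check can be sketched directly on the pixel array (assuming NumPy; a real implementation would likely tolerate sensor noise rather than demand exact equality):

```python
import numpy as np

def image_may_contain_objects(image):
    """All pixel values identical -> flat image, judged to contain no
    object; any variation -> one or more objects may exist."""
    pixels = np.asarray(image)
    flat = pixels.reshape(pixels.shape[0] * pixels.shape[1], -1)
    return bool(np.any(flat != flat[0]))
```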
In an exemplary embodiment of the present disclosure, FIG. 4 shows a schematic flowchart of a method for acquiring object information. As shown in FIG. 4, the process may include at least steps S410 to S430, detailed as follows. In step S410, the object position of each object, the first predicted category of each object, and the first confidence level corresponding to the first predicted category are acquired. In an exemplary embodiment of the present disclosure, the target image is input into the target detection model or the image segmentation model to obtain the object position of each object, the first predicted category of each object, and the corresponding first confidence level. The object position may include the position coordinates of the object in the target image; specifically, it may be the coordinate set of the detection box in which the object is located, including the start and end coordinates in the horizontal direction and the start and end coordinates in the vertical direction. The object position may also be the starting coordinate point of the detection box plus the size of the box, where the starting point includes the horizontal and vertical starting coordinates and the size includes the horizontal and vertical dimensions. The first confidence level represents the probability that the first predicted category of the object is its real object category.
In step S420, the feature vector of each object is obtained according to the object position, and the second predicted category of each object and the corresponding second confidence level are determined according to the second preset mapping relationship. In an exemplary embodiment of the present disclosure, the target image is cropped according to the object positions to obtain the sub-target image corresponding to each object, and feature extraction is performed on each sub-target image to obtain the feature vector of each object. Specifically, the sub-target image is input into a feature extraction model to obtain its corresponding feature vector. The feature extraction model may be a color histogram model, which extracts the color features of the sub-target image; a local binary pattern (LBP) model or a gray-level co-occurrence matrix model, which extracts local texture features of the image; or a Canny or Sobel edge detection model, which extracts the edge features of the sub-target image. The feature extraction model may also be a combination of two or more of these models, so that the feature vector of each object is constructed from one or more of the color, local texture, and edge features of its sub-target image. In addition, sub-target image samples, i.e. images containing only a single object, are acquired in advance to train the feature extraction model.
In an exemplary embodiment of the present disclosure, the second preset mapping relationship includes the association between feature vectors and second predicted categories. The feature vector of an object is matched against one or more feature vectors in the second preset mapping relationship, and the matching degrees are obtained. The second predicted category corresponding to the feature vector with the highest matching degree is configured as the second predicted category of the object, and that matching degree is configured as the second confidence level.
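A hedged sketch of this matching, using a colour histogram as the feature vector and cosine similarity as the matching degree (both are just one admissible choice among those listed above; OpenCV and NumPy are assumed):

```python
import cv2
import numpy as np

def colour_histogram(sub_image):
    # a simple feature vector: a normalised 3-D colour histogram
    hist = cv2.calcHist([sub_image], [0, 1, 2], None,
                        [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def second_prediction(feature, preset_mapping):
    """preset_mapping: {category: stored feature vector}. The best cosine
    match yields the second predicted category, and the matching degree
    itself serves as the second confidence level."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(((cat, cosine(feature, vec))
                for cat, vec in preset_mapping.items()),
               key=lambda pair: pair[1])
```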
在步骤S430中,根据第一预测类别和第二预测类别,以及第一预测类别对应的第一置信度和第二预测类别对应的第二置信度确定各对象的对象类别。具体地,图5示出了确定对象类别的方法流程示意图,如图5所示,该流程至少包括步骤S510至步骤S530,详细介绍如下:在步骤S510中,判断第一预测类别与第二预测类别是否相同。在本公开的示例性实施例中,可以将第一预测类别与第二预测类别对应的类别标识进行比对;若第一预测类别对应的类别标识与第二预测类别对应的类别标识相同,则判定第一预测类别与第二预测类别相同;若第一预测类别对应的类别标识与第二预测类别对应的类别标识不同,则判定第一预测类别与第二预测类别不同。在步骤S520中,在第一预测类别与第二预测类别相同时,将第一预测类别或第二预测类别配置为各对象的对象类别。在本公开的示例性实施例中,由于第一预测类别与第二预测类别相同,则可以将第一预测类别或第二预测类别配置为各对象的对象类别。在步骤S530中,在第一预测类别与第二预测类别不同时,判断第一置信度是否大于第二置信度,根据判断结果确定各对象的对象类别。在本公开的示例性实施例中,在第一置信度大于第二置信度时,将第一预测类别配置为各对象的对象类别;在第一置信度小于等于第二置信度时,将第二预测类别配置为各对象的对象类别。In step S430, the object category of each object is determined according to the first predicted category and the second predicted category, and the first confidence level corresponding to the first predicted category and the second confidence level corresponding to the second predicted category. Specifically, FIG. 5 shows a schematic flow chart of a method for determining an object category. As shown in FIG. 5 , the flow includes at least steps S510 to S530. The details are as follows: In step S510, the first prediction category and the second prediction category are determined. whether the categories are the same. In an exemplary embodiment of the present disclosure, the category identifiers corresponding to the first predicted category and the second predicted category may be compared; if the category identifier corresponding to the first predicted category is the same as the category identifier corresponding to the second predicted category, then It is determined that the first predicted category is the same as the second predicted category; if the category identifier corresponding to the first predicted category is different from the category identifier corresponding to the second predicted category, it is determined that the first predicted category is different from the second predicted category. In step S520, when the first predicted class is the same as the second predicted class, the first predicted class or the second predicted class is configured as the object class of each object. In an exemplary embodiment of the present disclosure, since the first predicted category is the same as the second predicted category, the first predicted category or the second predicted category may be configured as the object category of each object. In step S530, when the first predicted category is different from the second predicted category, it is determined whether the first confidence level is greater than the second confidence level, and the object category of each object is determined according to the judgment result. In an exemplary embodiment of the present disclosure, when the first confidence level is greater than the second confidence level, the first prediction category is configured as the object category of each object; when the first confidence level is less than or equal to the second confidence level, the first prediction category is configured as the object category of each object; The second prediction category is configured as the object category of each object.
In an exemplary embodiment of the present disclosure, it may further be judged whether the first confidence level and the second confidence level are each greater than or equal to a confidence threshold. If the first confidence level reaches the threshold and the second does not, the first predicted category is taken as the object category of the object; if the second confidence level reaches the threshold and the first does not, the second predicted category is taken as the object category; if both reach the threshold, the object category of each object is determined according to the foregoing embodiment; and if neither reaches the threshold, the object and its corresponding object information may be discarded.
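Putting steps S510 to S530 together with the threshold check above, the combined decision rule can be sketched as follows; the default threshold value and the `None` return for a discarded object are assumptions.

```python
def resolve_object_category(cat1, conf1, cat2, conf2, threshold=0.5):
    """Fuse the first (cat1, conf1) and second (cat2, conf2) predictions."""
    ok1, ok2 = conf1 >= threshold, conf2 >= threshold
    if not ok1 and not ok2:
        return None                      # discard the object and its object information
    if ok1 and not ok2:
        return cat1
    if ok2 and not ok1:
        return cat2
    if cat1 == cat2:                     # step S520: identical predictions
        return cat1
    # step S530: higher confidence wins; ties go to the second predicted category
    return cat1 if conf1 > conf2 else cat2
```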
In an exemplary embodiment of the present disclosure, information extraction may be performed on the target image to obtain image parameter information of the target image, which may include parameters such as image brightness, chroma, contrast, saturation, or sharpness. Specifically, the image parameter information of the target image may be determined from shooting parameters obtained from the camera module, or from the exchangeable image file (EXIF) information of the target image.
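As one hedged illustration of the EXIF route, the snippet below reads EXIF fields with Pillow. The field names are standard EXIF tags, but not every camera writes them, and most of them live in the Exif sub-IFD (tag 0x8769) rather than the base IFD.

```python
from PIL import Image, ExifTags

def image_parameter_info(path):
    """Derive image parameter information from a file's EXIF data (a sketch)."""
    exif = Image.open(path).getexif()
    tags = dict(exif)
    tags.update(exif.get_ifd(0x8769))    # Exif sub-IFD holds BrightnessValue etc.
    named = {ExifTags.TAGS.get(tag_id, tag_id): value
             for tag_id, value in tags.items()}
    # keep only the parameter fields this embodiment cares about, if present
    return {key: named[key]
            for key in ("BrightnessValue", "Contrast", "Saturation", "Sharpness")
            if key in named}
```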
Continuing to refer to FIG. 3, in step S320, voice information is acquired and text information corresponding to the voice information is recognized, wherein the voice information is information associated with the target image. In an exemplary embodiment of the present disclosure, the voice information may be information associated with the target image. That is, during the time period in which the mobile terminal acquires the target image, or during the time period in which the mobile terminal displays the target image, the recording function may be enabled to collect the user's voice information in real time. For example, in response to a user's video shooting request, the mobile terminal enables the shooting function and the recording function simultaneously, obtains a target video, and extracts multiple target images and the voice information from that video. The video shooting request may be formed by the user's trigger operation on the camera, or by the user's trigger operation on a scanning function in a smart assistant; the mobile terminal may obtain the function permissions of the smart assistant in advance, so that the camera function and the recording function are enabled when the user triggers the scanning function of the smart assistant. For another example, the mobile terminal acquires the target image in advance and stores it in internal or external memory; at the user's request, it retrieves the target image from memory and displays it on the display screen, and during the display of the target image the recording function is enabled by the user's recording request to collect the user's voice information in real time. The video shooting request and the recording request may be formed by the user triggering a video shooting button or recording button on the mobile terminal; for example, when the user taps the camera icon or recording icon, the mobile terminal starts the video shooting function or recording function. The requests may also be formed by the user waking up the video shooting function or recording function of the mobile terminal through a preset voice, which may be voice information customized by the user or preset by the mobile terminal; the present disclosure does not specifically limit this.
In an exemplary embodiment of the present disclosure, after the user's voice information is acquired, the text information corresponding to it is recognized as follows. First, the voice information is preprocessed; the preprocessing may include framing, windowing, and pre-emphasis. For example, pre-emphasis is applied to the speech sequence corresponding to the voice information to increase its high-frequency resolution; the pre-emphasized sequence is then divided into frames to obtain multiple speech subsequences; and each subsequence is windowed, which may include multiplying it by a window function such as a rectangular window, a Hamming window, or a Hanning window. Next, speech features are extracted from the preprocessed voice information. Specifically, the feature parameters of voice information include Mel Frequency Cepstrum Coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC), Line Spectrum Frequencies (LSF), Wavelet Transform Coefficients (WTC), and so on; one or more of these feature parameters may be extracted and used as the speech features corresponding to the voice information. Finally, the speech features may be matched against a speech feature template to obtain the corresponding text information: the speech features are matched against the multiple speech feature samples in the template, and when the speech features match a speech feature sample, the text information sample corresponding to that sample is configured as the text information of the voice information.
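The preprocessing stage (pre-emphasis, framing, windowing) can be sketched directly. The defaults below, frame length 400 and hop 160, correspond to 25 ms frames with a 10 ms step at a 16 kHz sample rate; the sample rate, constants, and zero-padding of the final frame are all assumed choices.

```python
import numpy as np

def preprocess_speech(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a 1-D speech sequence."""
    signal = np.asarray(signal, dtype=float)
    # pre-emphasis raises high-frequency resolution: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # zero-pad so the last frame is complete, then split into overlapping frames
    n_frames = 1 + int(np.ceil(max(0, len(emphasized) - frame_len) / hop))
    padded = np.append(emphasized,
                       np.zeros((n_frames - 1) * hop + frame_len - len(emphasized)))
    frames = np.stack([padded[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # windowing: multiply every frame (speech subsequence) by a Hamming window
    return frames * np.hamming(frame_len)
```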
In addition, if the speech features corresponding to the voice information cannot be matched exactly against the speech feature samples of the template, the matching degree between each speech feature sample and the speech features is obtained, and the text information sample corresponding to the sample with the largest matching degree is configured as the text information of the voice information. The speech feature template includes multiple speech feature samples and the text information sample corresponding to each. The template is constructed as follows: first, multiple text information samples are obtained, together with the voice information corresponding to each text information sample; then, the speech feature samples corresponding to that voice information are obtained according to the process described above; finally, the template is built from the mapping relationship between the text information samples and their corresponding speech feature samples.
In an exemplary embodiment of the present disclosure, when the target image or target video is acquired or displayed, the recording function is enabled and used to judge whether voice information is present; if voice information is present, it is collected. In an exemplary embodiment of the present disclosure, after the text information corresponding to the voice information is obtained, word segmentation is performed on the text information to obtain one or more keywords. The word segmentation may use either of two methods. The first is dictionary-based segmentation: the text information is split into multiple words according to a dictionary, and the words are then combined. The dictionary may be built in advance with its entries annotated by part of speech, so that after the text information is divided into keywords, the part of speech of each keyword follows from the dictionary; alternatively, an unannotated dictionary may be used, with part-of-speech recognition performed on each keyword after segmentation. According to part of speech, the keywords corresponding to the text information may include entity participles, description participles, verb participles, and so on. An entity participle denotes or refers to a real object, such as a noun or pronoun participle ("flower", "clothes", "you"); a description participle expresses relationships between objects or describes an item, such as an adjective or adverb participle ("left side", "beautiful", "so dark"). The second method is character-based segmentation: the text information is split into individual characters, which are then combined into words, for example according to the dictionary. Of course, a statistics-based segmentation algorithm may also be used; the present disclosure does not specifically limit the segmentation algorithm.
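A minimal dictionary-based segmenter in the forward-maximum-matching style could look like the following; the POS-tagged dictionary layout and the fallback to single characters are assumptions, not the embodiment's prescribed algorithm.

```python
def segment(text, dictionary):
    """Forward maximum matching against a POS-tagged dictionary.
    dictionary: word -> part of speech, e.g. {"flowerpot": "noun", "left": "adverb"}."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest candidate first
            word = text[i:j]
            if word in dictionary or j == i + 1:  # single-character fallback
                tokens.append((word, dictionary.get(word, "unknown")))
                i = j
                break
    return tokens
```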
Continuing to refer to FIG. 3, in step S330, instruction information is generated according to the text information and the feature information of the target image. In an exemplary embodiment of the present disclosure, the feature information of the target image is the object category and object position of each object in the target image. FIG. 6 shows a schematic flowchart of a method for generating instruction information; as shown in FIG. 6, the flow includes at least steps S610 and S620, detailed as follows. In step S610, the object information of each object is matched against the text information, and the target object is determined according to the matching result. In an exemplary embodiment of the present disclosure, the object category and object position of each object are matched against the entity participles and description participles in the text information; when an object's category and position match the entity and description participles, the object is determined as the target object.
Specifically, FIG. 7 shows a schematic flowchart of a method for determining the target object according to the matching result; as shown in FIG. 7, the flow includes at least steps S710 to S740, detailed as follows. In step S710, the object topological relationship is determined according to the object categories and object positions. In an exemplary embodiment of the present disclosure, the object position includes the position coordinates of each object in the target image; subtracting the position coordinates of any two objects yields the relative positional relationship between them. The object category of each object is used as a label, and the object topological relationship is generated from the relative positional relationships between objects; it thus records the object category and object position of each object and the relative positional relationships between objects. This step may also be performed after the object information of each object is acquired, which the present disclosure does not specifically limit. In step S720, the object category matching an entity participle is determined as the target object category. In an exemplary embodiment of the present disclosure, the entity participles in the text information are matched against the object category of each object; when an object category matches an entity participle, it is determined as the target object category. This embodiment uses the entity participles of the voice information to screen the multiple object categories: if one or more object categories appear in the text information corresponding to the voice information, those categories are determined as target object categories. In step S730, candidate objects are determined among the objects according to the target object category. In an exemplary embodiment of the present disclosure, several objects in the target image may share the same object category; the target object category screened out in the previous step further filters the multiple objects, and an object whose category is the target object category is determined as a candidate object. In step S740, the target object is determined from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description participles. In an exemplary embodiment of the present disclosure, the object topological relationship corresponding to each candidate object is identified within the overall topology, the description participles in the text information are matched against it, and when a description participle matches a candidate's topological relationship, that candidate object is determined as the target object.
This embodiment uses the description participles of the voice information to screen the one or more candidate objects and thereby determine the target object among them, improving the accuracy of information acquisition. In addition, the number of candidate objects may be obtained; when there is a single candidate object, it is directly determined as the target object, improving the efficiency of information acquisition and reducing system consumption. It should be noted that step S710 may be performed before step S720, after step S730, or simultaneously with steps S720 and S730; the present disclosure does not specifically limit this.
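The steps of FIG. 7 can be illustrated with a small sketch. The box-centre representation, the left-of/right-of relation, and the handling of only the description word "left" are assumptions, chosen to mirror the flowerpot example given later in this description.

```python
def build_topology(objects):
    """objects: dicts with 'category' and 'position' (x, y box centre).
    Step S710: pairwise relative positions labelled by category."""
    relations = []
    for a in objects:
        for b in objects:
            if a is not b:
                rel = "left-of" if a["position"][0] < b["position"][0] else "right-of"
                relations.append((a["category"], rel, b["category"]))
    return relations

def find_target(objects, entity_words, description_words):
    # steps S720/S730: entity participles pick the target category and its candidates
    candidates = [o for o in objects if o["category"] in entity_words]
    if len(candidates) == 1:                  # single candidate: return it directly
        return candidates[0]
    # step S740: description participles filter candidates via the topology;
    # only "left" is handled here, as in the flowerpot example
    if "left" in description_words and candidates:
        return min(candidates, key=lambda o: o["position"][0])
    return candidates[0] if candidates else None
```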
In an exemplary embodiment of the present disclosure, sight line information may also be acquired and the gaze position corresponding to it determined. The gaze position may be a gaze point, or a gaze area on a two-dimensional plane. The sight line information is generated for the target image and is acquired in real time while the target image is being shot. Since there may be multiple target images, there may also be multiple pieces of sight line information, and the association between a target image and a piece of sight line information is determined from the shooting time of the target image and the acquisition time of the sight line information. The user's sight line information for the target image may be obtained through the camera module or smart screen of the mobile terminal, for example in real time through a camera module built into a smart helmet or smart glasses. Specifically, the sight line information may further include a left-eye image, a right-eye image, a face image, and a face position; the face image can provide head posture information and the face position can provide eye position information. With the sight line information as input, a gaze point estimation algorithm determines the gaze point corresponding to the sight line information. A head picture and head position may also be used as input to determine a corresponding gaze area; the present disclosure does not specifically limit how the gaze position is obtained.
In an exemplary embodiment of the present disclosure, the gaze area corresponding to the sight line information allows the target objects determined in the foregoing embodiment to be screened more accurately, so as to determine the target object the user pays most attention to. FIG. 8 shows a schematic flowchart of a method for determining the target object from the candidate objects; as shown in FIG. 8, the flow includes at least steps S810 to S830, detailed as follows. In step S810, candidate target objects are determined from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description participles. In an exemplary embodiment of the present disclosure, when multiple candidate objects match the description participles within the topology of the candidate objects, those candidate objects are taken as candidate target objects. In step S820, the object position of each candidate target object is matched against the gaze position. In an exemplary embodiment of the present disclosure, the object position of each candidate target object is obtained and matched against the gaze position: if the gaze position is a gaze point, it is judged whether the point lies within the detection box determined by the object position of each candidate target object; if the gaze position is a gaze area, the degree of overlap between the area and each candidate target object's detection box is calculated. In step S830, when the object position of a candidate target object matches the gaze position, that candidate target object is determined as the target object. In an exemplary embodiment of the present disclosure, if the gaze point lies within the detection box corresponding to a candidate target object, its object position is judged to match the gaze position and the candidate target object is determined as the target object; alternatively, the candidate target object whose detection box has the largest overlap with the gaze area may be obtained and determined as the target object.
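Steps S820 and S830 amount to a point-in-box test for a gaze point and an overlap-area maximization for a gaze region. The sketch below assumes axis-aligned detection boxes stored as (x1, y1, x2, y2).

```python
def match_gaze(candidates, gaze):
    """candidates: objects with 'box' = (x1, y1, x2, y2).
    gaze: either a point (x, y) or a region (x1, y1, x2, y2)."""
    if len(gaze) == 2:                       # gaze point: point-in-box test
        x, y = gaze
        for obj in candidates:
            x1, y1, x2, y2 = obj["box"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                return obj
        return None
    # gaze region: pick the candidate whose box overlaps the region most
    def overlap(box):
        x1 = max(box[0], gaze[0]); y1 = max(box[1], gaze[1])
        x2 = min(box[2], gaze[2]); y2 = min(box[3], gaze[3])
        return max(0, x2 - x1) * max(0, y2 - y1)
    return max(candidates, key=lambda o: overlap(o["box"]), default=None)
```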
In an exemplary embodiment of the present disclosure, after the object information of each object in the target image, the text information corresponding to the voice information, and the gaze position corresponding to the sight line information have been obtained, it is also possible to first determine candidate objects among the objects according to the object information and the gaze position, and then determine the target object among the candidate objects according to the text information.
Specifically, FIG. 9 shows a schematic flowchart of a method for determining the target object; as shown in FIG. 9, the flow includes at least steps S910 to S930, detailed as follows. In step S910, the object positions matching the gaze position are determined as target object positions. In an exemplary embodiment of the present disclosure, the object position of each object in the target image is matched against the gaze position, and when an object's position matches the gaze position, that object position is determined as a target object position. Since the object positions of different objects may overlap, the gaze position may match multiple object positions, and there may therefore be multiple determined target object positions. In step S920, candidate objects are determined among the objects according to the target object positions. In an exemplary embodiment of the present disclosure, the objects corresponding to the target object positions are determined as candidate objects. In step S930, the object information of each candidate object is matched against the text information, and the target object is determined according to the matching result. In an exemplary embodiment of the present disclosure, the foregoing steps screen the multiple objects according to the gaze position to obtain candidate objects; the object information of the candidate objects is then matched against the text information to determine the target object among them. Specifically, first, the object categories of the candidate objects matching the entity participles in the text information are determined as target object categories; then, candidate objects are further selected according to the target object categories; finally, the target object is determined from these candidates according to the corresponding object topological relationships and the description participles in the text information. The object topological relationships may be determined from the object categories and positions of all objects, with the relationships corresponding to the candidates identified within them, or determined directly from the object categories and positions of the candidate objects. The detailed process of determining the target object from the object information and text information of the candidate objects is as described in the method embodiment of FIG. 7 above and is not repeated here.
In the instruction information acquisition method of the exemplary embodiments of the present disclosure, the feature information of three modalities, namely the sight line information, the voice information, and the feature information of the target image, is fused to determine the instruction information, which further improves the accuracy of the instruction information determined from the voice information and the feature information of the target image.
Continuing to refer to FIG. 6, in step S620, the instruction information is generated according to the object information of the target object. Specifically, FIG. 10 shows a schematic flowchart of another method for generating instruction information; as shown in FIG. 10, the flow includes at least steps S1010 and S1020, detailed as follows. In step S1010, user intent information is determined according to the text information. In an exemplary embodiment of the present disclosure, the user's intent information may be recognized from the text information: word segmentation is performed on the text information to obtain one or more keywords, and the user intent information corresponding to the keywords is determined according to a first preset mapping relationship. Specifically, the one or more keywords of the text information are matched against the keywords in the first preset mapping relationship, and the user intent information associated with the matching keywords in the mapping is obtained. The verb participles and/or adjective participles in the text information may also be matched against the first preset mapping relationship to improve the efficiency of obtaining user intent information. The first preset mapping relationship includes association relationships between keywords and user intent information; one keyword may correspond to multiple pieces of user intent information, and one piece of user intent information may correspond to multiple keywords. For example, for the keywords "buy", "want", "really like", the corresponding user intent information may be "obtain a purchase link"; for the keyword "what", it may be "query detail information, obtain a purchase link"; for the keywords "so dark", "hard to see", it may be "adjust the brightness of the image, adjust the contrast of the image"; and so on. In step S1020, the instruction information is generated according to the object information of the target object and the user intent information.

In an exemplary embodiment of the present disclosure, the sub-target image corresponding to the target object is obtained according to the object information of the target object, and the instruction information is generated according to the object category of the target object, the sub-target image of the target object, and the user intent information. In an exemplary embodiment of the present disclosure, an object acquisition path related to the object information of the target object is obtained and displayed according to the user intent information in the instruction information: the object acquisition path may be queried according to the object category and/or sub-target image of the target object and displayed on the display screen of the mobile terminal. For example, if the user intent information is to obtain a purchase link, the object category and/or sub-target image of the target object may be submitted to a purchase platform and the purchase link returned by the platform obtained. In addition, object detail information related to the object information of the target object may be obtained and displayed according to the user intent information in the instruction information, or both the object acquisition path and the object detail information may be obtained and displayed. With the instruction information acquisition method of this exemplary embodiment, in cases where the user cannot clearly express a demand through the voice information or the target image alone, the scheme fuses the feature information of the target image with the text information corresponding to the voice information to obtain the instruction information, and then recommends information of interest to the user according to the instruction information. This exemplary embodiment determines user instruction information more accurately, provides the user with more precise recommendation information, and improves the interaction experience between the user and the mobile terminal.
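The first preset mapping relationship is essentially a many-to-many keyword-to-intent table. A minimal sketch, populating it with the example entries from the text above (translated), might look like this.

```python
# keyword -> user intent information; entries taken from the examples above
FIRST_PRESET_MAPPING = {
    "buy":     ["obtain purchase link"],
    "want":    ["obtain purchase link"],
    "what":    ["query detail information", "obtain purchase link"],
    "so dark": ["adjust image brightness", "adjust image contrast"],
}

def intents_for(keywords):
    """Collect the intent information of every keyword found in the mapping."""
    intents = []
    for word in keywords:
        intents.extend(FIRST_PRESET_MAPPING.get(word, []))
    return intents
```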
In an exemplary embodiment of the present disclosure, when the feature information of the target image is the image parameter information of the target image, the user intent information is determined according to the text information, and parameter adjustment information is generated according to the user intent information and the image parameter information. In this case the instruction information may be the parameter adjustment information: the mobile terminal can adjust the parameters of the target image according to the parameter adjustment information and display the adjusted target image on the display screen. For example, if the image parameter information of the target image is "brightness value 65", the text information corresponding to the user's voice information is "the shot is so dark", and the user intent information recognized from the text information is "increase the brightness of the image", then the instruction information generated by the method of the foregoing embodiments may be "adjust the brightness of the target image, raising its brightness value to 65+N", where N is a positive integer whose value can be set according to the actual scenario; the present disclosure does not specifically limit this.
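A minimal sketch of this brightness case follows; the `step` argument stands in for the unspecified N and is an assumed tuning constant, as is the dictionary shape of the inputs and output.

```python
def parameter_adjustment(image_params, user_intent, step=20):
    """image_params: e.g. {"brightness": 65}. Returns the parameter
    adjustment instruction, or None if the intent is not a brightness request."""
    if user_intent == "increase image brightness":
        return {"action": "adjust_brightness",
                "target_value": image_params["brightness"] + step}  # 65 + N
    return None
```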
In an exemplary embodiment of the present disclosure, when the feature information of the target image includes both the object information of each object in the target image and the image parameter information of the target image, parameter adjustment information may be generated according to the text information and the image parameter information of the target image; the target object may be determined according to the object information of each object in the target image and the text information; and the instruction information may be generated according to the parameter adjustment information and the object information of the target object. The methods for generating the parameter adjustment information and for determining the target object have been described in detail in the foregoing embodiments and are not repeated here. The sub-target image corresponding to the target object may be obtained according to the object position of the target object, the parameters of that sub-target image adjusted according to the parameter adjustment information, and the parameter-adjusted target image or parameter-adjusted sub-target image displayed. In addition, according to the parameter-adjusted sub-target image of the target object, the object acquisition path and object detail information of the target object may be obtained and displayed.
In an exemplary embodiment of the present disclosure, when there are multiple target images, candidate instruction information corresponding to each target image may be determined according to the methods of the foregoing embodiments, and the instruction information then determined from the per-image candidate instruction information. Specifically, FIG. 11 shows a schematic flowchart of yet another method for generating instruction information; as shown in FIG. 11, the flow includes at least steps S1110 to S1130, detailed as follows. In step S1110, the candidate instruction information corresponding to each target image is determined according to that image's feature information and the text information. In an exemplary embodiment of the present disclosure, the multiple target images come from a captured target video; the feature information of each target image is obtained, and the voice information in the target video is obtained and its text information recognized. The multiple target images may correspond to one piece of voice information or to several, and the multiple candidate instruction information items are determined from each target image's feature information together with the text information of its associated voice information. In step S1120, when the candidate instruction information of all target images is the same, the candidate instruction information is configured as the instruction information. In an exemplary embodiment of the present disclosure, the per-image candidates are matched against one another; if they match completely, or the matching degree between every pair of candidates exceeds a matching threshold, any one of them may be configured as the instruction information. The matching threshold may be set according to the actual situation, for example 99% or 99.5%; the present disclosure does not specifically limit it. In step S1130, when the candidate instruction information differs across target images, the instruction information is determined according to the confidence level corresponding to each candidate.

In an exemplary embodiment of the present disclosure, the confidence level of a candidate may be the confidence level of the user intent information in that candidate, the confidence level of the target object, the product of the two, and so on; the present disclosure does not specifically limit this. The confidence level of the user intent information may be the matching degree between the keywords of the text information and the keywords of the first preset mapping relationship; the confidence level of the target object may be the confidence level of its object category or object position, or the matching degree between its object information and the text information. In addition, in the instruction information acquisition method provided by the exemplary embodiments of the present disclosure, the voice information, target image or target video, and sight line information may also be obtained through a smart assistant, which may be an application running on the mobile terminal. For convenience of operation, a quick-start function for the smart assistant may be preset; for example, when the mobile terminal's screen is off, the smart assistant can be entered by pressing the power button three times. Other shortcuts may also be used to enter the smart assistant; the present disclosure does not specifically limit this. The instruction information acquisition method of this embodiment can start the smart assistant on the mobile terminal through a shortcut, simplifying the otherwise tedious start-up steps and making the start-up of the smart assistant more intelligent, rapid, convenient, and accurate.
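Steps S1120 and S1130 reduce to a simple agree-or-arbitrate rule over the per-image candidates; the sketch below uses exact equality for the agreement test (rather than a matching-degree threshold) and falls back to the confidence levels, both assumed simplifications.

```python
def merge_candidates(candidates):
    """candidates: list of (instruction, confidence) pairs, one per target image."""
    instructions = [ins for ins, _ in candidates]
    # step S1120: if every per-image candidate agrees, any one of them is the result
    if all(ins == instructions[0] for ins in instructions):
        return instructions[0]
    # step S1130: otherwise the candidate with the highest confidence wins
    return max(candidates, key=lambda pair: pair[1])[0]
```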
The instruction information acquisition method of this exemplary embodiment is described in detail below with reference to a specific scenario. FIG. 12 shows a schematic flowchart of instruction information acquisition according to a specific embodiment of the present disclosure; as shown in FIG. 12: in step S1201, the target image is acquired, object extraction is performed on the target image, and the object information of each object is obtained, where the object information includes object category and object position; in step S1203, the object topological relationship is determined according to the object categories and object positions; in step S1205, the voice information associated with the target image is acquired and its corresponding text information determined, where the text information includes entity participles and description participles; in step S1207, word segmentation is performed on the text information to obtain one or more keywords, including entity participles and description participles; in step S1209, the object category matching an entity participle is determined as the target object category; in step S1211, candidate objects are determined among the objects according to the target object category; in step S1213, the target object is determined from the candidate objects according to the corresponding object topological relationships and the description participles; in step S1215, the user intent information is determined according to the text information; and in step S1217, the instruction information is generated according to the object information of the target object and the user intent information.
The instruction information acquisition method of this exemplary embodiment is described in detail below with reference to another specific scenario. FIG. 13 shows a schematic flowchart of instruction information acquisition according to a specific embodiment of the present disclosure; as shown in FIG. 13: in step S1301, the target image is acquired, object extraction is performed on the target image, and the object information of each object is obtained, where the object information includes object category and object position; in step S1303, the voice information associated with the target image is acquired and its corresponding text information determined, where the text information includes entity participles and description participles; in step S1305, word segmentation is performed on the text information to obtain one or more keywords, including entity participles and description participles; in step S1307, the sight line information associated with the target image is acquired and the corresponding gaze position determined; in step S1309, the object positions matching the gaze position are determined as target object positions; in step S1311, preliminary candidate objects are determined among the objects according to the target object positions; in step S1313, the topological relationships of these candidates are determined according to their object categories and positions; in step S1315, the object category matching an entity participle is determined as the target object category; in step S1317, candidate objects are further selected from them according to the target object category; in step S1319, the target object is determined from the candidate objects according to the corresponding object topological relationships and the description participles; in step S1321, the user intent information is determined according to the text information; and in step S1323, the instruction information is generated according to the object information of the target object and the user intent information.
The instruction information acquisition method of this exemplary embodiment is described in detail below with reference to yet another specific scenario. FIG. 14 shows a schematic flowchart of instruction information acquisition according to a specific embodiment of the present disclosure; as shown in FIG. 14: in step S1401, the target image is acquired, object extraction is performed on the target image, and the object information of each object is obtained, where the object information includes object category and object position; in step S1403, the object topological relationship is determined according to the object categories and object positions; in step S1405, the voice information associated with the target image is acquired and its corresponding text information determined, where the text information includes entity participles and description participles; in step S1407, word segmentation is performed on the text information to obtain one or more keywords, including entity participles and description participles; in step S1409, the sight line information associated with the target image is acquired and the corresponding gaze position determined; in step S1411, the object category matching an entity participle is determined as the target object category; in step S1413, candidate objects are determined among the objects according to the target object category; in step S1415, candidate target objects are determined from the candidate objects according to the corresponding object topological relationships and the description participles; in step S1417, the object position of each candidate target object is matched against the gaze position; in step S1419, when the object position of a candidate target object matches the gaze position, that candidate target object is determined as the target object; in step S1421, the user intent information is determined according to the text information; and in step S1423, the instruction information is generated according to the object information of the target object and the user intent information.
For example, with the target image shown in FIG. 15, the target image 1500 is recognized by an object detection algorithm and found to contain four objects, whose object categories are "flower", "flowerpot 1", "flowerpot 2", and "water dispenser". The text information corresponding to the user's voice information is "I want to buy the flowerpot on the left", from which the user intent information "obtain a purchase path" is recognized; the user's gaze position 1501 on the target image 1500 is also obtained. After the object categories and object positions of the objects in the target image 1500 are acquired, the object topological relationship determined from them is: "at the far left of the image is a potted flower", "below the flower is flowerpot 1", "at the far right of the image is the water dispenser", "to the left of the water dispenser is flowerpot 2".
First, word segmentation of the text information yields the noun participle "flowerpot" and the description participles "left" and "that". Then, the noun participle is matched against the object categories to determine the target object category "flowerpot", which yields the candidate objects "flowerpot 1" and "flowerpot 2"; matching the object topological relationships of the candidates against the description participles leaves "flowerpot 1" and "flowerpot 2" as candidate target objects. Next, the object positions of the candidate target objects "flowerpot 1" and "flowerpot 2" are matched against the gaze position, and the target object the user pays most attention to is determined to be "flowerpot 1". Finally, the instruction information "search for the same item as flowerpot 1" is generated, and the mobile terminal sends the sub-target image corresponding to "flowerpot 1" to the corresponding shopping website to obtain a shopping link for the same item as "flowerpot 1".
The apparatus embodiments of the present disclosure are described below; they may be used to perform the instruction information acquisition method of the present disclosure described above. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the instruction information acquisition method described above. FIG. 16 schematically shows a block diagram of an instruction information acquisition apparatus according to an embodiment of the present disclosure.
Referring to FIG. 16, an instruction information acquisition apparatus 1600 according to an embodiment of the present disclosure includes an image information extraction module 1601, a text information acquisition module 1602, and an instruction information generation module 1603. Specifically, the image information extraction module 1601 is configured to acquire a target image and perform information extraction on the target image to obtain the feature information of the target image; the text information acquisition module 1602 is configured to acquire voice information and recognize the text information corresponding to the voice information, where the voice information is information associated with the target image; and the instruction information generation module 1603 is configured to generate instruction information according to the text information and the feature information of the target image.
在本公开的示例性实施例中,图像信息提取模块1601,还可以用于对目标图像进行对象提取,获取目标图像中各对象的对象信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于分别将各对象信息与文本信息进行匹配,根据匹配结果确定目标对象;根据目标对象的对象信息生成指令信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据对象类别和对象位置确定对象拓扑关系;将与实体分词匹配的对象类别确定为目标对象类别;根据目标对象类别在各对象中确定候选对象;根据候选对象对应的对象拓扑关系以及描述分词,从候选对象中确定目标对象。其中,对象信息包括对象类别和对象位置,文本信息包括实体分词以及描述分词。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于获取视线信息并确定视线信息对应的注视位置;根据候选对象对应的对象拓扑关系以及描述分词,从候选对象中确定候选目标对象;将候选目标对象的对象位置与注视位置进行匹配;在候选目标对象的对象位置与注视位置相匹配时,将候选目标对象确定为目标对象。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于将与注视位置匹配的对象位置确定为目标对象位置;根据目标对象位置在各对象中确定备选对象;将各备选对象的对象信息与文本信息进行匹配,根据匹配结果确定目标对象。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据文本信息确定用户意图信息;并根据目标对象的对象信息以及用户意图信息生成指令信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于对文本信息进行分词处理,以获得一个或多个关键词;根据第一预设映射关系确定关键词对应的用户意图信息,第一预设映射关系包括关键词与用户意图信息的关联关系。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于获取各对象的对象位置、各对象的第一预测类别,以及第一预测类别对应的第一置信度;根据对象位置获取各对象的特征向量,并根据第二预设映射关系确定各对象的第二预测类别以及第二预测类别对应的第二置信度;根据第一预测类别和第二预测类别,以及第一预测类别对应的第一置信度和第二预测类别对应的第二置信度确定各对象的对象类别;其中,第二预设映射关系包括特征向量与第二预测类别的关联关系。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于根据对象位置对目标图像进行裁剪,以得到与各对象对应的子目标图像;对子目标图像进行特征提取,以得到各对象的特征向量。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于判断第一预测类别与第二预测类别是否相同;在第一预测类别与第二预测类别相同时,将第一预测类别配置为各对象的对象类别;在第一预测类别与第二预测类别不同时,判断第一置信度是否大于第二置信度,根据判断结果确定各对象的对象类别。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于在第一置信度大于第二置信度时,将第一预测类别配置为各对象的对象类别;在第一置信度小于等于第二置信度时,将第二预测类别配置为各对象的对象类别。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于对目标图像进行信息提取,以得到目标图像的图像参数信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据文本信息确定用户意图信息;并根据用户意图信息和图像参数信息生成参数调整信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据文本信息和目标图像的图像参数信息生成参数调整信息;以及根据目标图像中各对象的对象信息和文本信息确定目标对象;根据参数调整信息和目标对象的对象信息生成指令信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于分别根据各目标图像的特征信息与文本信息确定各目标图像对应的备选指令信息;在各目标图像对应的备选指令信息相同时,将备选指令信息配置为指令信息;在各目标图像对应的备选指令信息不同时,根据各备选指令信息对应的置信度确定指令信息。其 中,目标图像包括多个。在本公开的示例性实施例中,指令信息获取装置还可以包括信息显示模块(图中未示出),该信息显示模块用于根据指令信息中的用户意图信息获取并显示与目标对象的对象信息相关的对象获取路径;和/或根据指令信息中的用户意图信息获取并显示与目标对象的对象信息相关的对象详情信息。在本公开的示例性实施例中,信息显示模块还可以用于根据参数调整信息对目标图像进行参数调整,并显示参数调整后的目标图像。In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may also be configured to perform object extraction on the target image to obtain object information of each object in the target image. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may also be configured to match each object information with text information respectively, determine the target object according to the matching result, and generate instruction information according to the object information of the target object. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 can also be used to determine the object topological relationship according to the object category and the object position; determine the object category that matches the entity word segmentation as the target object category; A candidate object is determined from each object; a target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate object and the description word segmentation. The object information includes object category and object location, and the text information includes entity word segmentation and description word segmentation. 
In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: acquire line-of-sight information and determine a gaze position corresponding to the line-of-sight information; determine a candidate target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment; match the object position of the candidate target object with the gaze position; and, when the object position of the candidate target object matches the gaze position, determine the candidate target object as the target object. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: determine the object position matching the gaze position as a target object position; determine alternative objects among the objects according to the target object position; and match the object information of each alternative object with the text information, determining the target object according to the matching result. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to determine user intent information according to the text information, and to generate the instruction information according to the object information of the target object and the user intent information. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: perform word segmentation on the text information to obtain one or more keywords; and determine the user intent information corresponding to the keywords according to a first preset mapping relationship, the first preset mapping relationship including associations between keywords and user intent information. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to: acquire the object position of each object, a first predicted category of each object, and a first confidence corresponding to the first predicted category; acquire a feature vector of each object according to the object position, and determine a second predicted category of each object and a second confidence corresponding to the second predicted category according to a second preset mapping relationship; and determine the object category of each object according to the first predicted category and the second predicted category, as well as the first confidence corresponding to the first predicted category and the second confidence corresponding to the second predicted category; the second preset mapping relationship includes associations between feature vectors and second predicted categories. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to crop the target image according to the object positions to obtain sub-target images corresponding to the objects, and to perform feature extraction on the sub-target images to obtain the feature vectors of the objects.
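The keyword-to-intent lookup described above can be illustrated as follows; the segmentation rule and the contents of the first preset mapping relationship are invented for the example and stand in for whatever segmenter and mapping a concrete implementation would use.

```python
# Illustrative only: word segmentation plus a "first preset mapping
# relationship" from keywords to user intent information. The mapping
# entries here are made up for the sketch.

FIRST_PRESET_MAPPING = {
    "buy": "intent:purchase",
    "price": "intent:purchase",
    "what": "intent:query_details",
    "brighten": "intent:adjust_brightness",
}

def determine_user_intent(text_info):
    # A real system would use a proper word segmenter; a whitespace
    # split stands in for word segmentation here.
    keywords = text_info.lower().split()
    for kw in keywords:
        if kw in FIRST_PRESET_MAPPING:
            return FIRST_PRESET_MAPPING[kw]
    return "intent:unknown"

print(determine_user_intent("what is this plant"))  # intent:query_details
```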
In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to: determine whether the first predicted category is the same as the second predicted category; when the first predicted category is the same as the second predicted category, configure the first predicted category as the object category of the object; and when the first predicted category differs from the second predicted category, determine whether the first confidence is greater than the second confidence and determine the object category of the object according to the determination result. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to configure the first predicted category as the object category of the object when the first confidence is greater than the second confidence, and to configure the second predicted category as the object category of the object when the first confidence is less than or equal to the second confidence. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to perform information extraction on the target image to obtain image parameter information of the target image. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to determine user intent information according to the text information, and to generate parameter adjustment information according to the user intent information and the image parameter information. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: generate parameter adjustment information according to the text information and the image parameter information of the target image; determine the target object according to the object information of each object in the target image and the text information; and generate the instruction information according to the parameter adjustment information and the object information of the target object. In an exemplary embodiment of the present disclosure, where there are a plurality of target images, the instruction information generation module 1603 may further be configured to: determine candidate instruction information corresponding to each target image according to the feature information of that target image and the text information; when the candidate instruction information corresponding to the target images is the same, configure the candidate instruction information as the instruction information; and when the candidate instruction information corresponding to the target images differs, determine the instruction information according to the confidences corresponding to the pieces of candidate instruction information. In an exemplary embodiment of the present disclosure, the instruction information acquisition apparatus may further include an information display module (not shown in the figure), configured to acquire and display, according to the user intent information in the instruction information, an object acquisition path related to the object information of the target object, and/or to acquire and display, according to the user intent information in the instruction information, object detail information related to the object information of the target object.
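The two-branch category decision above (a first predicted category from detection, a second from the feature-vector mapping) reduces to a simple comparison. The sketch below assumes plain strings for categories and floats for confidences; it is one reading of the rule, not the disclosed implementation.

```python
# Hypothetical sketch of the decision rule: keep the category when both
# branches agree; otherwise keep the branch with the higher confidence,
# with the second branch winning on a tie (first_conf <= second_conf).

def fuse_category(first_cat, first_conf, second_cat, second_conf):
    if first_cat == second_cat:
        return first_cat
    return first_cat if first_conf > second_conf else second_cat

assert fuse_category("cup", 0.9, "cup", 0.4) == "cup"   # agreement
assert fuse_category("cup", 0.7, "mug", 0.8) == "mug"   # second wins
assert fuse_category("cup", 0.6, "mug", 0.6) == "mug"   # tie -> second
```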
In an exemplary embodiment of the present disclosure, the information display module may further be configured to perform parameter adjustment on the target image according to the parameter adjustment information, and to display the parameter-adjusted target image.
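Where a plurality of target images are processed, the arbitration described above (identical candidate instruction information passes through; otherwise the highest-confidence candidate wins) might look like the following; the (instruction, confidence) tuple structure is an assumption made for the sketch.

```python
# Illustrative arbitration over per-image candidate instruction
# information, each candidate carried here as (instruction, confidence).

def arbitrate(candidates):
    instructions = {inst for inst, _ in candidates}
    if len(instructions) == 1:
        # All target images produced the same candidate instruction.
        return candidates[0][0]
    # Otherwise choose the candidate with the highest confidence.
    return max(candidates, key=lambda c: c[1])[0]

print(arbitrate([("open:map", 0.8), ("open:map", 0.9)]))     # open:map
print(arbitrate([("open:map", 0.6), ("zoom:photo", 0.75)]))  # zoom:photo
```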
The specific details of each module in the above instruction information acquisition apparatus have been described in detail in the embodiments of the instruction information acquisition method; for details not disclosed here, reference may be made to those embodiments, and they are therefore not repeated.
Exemplary embodiments of the present disclosure further provide a computer-readable storage medium storing a program product capable of implementing the method described above in this specification. In some possible implementations, aspects of the present disclosure may also be implemented in the form of a program product including program code; when the program product runs on a mobile terminal, the program code causes the mobile terminal to perform the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section above, for example, any one or more of the steps in FIG. 3 to FIG. 14.
Exemplary embodiments of the present disclosure further provide a program product for implementing the above method, which may take the form of a portable compact disc read-only memory (CD-ROM) including program code, and which may run on a terminal such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
Program code contained on a readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
Program code for carrying out the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet by using an Internet service provider).
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations thereof that follow its general principles and include common general knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the claims.

Claims (21)

  1. An instruction information acquisition method, comprising:
    acquiring a target image and performing information extraction on the target image to obtain feature information of the target image;
    acquiring voice information and recognizing text information corresponding to the voice information, wherein the voice information is information associated with the target image; and
    generating instruction information according to the text information and the feature information of the target image.
  2. The instruction information acquisition method according to claim 1, wherein performing information extraction on the target image to obtain the feature information of the target image comprises:
    performing object extraction on the target image to acquire object information of each object in the target image.
  3. The instruction information acquisition method according to claim 2, wherein generating the instruction information according to the text information and the feature information of the target image comprises:
    matching each piece of the object information with the text information, and determining a target object according to the matching result; and
    generating the instruction information according to the object information of the target object.
  4. The instruction information acquisition method according to claim 3, wherein the object information comprises an object category and an object position, and the text information comprises an entity segment and a description segment; and
    matching each piece of the object information with the text information and determining the target object according to the matching result comprises:
    determining an object topological relationship according to the object categories and the object positions;
    determining the object category matching the entity segment as a target object category;
    determining candidate objects among the objects according to the target object category; and
    determining the target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment.
  5. The instruction information acquisition method according to claim 4, wherein determining the target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment comprises:
    acquiring line-of-sight information and determining a gaze position corresponding to the line-of-sight information;
    determining a candidate target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment;
    matching the object position of the candidate target object with the gaze position; and
    when the object position of the candidate target object matches the gaze position, determining the candidate target object as the target object.
  6. The instruction information acquisition method according to claim 3, wherein matching each piece of the object information with the text information and determining the target object according to the matching result comprises:
    determining the object position matching a gaze position as a target object position;
    determining alternative objects among the objects according to the target object position; and
    matching the object information of each alternative object with the text information, and determining the target object according to the matching result.
  7. The instruction information acquisition method according to claim 3, wherein generating the instruction information according to the object information of the target object comprises:
    determining user intent information according to the text information; and
    generating the instruction information according to the object information of the target object and the user intent information.
  8. The instruction information acquisition method according to claim 7, wherein determining the user intent information according to the text information comprises:
    performing word segmentation on the text information to obtain one or more keywords; and
    determining the user intent information corresponding to the keywords according to a first preset mapping relationship, wherein the first preset mapping relationship comprises associations between the keywords and the user intent information.
  9. The instruction information acquisition method according to claim 7, further comprising:
    acquiring and displaying, according to the user intent information in the instruction information, an object acquisition path related to the object information of the target object; and/or
    acquiring and displaying, according to the user intent information in the instruction information, object detail information related to the object information of the target object.
  10. The instruction information acquisition method according to claim 2, wherein acquiring the object information of each object in the target image comprises:
    acquiring the object position of each object, a first predicted category of each object, and a first confidence corresponding to the first predicted category;
    acquiring a feature vector of each object according to the object position, and determining a second predicted category of each object and a second confidence corresponding to the second predicted category according to a second preset mapping relationship; and
    determining the object category of each object according to the first predicted category and the second predicted category, as well as the first confidence corresponding to the first predicted category and the second confidence corresponding to the second predicted category;
    wherein the second preset mapping relationship comprises associations between the feature vectors and the second predicted categories.
  11. The instruction information acquisition method according to claim 10, wherein acquiring the feature vector of each object according to the object position comprises:
    cropping the target image according to the object positions to obtain sub-target images corresponding to the objects; and
    performing feature extraction on the sub-target images to obtain the feature vectors of the objects.
  12. The instruction information acquisition method according to claim 11, wherein determining the object category of each object according to the first predicted category and the second predicted category, as well as the first confidence corresponding to the first predicted category and the second confidence corresponding to the second predicted category, comprises:
    determining whether the first predicted category is the same as the second predicted category;
    when the first predicted category is the same as the second predicted category, configuring the first predicted category or the second predicted category as the object category of the object; and
    when the first predicted category is different from the second predicted category, determining whether the first confidence is greater than the second confidence, and determining the object category of the object according to the determination result.
  13. The instruction information acquisition method according to claim 12, wherein determining the object category of each object according to the determination result comprises:
    when the first confidence is greater than the second confidence, configuring the first predicted category as the object category of the object; and
    when the first confidence is less than or equal to the second confidence, configuring the second predicted category as the object category of the object.
  14. The instruction information acquisition method according to claim 1, wherein performing feature information extraction on the target image comprises:
    performing information extraction on the target image to obtain image parameter information of the target image.
  15. The instruction information acquisition method according to claim 14, wherein generating the instruction information according to the text information and the feature information of the target image comprises:
    determining user intent information according to the text information; and
    generating parameter adjustment information according to the user intent information and the image parameter information.
  16. The instruction information acquisition method according to claim 15, further comprising:
    performing parameter adjustment on the target image according to the parameter adjustment information, and displaying the parameter-adjusted target image.
  17. The instruction information acquisition method according to claim 1, wherein generating the instruction information according to the text information and the feature information of the target image comprises:
    generating parameter adjustment information according to the text information and image parameter information of the target image;
    determining a target object according to object information of each object in the target image and the text information; and
    generating the instruction information according to the parameter adjustment information and the object information of the target object.
  18. The instruction information acquisition method according to claim 1, wherein there are a plurality of the target images; and
    generating the instruction information according to the text information and the feature information of the target images comprises:
    determining candidate instruction information corresponding to each of the target images according to the feature information of that target image and the text information;
    when the candidate instruction information corresponding to the target images is the same, configuring the candidate instruction information as the instruction information; and
    when the candidate instruction information corresponding to the target images is different, determining the instruction information according to the confidences corresponding to the pieces of candidate instruction information.
  19. An instruction information acquisition apparatus, comprising:
    an image information extraction module, configured to acquire a target image and perform information extraction on the target image to obtain feature information of the target image;
    a text information acquisition module, configured to acquire voice information and recognize text information corresponding to the voice information, wherein the voice information is information associated with the target image; and
    an instruction information generation module, configured to generate instruction information according to the text information and the feature information of the target image.
  20. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the instruction information acquisition method according to any one of claims 1 to 18.
  21. An electronic device, comprising:
    one or more processors; and
    a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the instruction information acquisition method according to any one of claims 1 to 18.
PCT/CN2022/077138 2021-03-18 2022-02-21 Instruction information acquisition method and apparatus, readable storage medium, and electronic device WO2022193911A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110292701.7A CN113031813A (en) 2021-03-18 2021-03-18 Instruction information acquisition method and device, readable storage medium and electronic equipment
CN202110292701.7 2021-03-18

Publications (1)

Publication Number Publication Date
WO2022193911A1

Family

ID=76471590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077138 WO2022193911A1 (en) 2021-03-18 2022-02-21 Instruction information acquisition method and apparatus, readable storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113031813A (en)
WO (1) WO2022193911A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113031813A (en) * 2021-03-18 2021-06-25 Oppo广东移动通信有限公司 Instruction information acquisition method and device, readable storage medium and electronic equipment
CN115482807A (en) * 2022-08-11 2022-12-16 天津大学 Detection method and system for voice interaction of intelligent terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389005A (en) * 2017-08-05 2019-02-26 富泰华工业(深圳)有限公司 Intelligent robot and man-machine interaction method
KR20190056174A (en) * 2017-11-16 2019-05-24 서울시립대학교 산학협력단 Robot system and control method thereof
CN110489746A (en) * 2019-07-31 2019-11-22 深圳市优必选科技股份有限公司 A kind of information extracting method, information extracting device and intelligent terminal
CN110730115A (en) * 2019-09-11 2020-01-24 北京小米移动软件有限公司 Voice control method and device, terminal and storage medium
CN111400523A (en) * 2018-12-14 2020-07-10 北京三星通信技术研究有限公司 Image positioning method, device, equipment and storage medium based on interactive input
CN113031813A (en) * 2021-03-18 2021-06-25 Oppo广东移动通信有限公司 Instruction information acquisition method and device, readable storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113031813A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
WO2017185630A1 (en) Emotion recognition-based information recommendation method and apparatus, and electronic device
WO2022193911A1 (en) Instruction information acquisition method and apparatus, readable storage medium, and electronic device
CN109189879B (en) Electronic book display method and device
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
WO2020019591A1 (en) Method and device used for generating information
WO2016187888A1 (en) Keyword notification method and device based on character recognition, and computer program product
JP6986187B2 (en) Person identification methods, devices, electronic devices, storage media, and programs
CN110740389B (en) Video positioning method, video positioning device, computer readable medium and electronic equipment
WO2022041830A1 (en) Pedestrian re-identification method and device
US20170337222A1 (en) Image searching method and apparatus, an apparatus and non-volatile computer storage medium
CN111491187B (en) Video recommendation method, device, equipment and storage medium
CN111209970A (en) Video classification method and device, storage medium and server
CN113515942A (en) Text processing method and device, computer equipment and storage medium
KR20200109239A (en) Image processing method, device, server and storage medium
CN109582825B (en) Method and apparatus for generating information
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN110516083B (en) Album management method, storage medium and electronic device
WO2022001600A1 (en) Information analysis method, apparatus, and device, and storage medium
CN113392687A (en) Video title generation method and device, computer equipment and storage medium
CN111984803B (en) Multimedia resource processing method and device, computer equipment and storage medium
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium
CN110837557B (en) Abstract generation method, device, equipment and medium
US20140136196A1 (en) System and method for posting message by audio signal
US20180039626A1 (en) System and method for tagging multimedia content elements based on facial representations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22770266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22770266

Country of ref document: EP

Kind code of ref document: A1