WO2022193911A1 - Instruction information acquisition method and apparatus, readable storage medium, and electronic device - Google Patents


Info

Publication number
WO2022193911A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
target image
target
category
instruction information
Application number
PCT/CN2022/077138
Other languages
French (fr)
Chinese (zh)
Inventor
金越
郭彦东
李亚乾
侯志刚
Original Assignee
Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Application filed by Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Publication of WO2022193911A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/041Digitisers, e.g. for touch screens or touch pads, characterised by the transducing means
    • G06F3/0416Control or interface arrangements specially adapted for digitisers

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and in particular, to a method for acquiring instruction information, a device for acquiring instruction information, a computer-readable storage medium, and an electronic device.
  • the mobile terminal can realize functions such as voice control and information query through a voice assistant, and can realize functions such as image information acquisition through a visual assistant.
  • It is difficult for existing voice assistants or visual assistants to accurately generate user instruction information, resulting in poor user experience.
  • the purpose of the present disclosure is to provide a method for acquiring instruction information, a device for acquiring instruction information, a computer-readable storage medium and an electronic device, thereby solving the problem of difficulty in accurately generating instruction information in the related art at least to a certain extent.
  • A method for acquiring instruction information, comprising: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image; acquiring voice information and identifying the text information corresponding to the voice information, wherein the voice information is information associated with the target image; and generating instruction information according to the text information and the feature information of the target image.
  • An instruction information acquisition device, comprising: an image information extraction module configured to acquire a target image and perform information extraction on the target image to obtain feature information of the target image; a text information acquisition module configured to acquire voice information and identify the text information corresponding to the voice information, wherein the voice information is information associated with the target image; and an instruction information generation module configured to generate instruction information according to the text information and the feature information of the target image.
  • A computer-readable medium on which a computer program is stored; when the program is executed by a processor, it implements the method for acquiring instruction information described in the foregoing embodiments.
  • An electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for acquiring instruction information described in the foregoing embodiments.
  • FIG. 1 schematically shows the system architecture of the present exemplary embodiment;
  • FIG. 2 schematically shows the electronic device of the present exemplary embodiment;
  • FIG. 3 schematically shows a flowchart of a method for acquiring instruction information according to an embodiment of the present disclosure;
  • FIG. 4 schematically shows a flowchart of a method for acquiring object information according to an embodiment of the present disclosure;
  • FIG. 5 schematically shows a flowchart of a method for determining an object category according to an embodiment of the present disclosure;
  • FIG. 6 schematically shows a flowchart of a method for generating instruction information according to an embodiment of the present disclosure;
  • FIG. 7 schematically shows a flowchart of determining a target object according to a matching result according to an embodiment of the present disclosure;
  • FIG. 8 schematically shows a flowchart of determining a target object from candidate objects according to an embodiment of the present disclosure;
  • FIG. 9 schematically shows a flowchart of a method for determining a target object according to an embodiment of the present disclosure;
  • FIG. 10 schematically shows a flowchart of another method for generating instruction information according to an embodiment of the present disclosure;
  • FIG. 11 schematically shows a flowchart of yet another method for generating instruction information according to an embodiment of the present disclosure;
  • FIG. 12 schematically shows a flowchart of a method for acquiring instruction information according to a specific embodiment of the present disclosure;
  • FIG. 13 schematically shows a flowchart of a method for acquiring instruction information according to another specific embodiment of the present disclosure;
  • FIG. 14 schematically shows a flowchart of a method for acquiring instruction information according to yet another specific embodiment of the present disclosure;
  • FIG. 16 schematically shows a block diagram of an apparatus for acquiring instruction information according to an embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • a mobile terminal is installed with an application program with functions of a visual assistant and a voice assistant.
  • The visual assistant mainly captures the visual information of the user's environment, analyzes the visual information presented in the form of pictures or videos, understands the user's environment, the objects in it, and the relationships between objects, further understands the user's intention, and provides users with reasonable recommendations.
  • the voice assistant mainly captures the user's voice information, converts the voice information into text, further analyzes the user's intention, and realizes intelligent interaction with the user.
  • When the visual assistant analyzes visual information presented in the form of pictures or videos, it may judge the user's intention inaccurately, or, when there are multiple objects in the picture or video, judge inaccurately which object the user is most concerned about.
  • Due to noisy ambient background sounds or outdated equipment, resulting in unclear audio capture or unclear meaning of the user's speech, it is difficult for voice assistants to accurately analyze user intent.
  • an embodiment of the present disclosure first provides a method for acquiring instruction information, and the method for acquiring instruction information is applied to the system architecture of the exemplary embodiment of the present disclosure.
  • FIG. 1 shows a schematic diagram of a system architecture according to an exemplary embodiment of the present disclosure.
  • the system architecture 100 may include: a terminal 110 , a network 120 and a server 130 .
  • the terminal 110 may be various electronic devices with image capturing functions and audio capturing functions, including but not limited to mobile phones, tablet computers, digital cameras, personal computers, and the like.
  • the medium used by the network 120 to provide a communication link between the terminal 110 and the server 130 may include various connection types, such as wired, wireless communication links, or fiber optic cables.
  • the numbers of terminals, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminals, networks and servers according to implementation needs.
  • the server 130 may be a server cluster composed of multiple servers, or the like.
  • the method for acquiring instruction information provided by the embodiment of the present disclosure may be executed by the terminal 110, for example, after the terminal 110 acquires the voice information and the target image, the instruction information is generated.
  • The method for acquiring instruction information provided by the embodiments of the present disclosure may also be executed by the server 130. For example, after the terminal 110 obtains the voice information and the target image, the voice information and the target image are uploaded to the server 130, so that the server 130 can generate the instruction information, which is not limited in the present disclosure.
  • Exemplary embodiments of the present disclosure provide an electronic device for implementing a method for acquiring instruction information, which may be the terminal 110 or the server 130 in FIG. 1 .
  • the electronic device includes at least a processor and a memory, the memory is used for storing executable instructions of the processor, and the processor is configured to execute the instruction information acquisition method by executing the executable instructions.
  • Electronic devices can be implemented in various forms, including mobile devices such as mobile phones, tablet computers, notebook computers, personal digital assistants (PDAs), navigation devices, wearable devices and drones, as well as fixed devices such as desktop computers and smart TVs.
  • The following takes the mobile terminal 200 in FIG. 2 as an example to illustrate the structure of the electronic device. It will be understood by those skilled in the art that, apart from components specifically intended for mobile purposes, the configuration in FIG. 2 can also be applied to fixed-type devices.
  • the mobile terminal 200 may include more or fewer components than shown, or combine some components, or separate some components, or different component arrangements.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the interface connection relationship between the components is only schematically shown, and does not constitute a structural limitation of the mobile terminal 200 .
  • the mobile terminal 200 may also adopt an interface connection manner different from that in FIG. 2 , or a combination of multiple interface connection manners.
  • The mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, a headphone jack 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, buttons 294, a Subscriber Identification Module (SIM) card interface 295, and so on.
  • the sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyro sensor 2803, an air pressure sensor 2804, and the like.
  • The mobile terminal 200 implements a display function through a graphics processing unit (GPU), a display screen 290, an application processor, and the like.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering, and connects the display 290 and the application processor.
  • Processor 210 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the mobile terminal 200 may include one or more display screens 290 for displaying images, videos, and the like.
  • The mobile terminal 200 may implement a shooting function through an image signal processor (ISP), a camera module 291, an encoder, a decoder, a GPU, a display screen 290, an application processor, and the like.
  • the camera module 291 is used to capture still images or videos, collect light signals through photosensitive elements, and convert them into electrical signals.
  • the ISP is used to process the data fed back by the camera module 291 and convert the electrical signal into a digital image signal.
  • The mobile terminal 200 may implement audio functions, such as music playback and recording, through an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, an application processor, and the like.
  • The audio module 270 is used for converting digital audio information into an analog audio signal for output, and for converting an analog audio input into a digital audio signal. The audio module 270 may also be used to encode and decode audio signals.
  • the speaker 271 is used for converting audio electrical signals into sound signals.
  • the receiver 272 is used for converting audio electrical signals into sound signals.
  • the microphone 273 is used to convert the sound signal into an electrical signal.
  • the earphone interface 274 is used for connecting wired earphones.
  • the keys 294 include a power-on key, a volume key, and the like.
  • the keys 294 may be mechanical keys or touch keys.
  • The mobile terminal 200 may receive key inputs and generate key signal inputs related to user settings and function control of the mobile terminal 200.
  • FIG. 3 shows a schematic flowchart of a method for acquiring instruction information.
  • The method for acquiring instruction information includes at least the following steps: Step S310: acquire a target image and perform information extraction on the target image to obtain feature information of the target image; Step S320: acquire voice information and identify text information corresponding to the voice information, wherein the voice information is information associated with the target image; Step S330: generate instruction information according to the text information and the feature information of the target image.
  • The method for acquiring instruction information in the present disclosure can fuse the feature information of the target image and the voice information associated with the target image to generate instruction information, improving the accuracy of the instruction information and further enhancing the user's interaction experience with the mobile terminal.
  • the following describes a method for acquiring instruction information.
  • In step S310, a target image is acquired and information extraction is performed on the target image to obtain feature information of the target image.
  • the target image may be an image captured in real time by a camera function of the mobile terminal, or may be a local image stored in the mobile terminal.
  • the user can send an image acquisition request to the mobile terminal, and the mobile terminal determines the target image according to the image acquisition request.
  • the image acquisition request may be a shooting request, and the mobile terminal responds to the shooting request and activates the shooting function to collect the target image in real time.
  • The shooting request may be the user triggering a shooting button on the mobile terminal, for example, the user clicking a camera icon on the mobile terminal; the shooting request may also be the user waking up the shooting function of the mobile terminal through a preset voice.
  • the image acquisition request may also be an image selection request.
  • the mobile terminal responds to the image selection request and displays a local image, and responds to a user's trigger operation on the local image, and determines the target image in the local image according to the trigger operation.
  • There may be one or more target images.
  • the mobile terminal enables the shooting function, collects the target video through the camera module, acquires a video frame in the target video every preset time period, and uses the acquired multiple video frames as the target image.
  • the preset time period may be set according to the actual situation, for example, a video frame may be acquired every 30ms in the target video, which is not specifically limited in the present disclosure.
  • the feature information of the target image includes object information of each object in the target image and/or image parameter information of the target image.
  • the object information includes object category and object position
  • the image parameter information may include parameter information such as image brightness, chroma, contrast, saturation or sharpness.
  • object extraction is performed on the target image, and object information of each object in the target image is acquired.
  • the object information includes object category and object location.
  • object extraction is performed on the target image through a target detection model or an image segmentation model, and the object category and object position of each object in the target image are acquired.
  • the target detection model can be Faster R-CNN model, RetinaNet model or YOLO model, etc.
  • the image segmentation model can be DeepLab-V3 model, RefineNet model or PSPNet model, etc.
  • the saliency detection model can also be used to extract objects from the target image to obtain the object positions of each object in the target image.
  • the saliency detection model may be a saliency detection model based on a spectral residual method, or a saliency detection model based on global contrast, which is not specifically limited in the present disclosure.
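  • As a minimal sketch of this object extraction step (assuming PyTorch and torchvision are available; the model choice, score threshold, and helper names are illustrative, not the disclosure's implementation):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Any of the detectors named above could stand in for this pretrained model.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def extract_objects(image_path, score_threshold=0.5):
    """Return (category_id, box, confidence) for each detected object.
    Boxes follow the (x_start, y_start, x_end, y_end) convention."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]
    return [(int(label), box.tolist(), float(score))
            for label, box, score in zip(output["labels"],
                                         output["boxes"],
                                         output["scores"])
            if score >= score_threshold]
```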
  • the object category of each object is determined according to the object position of each object.
  • The detailed process of determining the object category of each object is as follows: first, the target image is cropped according to the object position of each object to obtain the sub-target image corresponding to each object; then, feature extraction is performed on the sub-target image corresponding to each object to obtain the feature vector of each object; finally, the second predicted category corresponding to the feature vector of each object is determined according to the second preset mapping relationship, and the second predicted category corresponding to the feature vector of each object is configured as the object category of each object.
  • The second preset mapping relationship includes the association relationship between the feature vector and the second predicted category.
  • a plurality of target image samples are acquired in advance, and a binary classification model, a target detection model, an image segmentation model, and a saliency detection model are respectively trained according to the plurality of target image samples.
  • the target image sample can be an image marked with a rectangular frame or a mask.
  • whether there is an object in the target image can be determined by the pixel value of each pixel in the target image. If the pixel value of each pixel in the target image is the same, it is determined that there is no object in the target image, and if the pixel value of each pixel in the target image is different, it is determined that one or more objects exist in the target image. In addition, whether there is an object in the target image can also be determined according to the binary classification model. When an object exists in the target image, the target image is subjected to object extraction, and the object information of each object is obtained. In addition, it can also be pre-determined whether there is an object in the picture captured by the camera module, and when there is an object in the picture, the target image or the target video can be acquired in real time.
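  • A sketch of the pixel-value check described above (NumPy is assumed; in practice the learned binary classification model would replace this heuristic):

```python
import numpy as np
from PIL import Image

def image_has_object(image_path):
    """If every pixel of the target image has the same value, treat the
    image as containing no object; any variation implies at least one object."""
    pixels = np.asarray(Image.open(image_path).convert("L"))
    return pixels.min() != pixels.max()
```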
  • FIG. 4 shows a schematic flowchart of a method for acquiring object information.
  • the process may include at least steps S410 to S430, and the details are as follows:
  • In step S410, the object position of each object, the first predicted category of each object, and the first confidence level corresponding to the first predicted category are acquired.
  • the target image is input into the target detection model or the image segmentation model to obtain the object position of each object in the target image, the first predicted category of each object, and the first predicted category corresponding to the first predicted category. a confidence level.
  • the object position of the object may include the position coordinates of the object in the target image.
  • The object position may be the position coordinate set of the detection frame where the object is located, where the position coordinate set includes the starting and ending coordinates in the horizontal direction and the starting and ending coordinates in the vertical direction; the object position may also be the starting coordinate point of the detection frame where the object is located together with the size of the detection frame, where the starting coordinate point includes the starting coordinates in the horizontal and vertical directions, and the size of the detection frame includes the sizes in the horizontal and vertical directions.
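  • The two position representations are interchangeable; a sketch of the conversion (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class DetectionFrame:
    x_start: float
    y_start: float
    x_end: float
    y_end: float

    @classmethod
    def from_start_and_size(cls, x_start, y_start, width, height):
        # (start point, size) form -> (start, end) coordinate-set form.
        return cls(x_start, y_start, x_start + width, y_start + height)
```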
  • the first confidence level represents the probability that the first predicted category of the object is the real object category of the object.
  • In step S420, the feature vector of each object is obtained according to the object position, and the second predicted category of each object and the second confidence level corresponding to the second predicted category are determined according to the second preset mapping relationship.
  • the target image is cropped according to the position of the object to obtain sub-target images corresponding to each object; and feature extraction is performed on the sub-target images to obtain feature vectors of each object.
  • the sub-target image is input into the feature extraction model to obtain the feature vector corresponding to the sub-target image.
  • The feature extraction model may be a color histogram model, through which the color features of the sub-target image are extracted; a local binary pattern (LBP) model or a gray-level co-occurrence matrix model, through which the local texture features of the sub-target image are extracted; or a Canny or Sobel operator edge detection model, through which the edge features of the sub-target image are extracted; and so on.
  • the feature extraction model may also be a combination of two or more of the color histogram model, the LBP model or the gray level co-occurrence matrix model, the Canny operator edge detection model or the Sobel operator edge detection model.
  • the feature vector of each object is constructed by one or more features of the color feature, local texture feature, and edge feature of the sub-target image.
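  • A minimal sketch of such a hand-crafted feature vector (assuming OpenCV and NumPy; the chosen features and bin counts are illustrative):

```python
import cv2
import numpy as np

def object_feature_vector(sub_image):
    """Concatenate a color histogram and an edge statistic into one feature vector."""
    # Color feature: a joint 8x8x8 histogram, normalized to be size-invariant.
    hist = cv2.calcHist([sub_image], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256]).flatten()
    hist /= hist.sum() + 1e-8
    # Edge feature: Canny edge density as a crude texture/shape descriptor.
    gray = cv2.cvtColor(sub_image, cv2.COLOR_BGR2GRAY)
    edge_density = cv2.Canny(gray, 100, 200).mean() / 255.0
    return np.concatenate([hist, [edge_density]])
```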
  • the sub-target image samples are acquired in advance, and the feature extraction model is trained by using the sub-target image samples.
  • the sub-target image sample is an image including only a single object.
  • the second preset mapping relationship includes an association relationship between the feature vector and the second prediction category.
  • the feature vector of the object is respectively matched with one or more feature vectors in the second preset mapping relationship, and the matching degree between the feature vector of the object and the feature vector in the second preset mapping relationship is obtained.
  • the second predicted category corresponding to the feature vector in the second preset mapping relationship with the largest matching degree is configured as the second predicted category of the object, and the matching degree is configured as the second confidence degree.
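  • The matching against the second preset mapping relationship can be sketched with cosine similarity (the disclosure does not fix a particular similarity measure; this choice is an assumption):

```python
import numpy as np

def match_second_category(feature, mapping):
    """mapping: list of (reference_vector, category) pairs forming the
    second preset mapping relationship. Returns the best category and its
    matching degree, which doubles as the second confidence level."""
    best_category, best_degree = None, -1.0
    for reference, category in mapping:
        degree = float(np.dot(feature, reference) /
                       (np.linalg.norm(feature) * np.linalg.norm(reference) + 1e-8))
        if degree > best_degree:
            best_category, best_degree = category, degree
    return best_category, best_degree
```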
  • In step S430, the object category of each object is determined according to the first predicted category and the second predicted category, as well as the first confidence level corresponding to the first predicted category and the second confidence level corresponding to the second predicted category.
  • FIG. 5 shows a schematic flowchart of a method for determining an object category. As shown in FIG. 5, the flow includes at least steps S510 to S530. The details are as follows: In step S510, it is determined whether the first predicted category and the second predicted category are the same.
  • the category identifiers corresponding to the first predicted category and the second predicted category may be compared; if the category identifier corresponding to the first predicted category is the same as the category identifier corresponding to the second predicted category, then It is determined that the first predicted category is the same as the second predicted category; if the category identifier corresponding to the first predicted category is different from the category identifier corresponding to the second predicted category, it is determined that the first predicted category is different from the second predicted category.
  • In step S520, when the first predicted category is the same as the second predicted category, the first predicted category or the second predicted category is configured as the object category of each object.
  • the first predicted category or the second predicted category may be configured as the object category of each object.
  • In step S530, when the first predicted category is different from the second predicted category, it is determined whether the first confidence level is greater than the second confidence level, and the object category of each object is determined according to the judgment result.
  • When the first confidence level is greater than the second confidence level, the first predicted category is configured as the object category of each object; when the first confidence level is less than or equal to the second confidence level, the second predicted category is configured as the object category of each object.
  • It may first be determined whether the first confidence level corresponding to the first predicted category and the second confidence level corresponding to the second predicted category are greater than or equal to a confidence threshold. If the first confidence level is greater than or equal to the confidence threshold and the second confidence level is less than the confidence threshold, the first predicted category is used as the object category of the object; if the second confidence level is greater than or equal to the confidence threshold and the first confidence level is less than the confidence threshold, the second predicted category is used as the object category of the object; if both the first confidence level and the second confidence level are greater than or equal to the confidence threshold, the object category of each object is determined according to the above embodiment; if both the first confidence level and the second confidence level are less than the confidence threshold, the object and the object information corresponding to the object can be discarded.
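  • Putting steps S510 to S530 and the thresholding together, a sketch of the fusion rule (the threshold value is illustrative):

```python
def fuse_categories(first_category, first_conf, second_category, second_conf,
                    threshold=0.5):
    """Combine the detector's prediction with the feature-matching prediction.
    Returns None when both confidences fall below the threshold (discard)."""
    first_ok, second_ok = first_conf >= threshold, second_conf >= threshold
    if not first_ok and not second_ok:
        return None                          # discard the object
    if first_ok and not second_ok:
        return first_category
    if second_ok and not first_ok:
        return second_category
    if first_category == second_category:    # step S520: predictions agree
        return first_category
    # Step S530: predictions differ, so the higher confidence wins.
    return first_category if first_conf > second_conf else second_category
```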
  • information extraction may be performed on the target image to obtain image parameter information of the target image.
  • the image parameter information may include parameter information such as image brightness, chroma, contrast, saturation or sharpness.
  • the image parameter information of the target image can be determined according to the shooting parameters by acquiring the shooting parameters of the camera module, or the image parameter information can be determined according to the EXIF information by acquiring the exchangeable image file information (EXIF information) of the target image.
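  • For example, with Pillow the EXIF record of the target image can be read directly (a sketch; which tags, such as "BrightnessValue" or "ExposureTime", are actually present depends on the camera):

```python
from PIL import Image, ExifTags

def read_exif(image_path):
    """Return the EXIF tags of the image as a {tag_name: value} dict."""
    exif = Image.open(image_path).getexif()
    return {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in exif.items()}
```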
  • In step S320, voice information is acquired and the text information corresponding to the voice information is recognized, wherein the voice information is information associated with the target image.
  • the voice information may be information associated with the target image. That is to say, in the time period when the mobile terminal acquires the target image or the time period when the mobile terminal displays the target image, the recording function can be enabled to collect the user's voice information in real time. For example, in response to a user's video shooting request, the mobile terminal enables the shooting function and the recording function at the same time, obtains the target video, and obtains multiple target images and voice information in the target video.
  • the video shooting request may be formed by a user's triggering operation on the camera, or may be formed by a user's triggering operation on the scanning function in the smart assistant.
  • The mobile terminal may obtain the function authority of the intelligent assistant in advance, so that when the user triggers the scanning function of the intelligent assistant, the camera function and the recording function are enabled.
  • the mobile terminal acquires the target image in advance, and stores the target image in the internal memory or the external memory.
  • the mobile terminal acquires the target image in the internal memory or external memory according to the user's request and displays it on the display screen.
  • the recording function is enabled through the user's recording request, and the user's voice information is collected in real time.
  • The video shooting request and the recording request may be formed by the user triggering the video shooting button or the recording button on the mobile terminal. For example, when the user triggers the camera icon or the recording icon on the mobile terminal, the mobile terminal starts the video shooting function or the recording function.
  • The video shooting request and the recording request may also be formed by the user waking up the video shooting function or the recording function of the mobile terminal through a preset voice, where the preset voice may be voice information set by the user or preset by the mobile terminal; the preset voice information is not specifically limited in this disclosure.
  • the voice information is preprocessed.
  • the preprocessing may include frame segmentation processing, windowing processing, pre-emphasis processing, and the like.
  • pre-emphasis is performed on the speech sequence corresponding to the speech information to increase the high-frequency resolution of the speech sequence; the pre-emphasized speech sequence is then framed to obtain multiple speech subsequences;
  • the speech subsequences are subjected to windowing processing, and the windowing processing may include multiplying each speech subsequence with a window function, wherein the window function may be a rectangular window, a Hamming window, or a Hanning window.
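  • A minimal NumPy sketch of this preprocessing chain (the coefficient, frame length, and hop are common illustrative values, e.g. 25 ms frames with a 10 ms hop at 16 kHz):

```python
import numpy as np

def preprocess_speech(speech, frame_length=400, hop=160, alpha=0.97):
    """Pre-emphasize, frame, and window a speech sequence
    (assumes len(speech) >= frame_length)."""
    # Pre-emphasis raises high-frequency resolution: y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(speech[0], speech[1:] - alpha * speech[:-1])
    # Frame segmentation into overlapping subsequences.
    n_frames = 1 + (len(emphasized) - frame_length) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_length]
                       for i in range(n_frames)])
    # Windowing: multiply each subsequence by a Hamming window.
    return frames * np.hamming(frame_length)
```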
  • voice feature extraction is performed on the preprocessed voice information to obtain voice features corresponding to the voice information.
  • The characteristic parameters of the speech information include the Mel frequency cepstrum coefficient (MFCC), the linear prediction cepstrum coefficient (LPCC), the line spectral frequency (LSF), the wavelet transform coefficient (WTC), and so on.
  • the voice feature extraction can extract one or more feature parameters of the voice information, and use the one or more feature parameters as the voice features corresponding to the voice information.
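  • As a sketch of this feature extraction step (assuming the librosa library; parameter choices are illustrative):

```python
import librosa

def voice_features(audio_path, n_mfcc=13):
    """Extract MFCCs as the voice feature matrix (n_mfcc x frames)."""
    signal, sample_rate = librosa.load(audio_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
```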
  • the voice feature can be matched with the voice feature template to obtain text information corresponding to the voice feature.
  • The voice features corresponding to the voice information can be matched with a plurality of voice feature samples in the voice feature template respectively; when the voice features corresponding to the voice information match a voice feature sample of the voice feature template, the text information sample corresponding to that voice feature sample is configured as the text information corresponding to the voice information.
  • the voice feature template includes multiple voice feature samples and text information samples corresponding to each voice feature sample.
  • The process of constructing the voice feature template includes: first, acquiring a plurality of text information samples, and acquiring the voice information corresponding to the text information samples; then, acquiring the voice feature samples corresponding to the voice information of the text information samples according to the above process; and finally, constructing the voice feature template from the mapping relationship between the text information samples and their corresponding voice feature samples.
  • When acquiring or displaying the target image or target video, the recording function is enabled, and the recording function is used to determine whether voice information exists and, when it does, to collect the voice information.
  • word segmentation processing is performed on the text information to obtain one or more keywords.
  • The word segmentation process can include the following two methods. The first is dictionary-based word segmentation: the text information is divided into multiple words according to the dictionary, and the multiple words are then combined. A dictionary can be pre-built, and the words in the dictionary can be marked with different parts of speech.
  • the part of speech of each keyword is also obtained according to the part of speech of each word in the dictionary.
  • word segmentation processing can also be performed on the text information by using a dictionary that does not mark parts of speech, and after the word segmentation processing, part-of-speech recognition is performed on each keyword.
  • the keywords corresponding to the text information may include entity participles, descriptive participles, verb participles, etc. according to different parts of speech.
  • the entity participle represents a real object or a word that refers to a real object, such as a noun participle, a pronoun participle, which can be "flower", “clothes", “you”, etc.; the description participle represents the relationship between objects or is used for Words that describe items, such as adjective participles and adverbial participles, can be "left side", “beautiful", "so dark” and so on.
  • The second is character-based segmentation: the text information is divided into individual characters, and the characters are then combined into words according to the dictionary.
  • a word segmentation algorithm based on statistics can also be used to perform word segmentation processing on the text information, and the present disclosure does not specifically limit the word segmentation processing algorithm.
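  • As a sketch of dictionary-based word segmentation with part-of-speech tags (assuming the jieba library as one such segmenter; the tag grouping is illustrative):

```python
import jieba.posseg as pseg

def extract_keywords(text):
    """Split text into entity, description, and verb participles by part of speech."""
    entity, description, verbs = [], [], []
    for word, flag in pseg.cut(text):
        if flag.startswith(("n", "r")):    # nouns and pronouns -> entity participles
            entity.append(word)
        elif flag.startswith(("a", "d")):  # adjectives and adverbs -> description participles
            description.append(word)
        elif flag.startswith("v"):         # verbs -> verb participles
            verbs.append(word)
    return entity, description, verbs
```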
  • In step S330, instruction information is generated according to the text information and the feature information of the target image.
  • the feature information of the target image is the object category and object position of each object in the target image.
  • FIG. 6 shows a schematic flowchart of a method for generating instruction information. As shown in FIG. 6, the process includes at least steps S610 to S620. The details are as follows: In step S610, the target object is determined according to the matching result of each piece of object information and the text information.
  • The object category and object position of each object are respectively matched with the entity participles and description participles in the text information; when the object category and object position of an object match the entity participles and description participles in the text information, the object is determined as the target object.
  • FIG. 7 shows a schematic flowchart of a method for determining a target object according to a matching result.
  • the process includes at least steps S710 to S740, and the details are as follows:
  • In step S710, the object topological relationship is determined according to the object category and the object position.
  • the object position includes the position coordinates of each object in the target image, and the position coordinates of any two objects are subtracted to obtain the relative position relationship between the two objects.
  • the object category of each object is used as a label, and the object topology relationship is generated according to the relative positional relationship between the objects.
  • the object topological relationship includes the object type of each object, the object position of each object, and the relative positional relationship between each object.
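  • A sketch of building such an object topological relationship (the relative position is expressed as a center-to-center offset; the representation is illustrative):

```python
def object_topology(objects):
    """objects: list of (category, (x_start, y_start, x_end, y_end)).
    Returns {(i, j): (dx, dy)} offsets between object centers; the indices
    keep each object's label available via objects[i][0]."""
    centers = [((b[0] + b[2]) / 2, (b[1] + b[3]) / 2) for _, b in objects]
    topology = {}
    for i, (ax, ay) in enumerate(centers):
        for j, (bx, by) in enumerate(centers):
            if i != j:
                # Negative dx: object i is left of object j; negative dy: above it.
                topology[(i, j)] = (ax - bx, ay - by)
    return topology
```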
  • In step S720, the object category matching the entity participles is determined as the target object category.
  • an entity word in the text information is matched with the object category of each object, and when the object category of the object matches the entity word, the object category of the object is determined as the target object category.
  • The entity participles in the voice information are used to screen the multiple object categories; if one or more object categories appear in the text information corresponding to the voice information, the one or more object categories are determined as the target object categories.
  • In step S730, candidate objects are determined among the objects according to the target object category.
  • one or more objects may correspond to the same object category.
  • The target object category obtained by the above screening can be used to further screen the multiple objects in the target image; when the object category corresponding to an object is the target object category, the object is determined as a candidate object.
  • In step S740, the target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description participles.
  • The object topological relationship corresponding to each candidate object is determined from the object topological relationship, and the description participles in the text information are matched with the object topological relationship corresponding to each candidate object; when the description participles match the object topological relationship corresponding to a candidate object, the candidate object is determined as the target object.
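  • A sketch of the screening in steps S720 to S740 (treating a category that appears among the entity participles as a match, and 'left' as one example description relation; both are assumptions):

```python
def screen_candidates(objects, entity_participles):
    """Steps S720/S730: keep objects whose category appears among the entity participles."""
    return [i for i, (category, _) in enumerate(objects)
            if category in entity_participles]

def match_description(candidates, topology, description_participles):
    """Step S740 (illustrative): 'left' selects the candidate lying to the
    left of every other candidate."""
    if "left" in description_participles and len(candidates) > 1:
        for i in candidates:
            if all(topology[(i, j)][0] < 0 for j in candidates if j != i):
                return i
    return candidates[0] if candidates else None
```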
  • step S710 may be performed before step S720, may be performed after step S730, or may be performed simultaneously with step S720 and step S730, which is not specifically limited in the present disclosure.
  • line-of-sight information can also be acquired and a gaze position corresponding to the line-of-sight information can be determined.
  • the gaze position may be a gaze point or a gaze area on a two-dimensional plane.
  • The line-of-sight information is generated for the target image and is acquired in real time in the process of shooting the target image. Since there may be multiple target images, there may also be multiple pieces of line-of-sight information, and the correspondence between the target images and the line-of-sight information is determined according to the shooting time of the target images and the acquisition time of the line-of-sight information.
  • the user's line-of-sight information for the target image can be obtained through the camera module or smart screen of the mobile terminal.
  • the user's line-of-sight information for the target image can be obtained in real time through the built-in camera module in the smart helmet or glasses.
  • the line of sight information may also include a left eye image, a right eye image, a face image, and a face position.
  • the face image may provide head posture information, and the face position may provide eye position information.
  • The line-of-sight information is used as input to a gaze point estimation algorithm, which determines the gaze point corresponding to the line-of-sight information; the face image and the face position may also be used as input to determine the corresponding gaze area, and so on.
  • the present disclosure does not specifically limit the acquisition of the gaze position.
  • the target object determined in the above embodiment can be more accurately screened by using the gaze area corresponding to the sight line information, so as to determine the target object that the user pays most attention to.
  • FIG. 8 shows a schematic flowchart of a method for determining a target object from candidate objects. As shown in FIG. 8, the flow includes at least steps S810 to S830. The details are as follows: In step S810, candidate target objects are determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description participles. In an exemplary embodiment of the present disclosure, if, in the topological relationship of the candidate objects, there are multiple candidate objects matching the description participles, the multiple candidate objects are used as candidate target objects.
  • In step S820, the object position of each candidate target object is matched with the gaze position.
  • The object position of each candidate target object is obtained and matched with the gaze position. If the gaze position is a gaze point, it is determined whether the gaze point is within the detection frame determined by the object position of each candidate target object; if the gaze position is a gaze area, the degree of coincidence between the gaze area and the detection frame determined by the object position of each candidate target object is calculated.
  • In step S830, when the object position of a candidate target object matches the gaze position, the candidate target object is determined as the target object.
  • If the gaze point is within the detection frame corresponding to a candidate target object, it is determined that the object position of the candidate target object matches the gaze position, and the candidate target object is determined as the target object.
  • The candidate target object corresponding to the detection frame with the greatest degree of coincidence with the gaze area may also be obtained, and that candidate target object may be determined as the target object.
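  • A sketch of both checks (boxes follow the (x_start, y_start, x_end, y_end) convention used above):

```python
def gaze_point_in_frame(point, frame):
    """True if the gaze point lies inside the detection frame."""
    x, y = point
    return frame[0] <= x <= frame[2] and frame[1] <= y <= frame[3]

def coincidence(gaze_area, frame):
    """Degree of coincidence between a gaze area and a detection frame,
    both given as rectangles."""
    width = min(gaze_area[2], frame[2]) - max(gaze_area[0], frame[0])
    height = min(gaze_area[3], frame[3]) - max(gaze_area[1], frame[1])
    return max(0.0, width) * max(0.0, height)

def pick_target(candidate_frames, gaze_area):
    """Step S830: the candidate whose detection frame coincides most with the gaze area."""
    return max(candidate_frames, key=lambda frame: coincidence(gaze_area, frame))
```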
  • After acquiring the object information of each object in the target image, the text information corresponding to the voice information, and the gaze position corresponding to the line-of-sight information, it is also possible to first determine candidate objects among the objects according to the object information of each object and the gaze position, and then determine the target object from the candidate objects according to the text information.
  • FIG. 9 shows a schematic flowchart of a method for determining a target object.
  • the process includes at least steps S910 to S930, and the details are as follows:
  • In step S910, the object position matching the gaze position is determined as the target object position.
  • The object positions of the objects in the target image are respectively matched with the gaze position, and when the object position of an object matches the gaze position, the object position of the object is determined as the target object position. Since the object positions of different objects may have overlapping areas, the gaze position may match multiple object positions, and there may be multiple determined target object positions.
  • In step S920, candidate objects are determined among the objects according to the target object position.
  • an object corresponding to the target object position is determined as a candidate object.
  • In step S930, the object information of each candidate object is matched with the text information, and the target object is determined according to the matching result.
  • the above-mentioned embodiments screen a plurality of objects according to the gaze position, and determine candidate objects. After the candidate object is determined, the object information of the candidate object is matched with the text information, and the target object is determined in the candidate object.
  • First, the object category of each candidate object that matches the entity participles in the text information is determined as the target object category; then, candidate objects are further screened among the candidate objects according to the target object category; finally, the target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description participles in the text information.
  • the object topological relationship may be determined according to the object category and object position of each object, and the object topological relationship corresponding to the candidate object is determined in the object topological relationship of each object.
  • the object topological relationship of the candidate object may also be determined according to the object category and object position of the candidate object, and the object topological relationship corresponding to the candidate object is determined in the object topological relationship of the candidate object.
  • the detailed process of determining the target object in the candidate object according to the object information and text information of the candidate object is as described in the method embodiment of FIG. 7 above, and will not be repeated here.
  • The information of three modalities, namely the line-of-sight information, the voice information, and the feature information of the target image, is fused to determine the instruction information, which further improves the accuracy of the instruction information determined from the voice information and the feature information of the target image.
  • In step S620, instruction information is generated according to the object information of the target object.
  • FIG. 10 shows a schematic flowchart of another method for generating instruction information.
  • the process includes at least steps S1010 to S1020, and the details are as follows:
  • In step S1010, the user intent information is determined according to the text information.
  • The user intent information may be identified through the text information: word segmentation processing is performed on the text information to obtain one or more keywords, and the user intent information corresponding to the keywords is determined according to the first preset mapping relationship.
  • One or more keywords corresponding to the text information are respectively matched with the keywords in the first preset mapping relationship, and the user intent information corresponding to the matching keywords in the first preset mapping relationship is obtained. The verb participles and/or adjective participles in the text information can also be matched against the first preset mapping relationship, so as to improve the efficiency of obtaining the user intent information.
  • the first preset mapping relationship includes an association relationship between keywords and user intent information.
  • One keyword may correspond to multiple user intent information, and one user intent information may also correspond to multiple keywords.
  • For example, for some keywords the corresponding user intent information may be "obtain a purchase link"; if the keyword is "what", the corresponding user intent information may be "inquire about detailed information, obtain a purchase link"; if the keyword is "too dark, hard to see", the corresponding user intent information may be "adjust the brightness of the image, adjust the contrast of the image", and so on.
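  • A sketch of such a first preset mapping relationship (entries mirror the examples above; a real mapping would be much larger):

```python
# Hypothetical keyword -> intent table; one keyword may map to several intents
# and one intent may be reached from several keywords.
FIRST_PRESET_MAPPING = {
    "what": ["inquire about detailed information", "obtain a purchase link"],
    "too dark": ["adjust the brightness of the image",
                 "adjust the contrast of the image"],
}

def user_intent(keywords):
    """Collect the intent information of every keyword found in the mapping."""
    intents = []
    for keyword in keywords:
        intents.extend(FIRST_PRESET_MAPPING.get(keyword, []))
    return intents
```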
  • In step S1020, instruction information is generated according to the object information of the target object and the user intent information.
  • a sub-target image corresponding to the target object is obtained according to object information of the target object, and instruction information is generated according to the object category of the target object, the sub-target image of the target object, and user intent information.
  • an object acquisition path related to the object information of the target object is acquired and displayed according to the user intent information in the instruction information.
  • the object acquisition path of the target object can be queried according to the object category of the target object and/or the sub-target image of the target object, and the object acquisition path can be displayed on the display screen of the mobile terminal.
  • the object category and/or sub-target image of the target object can be input into the purchase platform, and the purchase link returned by the purchase platform can be obtained.
  • The object detail information related to the object information of the target object can also be acquired and displayed according to the user intent information in the instruction information, or both the object acquisition path and the object detail information related to the object information of the target object can be acquired and displayed according to the user intent information in the instruction information.
  • This scheme integrates the feature information of the target image and the text information corresponding to the voice information to obtain the instruction information; then, according to the instruction information, information that the user is interested in is recommended to the user.
  • the present exemplary embodiment can more accurately determine user instruction information, thereby providing users with more accurate recommendation information, and improving the interaction experience between the user and the mobile terminal.
  • the user intent information is determined according to the text information
  • the parameter adjustment information is generated according to the user intent information and the image parameter information.
  • The instruction information at this time may be parameter adjustment information.
  • The mobile terminal may adjust the parameters of the target image according to the parameter adjustment information, and display the parameter-adjusted target image on the display screen.
  • the image parameter information corresponding to the target image is "brightness value of 65”
  • the text information corresponding to the user's voice information is "shooting is so dark”
  • the user intent information identified according to the text information is "improve the brightness of the image”.
  • the instruction information generated according to the instruction information acquisition method in the above-mentioned embodiment may be "adjust the brightness of the target image, and increase the brightness value of the target image to 65+N".
  • N is a positive integer, and the value of N can be set according to actual scenarios, which is not specifically limited in the present disclosure.
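  • A sketch of carrying out such parameter adjustment information (Pillow's ImageEnhance is one way to do it; mapping the brightness value 65 to an enhancement factor is an assumption):

```python
from PIL import Image, ImageEnhance

def increase_brightness(image_path, current_value=65, n=10):
    """Raise the image brightness from current_value to current_value + n
    by scaling with the ratio of the two values (factor > 1 brightens)."""
    image = Image.open(image_path)
    factor = (current_value + n) / current_value
    return ImageEnhance.Brightness(image).enhance(factor)
```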
  • the feature information of the target image includes object information of each object in the target image, and image parameter information of the target image.
  • the parameter adjustment information can be generated according to the text information and the image parameter information of the target image; the target object can be determined according to the object information and text information of each object in the target image; the instruction information can be generated according to the parameter adjustment information and the object information of the target object.
  • The methods of generating parameter adjustment information according to the text information and the image parameter information of the target image, and of determining the target object according to the object information and text information of each object in the target image, have been described in detail in the above embodiments and will not be repeated here.
  • The sub-target image corresponding to the target object can be obtained according to the object position of the target object, the parameters of the sub-target image of the target object can then be adjusted according to the parameter adjustment information, and the parameter-adjusted target image or the parameter-adjusted sub-target image of the target object can be displayed.
  • According to the parameter-adjusted sub-target image of the target object, the object acquisition path and the object detail information of the target object can be acquired and displayed.
  • The candidate instruction information corresponding to each target image may be determined according to the method in the above embodiments, and the instruction information may then be determined according to the candidate instruction information of each target image.
  • FIG. 11 shows a schematic flowchart of another method for generating instruction information. As shown in FIG. 11, the flow includes at least steps S1110 to S1130. The details are as follows: In step S1110, the candidate instruction information corresponding to each target image is determined according to the feature information and text information of each target image.
  • multiple target images are derived from collected target videos, and feature information of each target image is acquired respectively, voice information in the target video is acquired, and text information in the voice information is recognized.
  • The multiple target images may correspond to one piece of voice information or to multiple pieces of voice information, and multiple pieces of candidate instruction information are determined according to the feature information of each target image and the text information of the voice information corresponding to each target image, respectively.
  • the candidate instruction information is configured as the instruction information.
  • The candidate instruction information corresponding to each target image is matched against the others; if the candidate instruction information corresponding to each target image matches completely, or the degree of matching between the pieces of candidate instruction information is greater than a matching degree threshold, any piece of candidate instruction information can be configured as the instruction information.
  • the matching degree threshold may be set according to the actual situation, for example, the matching degree threshold may be set to 99%, or may be set to 99.5%, etc., which is not specifically limited in the present disclosure.
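• A sketch of this matching step, using Python's standard difflib as one plausible way to compute a matching degree between candidate instruction strings (the 0.99 threshold mirrors the 99% example above; all names are illustrative):

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.99  # e.g. 99%; set according to the actual scenario

def match_degree(a, b):
    # matching degree between two candidate instruction strings
    return SequenceMatcher(None, a, b).ratio()

def select_if_consistent(candidates):
    """If every candidate matches the first one above the threshold,
    any one of them can be configured as the instruction information."""
    first = candidates[0]
    if all(match_degree(first, other) >= MATCH_THRESHOLD
           for other in candidates[1:]):
        return first
    return None  # otherwise fall back to confidence-based selection
```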
• when the candidate instruction information corresponding to the target images differs, the instruction information is determined according to the confidence level corresponding to each piece of candidate instruction information.
• the confidence level corresponding to a piece of candidate instruction information may be the confidence level of the user intent information in that candidate, the confidence level of the target object, or, for example, the product of the confidence level of the user intent information and that of the target object, which is not specifically limited in the present disclosure.
• the confidence level corresponding to the user intent information may be the degree of matching between the keywords in the text information and the keywords in the first preset mapping relationship; the confidence level corresponding to the target object may be the confidence of the object category or the object position of the target object, or the degree of matching between the object information of the target object and the text information.
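• The confidence-based fallback can be sketched as follows; the dictionary fields are hypothetical stand-ins for however a candidate carries its intent and object confidences, and the product is just one of the combinations the disclosure allows:

```python
def select_by_confidence(candidates):
    """Rank candidate instruction information by the product of the
    user-intent confidence and the target-object confidence."""
    return max(candidates,
               key=lambda c: c["intent_confidence"] * c["object_confidence"])

# candidates = [
#     {"text": "obtain purchase path for flower pot 1",
#      "intent_confidence": 0.92, "object_confidence": 0.88},
#     {"text": "obtain purchase path for flower pot 2",
#      "intent_confidence": 0.90, "object_confidence": 0.61},
# ]
```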
• the above-mentioned voice information, target image or target video, and line-of-sight information can also be acquired through an intelligent assistant, which may be an application running on the mobile terminal.
• a shortcut for quickly starting the intelligent assistant can be preset.
• for example, when the mobile terminal is in the screen-off state, the intelligent assistant can be entered by pressing the power button three times.
  • other shortcuts can also be used to enter the smart assistant, which is not specifically limited in the present disclosure.
• the instruction information acquisition method of this embodiment can thus activate the intelligent assistant on the mobile terminal through a shortcut, which simplifies the otherwise tedious launch steps and makes activation of the intelligent assistant more intelligent, rapid, and convenient.
• FIG. 12 shows a schematic flowchart of a method for acquiring instruction information according to a specific embodiment of the present disclosure. As shown in FIG. 12: in step S1201, the target image is acquired and object extraction is performed on it to obtain the object information of each object, where the object information includes the object category and the object position; in step S1203, the object topological relationship is determined according to the object category and the object position of each object; in step S1205, the voice information associated with the target image is acquired and the text information corresponding to the voice information is determined; in step S1207, word segmentation is performed on the text information to obtain one or more keywords, where the keywords include entity word segments and description word segments; in step S1209, the object category that matches the entity word segment is determined as the target object category; in step S1211, candidate objects are determined among the objects according to the target object category; in step S1213, the target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description word segments.
• FIG. 13 shows a schematic flowchart of a method for acquiring instruction information according to another specific embodiment of the present disclosure. As shown in FIG. 13: in step S1301, the target image is acquired and object extraction is performed on it to obtain the object information of each object, where the object information includes the object category and the object position; in step S1303, the voice information associated with the target image is acquired and the text information corresponding to the voice information is determined; in step S1305, word segmentation is performed on the text information to obtain one or more keywords, where the keywords include entity word segments and description word segments; in step S1307, the line-of-sight information associated with the target image is acquired and the gaze position corresponding to the line-of-sight information is determined; in step S1309, the object position that matches the gaze position is determined as the target object position; in step S1311, candidate objects are determined among the objects according to the target object position; in step S1313, the object information of the candidate objects is matched with the text information, and the target object is determined according to the matching result.
• FIG. 14 shows a schematic flowchart of a method for acquiring instruction information according to yet another specific embodiment of the present disclosure. As shown in FIG. 14: in step S1401, the target image is acquired and object extraction is performed on it to obtain the object information of each object, where the object information includes the object category and the object position; in step S1403, the object topological relationship is determined according to the object category and the object position of each object; in step S1405, the voice information associated with the target image is acquired and the text information corresponding to the voice information is determined; in step S1407, word segmentation is performed on the text information to obtain one or more keywords, where the keywords include entity word segments and description word segments; in step S1409, the line-of-sight information associated with the target image is acquired and the gaze position corresponding to the line-of-sight information is determined; in step S1411, the object category that matches the entity word segment is determined as the target object category.
• the target image is shown in FIG. 15.
  • the target image 1500 is identified according to the target detection algorithm.
• the target image 1500 includes 4 objects, whose object categories are "flower", "flower pot 1", "flower pot 2", and "water dispenser"; the text information corresponding to the user's voice information is "I want to buy the flower pot on the left", and the user intent information identified from the text information is "obtain the purchase path".
  • the gaze position 1501 of the user on the target image 1500 is acquired.
• the object topological relationship is determined from the object categories and object positions as follows: "at the leftmost of the image is a potted flower", "below the flower is flower pot 1", "at the rightmost of the image is the water dispenser", "to the left of the water dispenser is flower pot 2".
• the noun segment "flower pot" and the adverb segments "left" and "that" are obtained from the text information; the noun segment is then matched with the object categories to determine the target object category "flower pot".
• the candidate objects are accordingly determined as "flower pot 1" and "flower pot 2"; the object topological relationships corresponding to the candidate objects are matched with the adverb segments, and the candidate target objects remain "flower pot 1" and "flower pot 2".
• the object positions of "flower pot 1" and "flower pot 2" are matched with the gaze position, and the target object the user pays most attention to is determined to be "flower pot 1".
• the mobile terminal then sends the sub-target image corresponding to "flower pot 1" to the corresponding shopping website, so as to obtain a shopping link for the same item as "flower pot 1".
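• The flower-pot scenario can be summarised in a short hypothetical sketch that filters objects by the noun segment, then by the adverb segments against the topological relation, and finally disambiguates by gaze position (all field names and the distance heuristic are assumptions for illustration, not the disclosure's own algorithm):

```python
def pick_target_object(objects, noun, adverbs, gaze):
    """objects: dicts with "category", "box" (l, t, r, b), "relation".
    1) keep objects whose category matches the noun segment;
    2) keep those whose topological relation matches an adverb segment;
    3) pick the remaining candidate closest to the gaze position."""
    candidates = [o for o in objects if noun in o["category"]]
    filtered = [o for o in candidates
                if any(a in o["relation"] for a in adverbs)]
    candidates = filtered or candidates  # keep all if the filter empties

    def gaze_distance(obj):
        l, t, r, b = obj["box"]
        cx, cy = (l + r) / 2, (t + b) / 2
        return (cx - gaze[0]) ** 2 + (cy - gaze[1]) ** 2

    return min(candidates, key=gaze_distance)
```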
  • FIG. 16 schematically shows a block diagram of an apparatus for acquiring instruction information according to an embodiment of the present disclosure.
  • the instruction information acquisition apparatus 1600 includes an image information extraction module 1601 , a text information acquisition module 1602 , and an instruction information generation module 1603 .
• the image information extraction module 1601 is used to acquire the target image and perform information extraction on it to obtain the feature information of the target image.
• the text information acquisition module 1602 is used to acquire the voice information and recognize the text information corresponding to it, where the voice information is information associated with the target image.
  • the instruction information generating module 1603 is configured to generate instruction information according to the text information and feature information of the target image.
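• Read together, the three modules can be pictured as a thin pipeline; the sketch below only mirrors the division of labour of apparatus 1600 and assumes the three callables are supplied elsewhere:

```python
class InstructionInfoApparatus:
    """Minimal mirror of apparatus 1600 with its three modules."""

    def __init__(self, extract_features, recognize_text, generate_instruction):
        self.extract_features = extract_features           # module 1601
        self.recognize_text = recognize_text               # module 1602
        self.generate_instruction = generate_instruction   # module 1603

    def acquire_instruction(self, target_image, voice_info):
        features = self.extract_features(target_image)
        text = self.recognize_text(voice_info)  # voice associated with the image
        return self.generate_instruction(text, features)
```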
  • the image information extraction module 1601 may also be configured to perform object extraction on the target image to obtain object information of each object in the target image.
  • the instruction information generation module 1603 may also be configured to match each object information with text information respectively, determine the target object according to the matching result, and generate instruction information according to the object information of the target object.
• the instruction information generation module 1603 can also be used to determine the object topological relationship according to the object category and the object position; determine the object category that matches the entity word segment as the target object category; determine candidate objects among the objects according to the target object category; and determine the target object from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description word segments.
• here the object information includes the object category and the object position, and the text information includes entity word segments and description word segments.
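• One plausible way to derive such a topological relationship from bounding boxes is to compare box centres, as in this hedged sketch (the relation vocabulary and field names are assumptions, not the disclosure's):

```python
def object_topology(objects):
    """Derive pairwise relations such as "left of" / "above" from
    the centres of the objects' bounding boxes (l, t, r, b)."""
    def centre(box):
        l, t, r, b = box
        return (l + r) / 2, (t + b) / 2

    relations = []
    for a in objects:
        for b in objects:
            if a is b:
                continue
            (ax, ay), (bx, by) = centre(a["box"]), centre(b["box"])
            if ax < bx:
                relations.append((a["category"], "left of", b["category"]))
            if ay < by:  # image coordinates grow downwards
                relations.append((a["category"], "above", b["category"]))
    return relations
```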
• the instruction information generation module 1603 can also be used to obtain line-of-sight information and determine the gaze position corresponding to the line-of-sight information; determine the candidate target object from the candidate objects according to the object topological relationship corresponding to the candidate objects and the description word segments; match the object position of the candidate target object with the gaze position; and, when the object position of the candidate target object matches the gaze position, determine the candidate target object as the target object.
• the instruction information generation module 1603 can also be used to determine the object position that matches the gaze position as the target object position; determine candidate objects among the objects according to the target object position; and match the object information of the candidate objects with the text information, determining the target object according to the matching result.
  • the instruction information generation module 1603 may also be configured to determine user intent information according to text information; and generate instruction information according to the object information of the target object and the user intent information.
• the instruction information generation module 1603 can also be used to perform word segmentation on the text information to obtain one or more keywords, and to determine the user intent information corresponding to the keywords according to the first preset mapping relationship, where the first preset mapping relationship includes the association between keywords and user intent information.
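• The first preset mapping relationship can be as simple as a keyword-to-intent dictionary; the entries below are illustrative assumptions, not the mapping the disclosure actually defines:

```python
# first preset mapping relationship: keyword -> user intent information
FIRST_PRESET_MAPPING = {
    "buy": "obtain the purchase path",
    "brighter": "increase image brightness",
    "what is": "obtain detailed object information",
}

def intent_from_keywords(keywords):
    """Return the user intent information for the first keyword that
    matches an entry of the first preset mapping relationship."""
    for word in keywords:
        for key, intent in FIRST_PRESET_MAPPING.items():
            if key in word or word in key:
                return intent
    return None
```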
• the image information extraction module 1601 can also be used to obtain the object position of each object, the first predicted category of each object, and the first confidence level corresponding to the first predicted category; obtain the feature vector of each object according to the object position, and determine the second predicted category of each object and the second confidence level corresponding to the second predicted category according to the second preset mapping relationship; and determine the object category of each object according to the first predicted category and the second predicted category, together with their corresponding first and second confidence levels; wherein the second preset mapping relationship includes the association between feature vectors and second predicted categories.
• the image information extraction module 1601 can also be used to crop the target image according to the object positions to obtain the sub-target images corresponding to each object, and to perform feature extraction on the sub-target images to obtain the feature vectors of each object.
• the image information extraction module 1601 can also be used to determine whether the first predicted category is the same as the second predicted category; when they are the same, the first predicted category is configured as the object category of the object; when they differ, it is determined whether the first confidence level is greater than the second confidence level, and the object category is determined according to the judgment result.
• specifically, when the first confidence level is greater than the second confidence level, the first predicted category is configured as the object category of the object; when the first confidence level is less than or equal to the second confidence level, the second predicted category is configured as the object category of the object.
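• The decision rule for fusing the two predictions reduces to a few lines; this sketch simply restates the comparisons described above:

```python
def fuse_categories(first_cat, first_conf, second_cat, second_conf):
    """Agree -> take either category; disagree -> take the category with
    the higher confidence (ties go to the second predicted category)."""
    if first_cat == second_cat:
        return first_cat
    return first_cat if first_conf > second_conf else second_cat
```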
  • the image information extraction module 1601 may also be configured to perform information extraction on the target image to obtain image parameter information of the target image.
  • the instruction information generation module 1603 may also be configured to determine user intent information according to text information; and generate parameter adjustment information according to the user intent information and image parameter information.
• the instruction information generation module 1603 can also be configured to generate parameter adjustment information according to the text information and the image parameter information of the target image; determine the target object according to the object information of each object in the target image and the text information; and generate the instruction information according to the parameter adjustment information and the object information of the target object.
• the instruction information generation module 1603 may also be configured to determine the candidate instruction information corresponding to each target image according to the feature information and text information of each target image; when the candidate instruction information corresponding to each target image is the same, configure the candidate instruction information as the instruction information; and when the candidate instruction information differs, determine the instruction information according to the confidence level corresponding to each piece of candidate instruction information.
• in this case there may be multiple target images.
• the instruction information acquisition apparatus may further include an information display module (not shown in the figure), configured to acquire and display, according to the user intent information in the instruction information, the object acquisition path related to the object information of the target object; and/or to acquire and display, according to the user intent information in the instruction information, the detailed object information related to the object information of the target object.
  • the information display module may also be configured to perform parameter adjustment on the target image according to the parameter adjustment information, and display the parameter-adjusted target image.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-described method of the present specification is stored.
• various aspects of the present disclosure can also be implemented in the form of a program product that includes program code; when the program product runs on a mobile terminal, the program code causes the mobile terminal to execute the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of the present disclosure.
• for example, any one or more of the steps in FIGS. 3 to 14 may be performed.
• Exemplary embodiments of the present disclosure also provide a program product for implementing the above method, which may adopt a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer.
• however, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer readable signal medium may include a propagated data signal in baseband or as part of a carrier wave with readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium can also be any readable medium, other than a readable storage medium, that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
• Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar.
• the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
• the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

Provided are an instruction information acquisition method and apparatus, a readable storage medium, and an electronic device, relating to the technical field of artificial intelligence. The method comprises: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image (S310); acquiring voice information and identifying text information corresponding to the voice information, wherein the voice information is information associated with the target image (S320); and generating instruction information according to the text information and the feature information of the target image (S330). By fusing the feature information of the target image and the voice information associated with the target image to generate instruction information, the present technical solution improves the accuracy of instruction information.

Description

Instruction information acquisition method and device, readable storage medium, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 202110292701.7, titled "Instruction Information Acquisition Method and Device, Readable Storage Medium, Electronic Equipment", filed on March 18, 2021, the entire content of which is incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the technical field of artificial intelligence, and in particular to a method for acquiring instruction information, an apparatus for acquiring instruction information, a computer-readable storage medium, and an electronic device.
BACKGROUND
With the rapid development of artificial intelligence, more and more mobile terminals are installed with applications having voice assistant or visual assistant functions, so as to better interact with users.
In the related art, a mobile terminal can realize functions such as voice control and information query through a voice assistant, and can realize functions such as image information acquisition through a visual assistant. However, it is difficult for existing voice assistants or visual assistants to accurately generate user instruction information, resulting in a poor user experience.
SUMMARY OF THE INVENTION
The purpose of the present disclosure is to provide a method for acquiring instruction information, an apparatus for acquiring instruction information, a computer-readable storage medium, and an electronic device, so as to solve, at least to a certain extent, the problem in the related art that instruction information is difficult to generate accurately.
According to a first aspect of the present disclosure, there is provided a method for acquiring instruction information, the method comprising: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image; acquiring voice information and recognizing text information corresponding to the voice information, wherein the voice information is information associated with the target image; and generating instruction information according to the text information and the feature information of the target image.
According to a second aspect of the present disclosure, there is provided an apparatus for acquiring instruction information, the apparatus comprising: an image information extraction module, configured to acquire a target image and perform information extraction on the target image to obtain feature information of the target image; a text information acquisition module, configured to acquire voice information and recognize text information corresponding to the voice information, wherein the voice information is information associated with the target image; and an instruction information generation module, configured to generate instruction information according to the text information and the feature information of the target image.
According to a third aspect of the present disclosure, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method for acquiring instruction information described in the above embodiments.
According to a fourth aspect of the present disclosure, there is provided an electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for acquiring instruction information described in the above embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 schematically shows a system architecture of the present exemplary embodiment;
FIG. 2 schematically shows an electronic device of the present exemplary embodiment;
FIG. 3 schematically shows a flowchart of a method for acquiring instruction information according to an embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of a method for acquiring object information according to an embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of a method for determining an object category according to an embodiment of the present disclosure;
FIG. 6 schematically shows a flowchart of a method for generating instruction information according to an embodiment of the present disclosure;
FIG. 7 schematically shows a flowchart of determining a target object according to a matching result according to an embodiment of the present disclosure;
FIG. 8 schematically shows a flowchart of determining a target object from candidate objects according to an embodiment of the present disclosure;
FIG. 9 schematically shows a flowchart of a method for determining a target object according to an embodiment of the present disclosure;
FIG. 10 schematically shows a flowchart of another method for generating instruction information according to an embodiment of the present disclosure;
FIG. 11 schematically shows a flowchart of yet another method for generating instruction information according to an embodiment of the present disclosure;
FIG. 12 schematically shows a flowchart of a method for acquiring instruction information according to a specific embodiment of the present disclosure;
FIG. 13 schematically shows a flowchart of a method for acquiring instruction information according to another specific embodiment of the present disclosure;
FIG. 14 schematically shows a flowchart of a method for acquiring instruction information according to yet another specific embodiment of the present disclosure;
FIG. 15 schematically shows a target image in a specific application scenario of the present disclosure;
FIG. 16 schematically shows a block diagram of an apparatus for acquiring instruction information according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
The block diagrams shown in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the figures are only illustrative and do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps can be decomposed and some can be combined or partially combined, so the actual execution order may change according to the actual situation.
In the related art, a mobile terminal is installed with applications having visual assistant and voice assistant functions. The visual assistant mainly captures visual information of the user's environment, analyzes the visual information presented as pictures or video, understands the environment, the objects, and the relationships between objects, further infers the user's intent, and provides the user with reasonable recommendations. The voice assistant mainly captures the user's voice information, converts it into text, further analyzes the user's intent, and realizes intelligent interaction with the user. However, when the visual assistant analyzes visual information presented as pictures or video, the user's intent may be judged inaccurately, or, when there are multiple objects in the picture or video, the object the user is most interested in may be judged inaccurately. For the voice assistant, noisy background sound, aging equipment causing unclear recording, or unclear expression of the user's meaning make it difficult to accurately analyze the user's intent.
Based on the problems existing in the related art, the embodiments of the present disclosure first provide a method for acquiring instruction information, which is applied to the system architecture of the exemplary embodiments of the present disclosure.
FIG. 1 shows a schematic diagram of a system architecture according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the system architecture 100 may include a terminal 110, a network 120, and a server 130. The terminal 110 may be any of various electronic devices with image capturing and audio collection functions, including but not limited to mobile phones, tablet computers, digital cameras, and personal computers. The network 120 is the medium used to provide a communication link between the terminal 110 and the server 130 and may include various connection types, such as wired or wireless communication links or fiber-optic cables. It should be understood that the numbers of terminals, networks, and servers in FIG. 1 are merely illustrative; there can be any number of each according to implementation needs. For example, the server 130 may be a server cluster composed of multiple servers.
The method for acquiring instruction information provided by the embodiments of the present disclosure may be executed by the terminal 110; for example, after the terminal 110 acquires the voice information and the target image, it generates the instruction information.
In addition, the method for acquiring instruction information provided by the embodiments of the present disclosure may also be executed by the server 130; for example, after the terminal 110 acquires the voice information and the target image, it uploads them to the server 130 so that the server 130 generates the instruction information, which is not limited in the present disclosure.
Exemplary embodiments of the present disclosure provide an electronic device for implementing the method for acquiring instruction information, which may be the terminal 110 or the server 130 in FIG. 1. The electronic device includes at least a processor and a memory; the memory stores executable instructions of the processor, and the processor is configured to execute the method for acquiring instruction information by executing the executable instructions.
The electronic device can be implemented in various forms, for example, mobile devices such as mobile phones, tablet computers, notebook computers, personal digital assistants (PDAs), navigation devices, wearable devices, and drones, as well as fixed devices such as desktop computers and smart TVs.
The following takes the mobile terminal 200 in FIG. 2 as an example to illustrate the structure of the electronic device. It will be understood by those skilled in the art that, apart from components specifically intended for mobile purposes, the configuration in FIG. 2 can also be applied to fixed-type devices. In other embodiments, the mobile terminal 200 may include more or fewer components than shown, combine some components, split some components, or use a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interface connection relationships between the components are only shown schematically and do not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also adopt an interface connection manner different from that in FIG. 2, or a combination of multiple interface connection manners.
As shown in FIG. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a USB interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display screen 290, a camera module 291, an indicator 292, a motor 293, keys 294, and a Subscriber Identification Module (SIM) card interface 295. The sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyro sensor 2803, an air pressure sensor 2804, and the like. The mobile terminal 200 implements its display function through a graphics processing unit (GPU), the display screen 290, an application processor, and the like. The GPU performs mathematical and geometric calculations for graphics rendering and connects the display screen 290 and the application processor. The processor 210 may include one or more GPUs that execute program instructions to generate or alter display information. The mobile terminal 200 may include one or more display screens 290 for displaying images, videos, and the like. The mobile terminal 200 may implement a shooting function through an image signal processor (ISP), the camera module 291, an encoder, a decoder, the GPU, the display screen 290, the application processor, and the like. The camera module 291 captures still images or videos, collecting light signals through photosensitive elements and converting them into electrical signals. The ISP processes the data fed back by the camera module 291 and converts the electrical signals into digital image signals. The mobile terminal 200 may implement audio functions, such as music playback and recording, through the audio module 270, the speaker 271, the receiver 272, the microphone 273, the earphone interface 274, the application processor, and the like. The audio module 270 converts digital audio information into analog audio signal output and converts analog audio input into digital audio signals; it may also be used to encode and decode audio signals. The speaker 271 and the receiver 272 convert audio electrical signals into sound signals, and the microphone 273 converts sound signals into electrical signals. The earphone interface 274 is used for connecting wired earphones. The keys 294 include a power key, volume keys, and the like, and may be mechanical keys or touch keys. The mobile terminal 200 may receive key inputs and generate key signal inputs related to user settings and function control of the mobile terminal 200.
The method and apparatus for acquiring instruction information according to exemplary embodiments of the present disclosure are described in detail below. FIG. 3 shows a schematic flowchart of the method for acquiring instruction information. As shown in FIG. 3, the method includes at least the following steps. Step S310: acquiring a target image and performing information extraction on the target image to obtain feature information of the target image. Step S320: acquiring voice information and recognizing text information corresponding to the voice information, wherein the voice information is information associated with the target image. Step S330: generating instruction information according to the text information and the feature information of the target image.
The method for acquiring instruction information in the present disclosure can fuse the feature information of the target image with the voice information associated with the target image to generate the instruction information, which improves the accuracy of the instruction information and thus the interaction experience between the user and the mobile terminal. To make the technical solutions of the present disclosure clearer, the method is described next.
In step S310, a target image is acquired and information extraction is performed on the target image to obtain feature information of the target image.
In an exemplary embodiment of the present disclosure, the target image may be an image captured in real time by the camera function of the mobile terminal, or a local image stored in the mobile terminal. The user can send an image acquisition request to the mobile terminal, and the mobile terminal determines the target image according to the request. The image acquisition request may be a shooting request, in response to which the mobile terminal enables the shooting function and collects the target image in real time. The shooting request may be the user triggering a shooting button on the mobile terminal, for example tapping the camera icon, or the user waking up the shooting function with a preset voice. The image acquisition request may also be an image selection request, in response to which the mobile terminal displays local images and, in response to the user's trigger operation on a local image, determines the target image among the local images. In addition, there may be one or more target images. For example, the mobile terminal enables the shooting function, collects a target video through the camera module, acquires one video frame every preset time period in the target video, and uses the acquired video frames as target images. The preset time period may be set according to the actual situation; for example, a video frame may be acquired every 30 ms in the target video, which is not specifically limited in the present disclosure.
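As a hedged sketch of this frame-sampling step (assuming OpenCV is available; the 30 ms interval is the example given above, not a mandated value), one frame can be taken from the target video every preset time period:

```python
import cv2

def sample_target_images(video_path, interval_ms=30.0):
    """Grab one frame from the target video every `interval_ms`
    milliseconds and return the frames as target images."""
    capture = cv2.VideoCapture(video_path)
    frames, next_stamp = [], 0.0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if capture.get(cv2.CAP_PROP_POS_MSEC) >= next_stamp:
            frames.append(frame)        # this frame becomes a target image
            next_stamp += interval_ms   # frames in between are skipped
    capture.release()
    return frames
```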
In an exemplary embodiment of the present disclosure, the feature information of the target image includes object information of each object in the target image and/or image parameter information of the target image. The object information includes an object category and an object position, and the image parameter information may include parameters such as image brightness, chroma, contrast, saturation, or sharpness. In an exemplary embodiment of the present disclosure, object extraction is performed on the target image to acquire the object information of each object in the target image, where the object information includes the object category and the object position.
Specifically, object extraction is performed on the target image through a target detection model or an image segmentation model, and the object category and the object position of each object in the target image are acquired. The target detection model may be a Faster R-CNN model, a RetinaNet model, or a YOLO model, and the image segmentation model may be a DeepLab-V3 model, a RefineNet model, or a PSPNet model. In addition, a saliency detection model may also be used to extract objects from the target image and acquire the object positions of the objects. The saliency detection model may be based on the spectral residual method or on global contrast, which is not specifically limited in the present disclosure.
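For illustration only, the object-extraction step could be backed by an off-the-shelf detector such as torchvision's Faster R-CNN; this is a sketch of one possible choice among the models listed above, not the model the disclosure mandates:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def extract_objects(image):  # image: a PIL.Image target image
    with torch.no_grad():
        output = model([to_tensor(image)])[0]
    # boxes -> object positions, labels -> first predicted categories,
    # scores -> first confidence levels
    return output["boxes"], output["labels"], output["scores"]
```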
After the object positions of the objects in the target image are obtained through the saliency detection model, the object category of each object is determined according to its object position. The detailed process is as follows: first, the target image is cropped according to the object position of each object to obtain the sub-target image corresponding to each object; then, feature extraction is performed on each sub-target image to obtain the feature vector of each object; finally, the second predicted category corresponding to the feature vector of each object is determined according to the second preset mapping relationship, and this second predicted category is configured as the object category of the object. The second preset mapping relationship includes the association between feature vectors and second predicted categories. In addition, multiple target image samples are acquired in advance, and the binary classification model, the target detection model, the image segmentation model, and the saliency detection model are trained on them; a target image sample may be an image annotated with rectangular boxes or masks.
In an exemplary embodiment of the present disclosure, before object extraction is performed on the target image, whether an object exists in the target image can be judged from the pixel values of the pixels in the target image. If all pixels have the same value, it is determined that no object exists in the target image; if the pixel values differ, it is determined that one or more objects exist. Whether an object exists can also be judged by a binary classification model. When an object exists in the target image, object extraction is performed and the object information of each object is acquired. It is also possible to judge in advance whether an object exists in the picture collected by the camera module and, when one does, to acquire the target image or target video in real time.
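The flat-image check can be sketched directly on the pixel array (assuming NumPy; a real implementation would likely tolerate sensor noise rather than demand exact equality):

```python
import numpy as np

def image_may_contain_objects(image):
    """All pixel values identical -> flat image, judged to contain no
    object; any variation -> one or more objects may exist."""
    pixels = np.asarray(image)
    flat = pixels.reshape(pixels.shape[0] * pixels.shape[1], -1)
    return bool(np.any(flat != flat[0]))
```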
In an exemplary embodiment of the present disclosure, FIG. 4 shows a schematic flowchart of a method for acquiring object information. As shown in FIG. 4, the process may include at least steps S410 to S430, detailed as follows. In step S410, the object position of each object, the first predicted category of each object, and the first confidence level corresponding to the first predicted category are acquired. In an exemplary embodiment of the present disclosure, the target image is input into the target detection model or the image segmentation model to obtain the object position of each object, the first predicted category of each object, and the corresponding first confidence level. The object position may include the position coordinates of the object in the target image; specifically, it may be the coordinate set of the detection box in which the object is located, including the start and end coordinates in the horizontal direction and the start and end coordinates in the vertical direction. The object position may also be the starting coordinate point of the detection box plus the size of the box, where the starting point includes the horizontal and vertical starting coordinates and the size includes the horizontal and vertical dimensions. The first confidence level represents the probability that the first predicted category of the object is its real object category.
In step S420, the feature vector of each object is obtained according to the object position, and the second predicted category of each object and the corresponding second confidence level are determined according to the second preset mapping relationship. In an exemplary embodiment of the present disclosure, the target image is cropped according to the object positions to obtain the sub-target image corresponding to each object, and feature extraction is performed on each sub-target image to obtain the feature vector of each object. Specifically, the sub-target image is input into a feature extraction model to obtain its corresponding feature vector. The feature extraction model may be a color histogram model, which extracts the color features of the sub-target image; a local binary pattern (LBP) model or a gray-level co-occurrence matrix model, which extracts local texture features of the image; or a Canny or Sobel edge detection model, which extracts the edge features of the sub-target image. The feature extraction model may also be a combination of two or more of these models, so that the feature vector of each object is constructed from one or more of the color, local texture, and edge features of its sub-target image. In addition, sub-target image samples, i.e. images containing only a single object, are acquired in advance to train the feature extraction model.
In an exemplary embodiment of the present disclosure, the second preset mapping relationship includes the association between feature vectors and second predicted categories. The feature vector of an object is matched against one or more feature vectors in the second preset mapping relationship, and the matching degrees are obtained. The second predicted category corresponding to the feature vector with the highest matching degree is configured as the second predicted category of the object, and that matching degree is configured as the second confidence level.
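A hedged sketch of this matching, using a colour histogram as the feature vector and cosine similarity as the matching degree (both are just one admissible choice among those listed above; OpenCV and NumPy are assumed):

```python
import cv2
import numpy as np

def colour_histogram(sub_image):
    # a simple feature vector: a normalised 3-D colour histogram
    hist = cv2.calcHist([sub_image], [0, 1, 2], None,
                        [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def second_prediction(feature, preset_mapping):
    """preset_mapping: {category: stored feature vector}. The best cosine
    match yields the second predicted category, and the matching degree
    itself serves as the second confidence level."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(((cat, cosine(feature, vec))
                for cat, vec in preset_mapping.items()),
               key=lambda pair: pair[1])
```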
在步骤S430中,根据第一预测类别和第二预测类别,以及第一预测类别对应的第一置信度和第二预测类别对应的第二置信度确定各对象的对象类别。具体地,图5示出了确定对象类别的方法流程示意图,如图5所示,该流程至少包括步骤S510至步骤S530,详细介绍如下:在步骤S510中,判断第一预测类别与第二预测类别是否相同。在本公开的示例性实施例中,可以将第一预测类别与第二预测类别对应的类别标识进行比对;若第一预测类别对应的类别标识与第二预测类别对应的类别标识相同,则判定第一预测类别与第二预测类别相同;若第一预测类别对应的类别标识与第二预测类别对应的类别标识不同,则判定第一预测类别与第二预测类别不同。在步骤S520中,在第一预测类别与第二预测类别相同时,将第一预测类别或第二预测类别配置为各对象的对象类别。在本公开的示例性实施例中,由于第一预测类别与第二预测类别相同,则可以将第一预测类别或第二预测类别配置为各对象的对象类别。在步骤S530中,在第一预测类别与第二预测类别不同时,判断第一置信度是否大于第二置信度,根据判断结果确定各对象的对象类别。在本公开的示例性实施例中,在第一置信度大于第二置信度时,将第一预测类别配置为各对象的对象类别;在第一置信度小于等于第二置信度时,将第二预测类别配置为各对象的对象类别。In step S430, the object category of each object is determined according to the first predicted category and the second predicted category, and the first confidence level corresponding to the first predicted category and the second confidence level corresponding to the second predicted category. Specifically, FIG. 5 shows a schematic flow chart of a method for determining an object category. As shown in FIG. 5 , the flow includes at least steps S510 to S530. The details are as follows: In step S510, the first prediction category and the second prediction category are determined. whether the categories are the same. In an exemplary embodiment of the present disclosure, the category identifiers corresponding to the first predicted category and the second predicted category may be compared; if the category identifier corresponding to the first predicted category is the same as the category identifier corresponding to the second predicted category, then It is determined that the first predicted category is the same as the second predicted category; if the category identifier corresponding to the first predicted category is different from the category identifier corresponding to the second predicted category, it is determined that the first predicted category is different from the second predicted category. In step S520, when the first predicted class is the same as the second predicted class, the first predicted class or the second predicted class is configured as the object class of each object. In an exemplary embodiment of the present disclosure, since the first predicted category is the same as the second predicted category, the first predicted category or the second predicted category may be configured as the object category of each object. In step S530, when the first predicted category is different from the second predicted category, it is determined whether the first confidence level is greater than the second confidence level, and the object category of each object is determined according to the judgment result. In an exemplary embodiment of the present disclosure, when the first confidence level is greater than the second confidence level, the first prediction category is configured as the object category of each object; when the first confidence level is less than or equal to the second confidence level, the first prediction category is configured as the object category of each object; The second prediction category is configured as the object category of each object.
In an exemplary embodiment of the present disclosure, it may further be judged whether the first confidence level and the second confidence level are each greater than or equal to a confidence threshold. If the first confidence level reaches the threshold and the second does not, the first predicted category is taken as the object category of the object; if the second confidence level reaches the threshold and the first does not, the second predicted category is taken as the object category; if both reach the threshold, the object category of each object is determined according to the foregoing embodiment; and if neither reaches the threshold, the object and its corresponding object information may be discarded.
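Putting steps S510 to S530 together with the threshold check above, the combined decision rule can be sketched as follows; the default threshold value and the `None` return for a discarded object are assumptions.

```python
def resolve_object_category(cat1, conf1, cat2, conf2, threshold=0.5):
    """Fuse the first (cat1, conf1) and second (cat2, conf2) predictions."""
    ok1, ok2 = conf1 >= threshold, conf2 >= threshold
    if not ok1 and not ok2:
        return None                      # discard the object and its object information
    if ok1 and not ok2:
        return cat1
    if ok2 and not ok1:
        return cat2
    if cat1 == cat2:                     # step S520: identical predictions
        return cat1
    # step S530: higher confidence wins; ties go to the second predicted category
    return cat1 if conf1 > conf2 else cat2
```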
In an exemplary embodiment of the present disclosure, information extraction may be performed on the target image to obtain image parameter information of the target image, which may include parameters such as image brightness, chroma, contrast, saturation, or sharpness. Specifically, the image parameter information of the target image may be determined from shooting parameters obtained from the camera module, or from the exchangeable image file (EXIF) information of the target image.
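As one hedged illustration of the EXIF route, the snippet below reads EXIF fields with Pillow. The field names are standard EXIF tags, but not every camera writes them, and most of them live in the Exif sub-IFD (tag 0x8769) rather than the base IFD.

```python
from PIL import Image, ExifTags

def image_parameter_info(path):
    """Derive image parameter information from a file's EXIF data (a sketch)."""
    exif = Image.open(path).getexif()
    tags = dict(exif)
    tags.update(exif.get_ifd(0x8769))    # Exif sub-IFD holds BrightnessValue etc.
    named = {ExifTags.TAGS.get(tag_id, tag_id): value
             for tag_id, value in tags.items()}
    # keep only the parameter fields this embodiment cares about, if present
    return {key: named[key]
            for key in ("BrightnessValue", "Contrast", "Saturation", "Sharpness")
            if key in named}
```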
Continuing to refer to FIG. 3, in step S320, voice information is acquired and text information corresponding to the voice information is recognized, wherein the voice information is information associated with the target image. In an exemplary embodiment of the present disclosure, the voice information may be information associated with the target image. That is, during the time period in which the mobile terminal acquires the target image, or during the time period in which the mobile terminal displays the target image, the recording function may be enabled to collect the user's voice information in real time. For example, in response to a user's video shooting request, the mobile terminal enables the shooting function and the recording function simultaneously, obtains a target video, and extracts multiple target images and the voice information from that video. The video shooting request may be formed by the user's trigger operation on the camera, or by the user's trigger operation on a scanning function in a smart assistant; the mobile terminal may obtain the function permissions of the smart assistant in advance, so that the camera function and the recording function are enabled when the user triggers the scanning function of the smart assistant. For another example, the mobile terminal acquires the target image in advance and stores it in internal or external memory; at the user's request, it retrieves the target image from memory and displays it on the display screen, and during the display of the target image the recording function is enabled by the user's recording request to collect the user's voice information in real time. The video shooting request and the recording request may be formed by the user triggering a video shooting button or recording button on the mobile terminal; for example, when the user taps the camera icon or recording icon, the mobile terminal starts the video shooting function or recording function. The requests may also be formed by the user waking up the video shooting function or recording function of the mobile terminal through a preset voice, which may be voice information customized by the user or preset by the mobile terminal; the present disclosure does not specifically limit this.
In an exemplary embodiment of the present disclosure, after the user's voice information is acquired, the text information corresponding to it is recognized as follows. First, the voice information is preprocessed; the preprocessing may include framing, windowing, and pre-emphasis. For example, pre-emphasis is applied to the speech sequence corresponding to the voice information to increase its high-frequency resolution; the pre-emphasized sequence is then divided into frames to obtain multiple speech subsequences; and each subsequence is windowed, which may include multiplying it by a window function such as a rectangular window, a Hamming window, or a Hanning window. Next, speech features are extracted from the preprocessed voice information. Specifically, the feature parameters of voice information include Mel Frequency Cepstrum Coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC), Line Spectrum Frequencies (LSF), Wavelet Transform Coefficients (WTC), and so on; one or more of these feature parameters may be extracted and used as the speech features corresponding to the voice information. Finally, the speech features may be matched against a speech feature template to obtain the corresponding text information: the speech features are matched against the multiple speech feature samples in the template, and when the speech features match a speech feature sample, the text information sample corresponding to that sample is configured as the text information of the voice information.
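The preprocessing stage (pre-emphasis, framing, windowing) can be sketched directly. The defaults below, frame length 400 and hop 160, correspond to 25 ms frames with a 10 ms step at a 16 kHz sample rate; the sample rate, constants, and zero-padding of the final frame are all assumed choices.

```python
import numpy as np

def preprocess_speech(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a 1-D speech sequence."""
    signal = np.asarray(signal, dtype=float)
    # pre-emphasis raises high-frequency resolution: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # zero-pad so the last frame is complete, then split into overlapping frames
    n_frames = 1 + int(np.ceil(max(0, len(emphasized) - frame_len) / hop))
    padded = np.append(emphasized,
                       np.zeros((n_frames - 1) * hop + frame_len - len(emphasized)))
    frames = np.stack([padded[i * hop: i * hop + frame_len] for i in range(n_frames)])
    # windowing: multiply every frame (speech subsequence) by a Hamming window
    return frames * np.hamming(frame_len)
```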
In addition, if the speech features corresponding to the voice information cannot be matched exactly against the speech feature samples of the template, the matching degree between each speech feature sample and the speech features is obtained, and the text information sample corresponding to the sample with the largest matching degree is configured as the text information of the voice information. The speech feature template includes multiple speech feature samples and the text information sample corresponding to each. The template is constructed as follows: first, multiple text information samples are obtained, together with the voice information corresponding to each text information sample; then, the speech feature samples corresponding to that voice information are obtained according to the process described above; finally, the template is built from the mapping relationship between the text information samples and their corresponding speech feature samples.
In an exemplary embodiment of the present disclosure, when the target image or target video is acquired or displayed, the recording function is enabled and used to judge whether voice information is present; if voice information is present, it is collected. In an exemplary embodiment of the present disclosure, after the text information corresponding to the voice information is obtained, word segmentation is performed on the text information to obtain one or more keywords. The word segmentation may use either of two methods. The first is dictionary-based segmentation: the text information is split into multiple words according to a dictionary, and the words are then combined. The dictionary may be built in advance with its entries annotated by part of speech, so that after the text information is divided into keywords, the part of speech of each keyword follows from the dictionary; alternatively, an unannotated dictionary may be used, with part-of-speech recognition performed on each keyword after segmentation. According to part of speech, the keywords corresponding to the text information may include entity participles, description participles, verb participles, and so on. An entity participle denotes or refers to a real object, such as a noun or pronoun participle ("flower", "clothes", "you"); a description participle expresses relationships between objects or describes an item, such as an adjective or adverb participle ("left side", "beautiful", "so dark"). The second method is character-based segmentation: the text information is split into individual characters, which are then combined into words, for example according to the dictionary. Of course, a statistics-based segmentation algorithm may also be used; the present disclosure does not specifically limit the segmentation algorithm.
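A minimal dictionary-based segmenter in the forward-maximum-matching style could look like the following; the POS-tagged dictionary layout and the fallback to single characters are assumptions, not the embodiment's prescribed algorithm.

```python
def segment(text, dictionary):
    """Forward maximum matching against a POS-tagged dictionary.
    dictionary: word -> part of speech, e.g. {"flowerpot": "noun", "left": "adverb"}."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest candidate first
            word = text[i:j]
            if word in dictionary or j == i + 1:  # single-character fallback
                tokens.append((word, dictionary.get(word, "unknown")))
                i = j
                break
    return tokens
```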
Continuing to refer to FIG. 3, in step S330, instruction information is generated according to the text information and the feature information of the target image. In an exemplary embodiment of the present disclosure, the feature information of the target image is the object category and object position of each object in the target image. FIG. 6 shows a schematic flowchart of a method for generating instruction information; as shown in FIG. 6, the flow includes at least steps S610 and S620, detailed as follows. In step S610, the object information of each object is matched against the text information, and the target object is determined according to the matching result. In an exemplary embodiment of the present disclosure, the object category and object position of each object are matched against the entity participles and description participles in the text information; when an object's category and position match the entity and description participles, the object is determined as the target object.
Specifically, FIG. 7 shows a schematic flowchart of a method for determining the target object according to the matching result; as shown in FIG. 7, the flow includes at least steps S710 to S740, detailed as follows. In step S710, the object topological relationship is determined according to the object categories and object positions. In an exemplary embodiment of the present disclosure, the object position includes the position coordinates of each object in the target image; subtracting the position coordinates of any two objects yields the relative positional relationship between them. The object category of each object is used as a label, and the object topological relationship is generated from the relative positional relationships between objects; it thus records the object category and object position of each object and the relative positional relationships between objects. This step may also be performed after the object information of each object is acquired, which the present disclosure does not specifically limit. In step S720, the object category matching an entity participle is determined as the target object category. In an exemplary embodiment of the present disclosure, the entity participles in the text information are matched against the object category of each object; when an object category matches an entity participle, it is determined as the target object category. This embodiment uses the entity participles of the voice information to screen the multiple object categories: if one or more object categories appear in the text information corresponding to the voice information, those categories are determined as target object categories. In step S730, candidate objects are determined among the objects according to the target object category. In an exemplary embodiment of the present disclosure, several objects in the target image may share the same object category; the target object category screened out in the previous step further filters the multiple objects, and an object whose category is the target object category is determined as a candidate object. In step S740, the target object is determined from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description participles. In an exemplary embodiment of the present disclosure, the object topological relationship corresponding to each candidate object is identified within the overall topology, the description participles in the text information are matched against it, and when a description participle matches a candidate's topological relationship, that candidate object is determined as the target object.
This embodiment uses the description participles of the voice information to screen the one or more candidate objects and thereby determine the target object among them, improving the accuracy of information acquisition. In addition, the number of candidate objects may be obtained; when there is a single candidate object, it is directly determined as the target object, improving the efficiency of information acquisition and reducing system consumption. It should be noted that step S710 may be performed before step S720, after step S730, or simultaneously with steps S720 and S730; the present disclosure does not specifically limit this.
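The steps of FIG. 7 can be illustrated with a small sketch. The box-centre representation, the left-of/right-of relation, and the handling of only the description word "left" are assumptions, chosen to mirror the flowerpot example given later in this description.

```python
def build_topology(objects):
    """objects: dicts with 'category' and 'position' (x, y box centre).
    Step S710: pairwise relative positions labelled by category."""
    relations = []
    for a in objects:
        for b in objects:
            if a is not b:
                rel = "left-of" if a["position"][0] < b["position"][0] else "right-of"
                relations.append((a["category"], rel, b["category"]))
    return relations

def find_target(objects, entity_words, description_words):
    # steps S720/S730: entity participles pick the target category and its candidates
    candidates = [o for o in objects if o["category"] in entity_words]
    if len(candidates) == 1:                  # single candidate: return it directly
        return candidates[0]
    # step S740: description participles filter candidates via the topology;
    # only "left" is handled here, as in the flowerpot example
    if "left" in description_words and candidates:
        return min(candidates, key=lambda o: o["position"][0])
    return candidates[0] if candidates else None
```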
In an exemplary embodiment of the present disclosure, sight line information may also be acquired and the gaze position corresponding to it determined. The gaze position may be a gaze point, or a gaze area on a two-dimensional plane. The sight line information is generated for the target image and is acquired in real time while the target image is being shot. Since there may be multiple target images, there may also be multiple pieces of sight line information, and the association between a target image and a piece of sight line information is determined from the shooting time of the target image and the acquisition time of the sight line information. The user's sight line information for the target image may be obtained through the camera module or smart screen of the mobile terminal, for example in real time through a camera module built into a smart helmet or smart glasses. Specifically, the sight line information may further include a left-eye image, a right-eye image, a face image, and a face position; the face image can provide head posture information and the face position can provide eye position information. With the sight line information as input, a gaze point estimation algorithm determines the gaze point corresponding to the sight line information. A head picture and head position may also be used as input to determine a corresponding gaze area; the present disclosure does not specifically limit how the gaze position is obtained.
In an exemplary embodiment of the present disclosure, the gaze area corresponding to the sight line information allows the target objects determined in the foregoing embodiment to be screened more accurately, so as to determine the target object the user pays most attention to. FIG. 8 shows a schematic flowchart of a method for determining the target object from the candidate objects; as shown in FIG. 8, the flow includes at least steps S810 to S830, detailed as follows. In step S810, candidate target objects are determined from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description participles. In an exemplary embodiment of the present disclosure, when multiple candidate objects match the description participles within the topology of the candidate objects, those candidate objects are taken as candidate target objects. In step S820, the object position of each candidate target object is matched against the gaze position. In an exemplary embodiment of the present disclosure, the object position of each candidate target object is obtained and matched against the gaze position: if the gaze position is a gaze point, it is judged whether the point lies within the detection box determined by the object position of each candidate target object; if the gaze position is a gaze area, the degree of overlap between the area and each candidate target object's detection box is calculated. In step S830, when the object position of a candidate target object matches the gaze position, that candidate target object is determined as the target object. In an exemplary embodiment of the present disclosure, if the gaze point lies within the detection box corresponding to a candidate target object, its object position is judged to match the gaze position and the candidate target object is determined as the target object; alternatively, the candidate target object whose detection box has the largest overlap with the gaze area may be obtained and determined as the target object.
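Steps S820 and S830 amount to a point-in-box test for a gaze point and an overlap-area maximization for a gaze region. The sketch below assumes axis-aligned detection boxes stored as (x1, y1, x2, y2).

```python
def match_gaze(candidates, gaze):
    """candidates: objects with 'box' = (x1, y1, x2, y2).
    gaze: either a point (x, y) or a region (x1, y1, x2, y2)."""
    if len(gaze) == 2:                       # gaze point: point-in-box test
        x, y = gaze
        for obj in candidates:
            x1, y1, x2, y2 = obj["box"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                return obj
        return None
    # gaze region: pick the candidate whose box overlaps the region most
    def overlap(box):
        x1 = max(box[0], gaze[0]); y1 = max(box[1], gaze[1])
        x2 = min(box[2], gaze[2]); y2 = min(box[3], gaze[3])
        return max(0, x2 - x1) * max(0, y2 - y1)
    return max(candidates, key=lambda o: overlap(o["box"]), default=None)
```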
In an exemplary embodiment of the present disclosure, after the object information of each object in the target image, the text information corresponding to the voice information, and the gaze position corresponding to the sight line information have been obtained, it is also possible to first determine candidate objects among the objects according to the object information and the gaze position, and then determine the target object among the candidate objects according to the text information.
Specifically, FIG. 9 shows a schematic flowchart of a method for determining the target object; as shown in FIG. 9, the flow includes at least steps S910 to S930, detailed as follows. In step S910, the object positions matching the gaze position are determined as target object positions. In an exemplary embodiment of the present disclosure, the object position of each object in the target image is matched against the gaze position, and when an object's position matches the gaze position, that object position is determined as a target object position. Since the object positions of different objects may overlap, the gaze position may match multiple object positions, and there may therefore be multiple determined target object positions. In step S920, candidate objects are determined among the objects according to the target object positions. In an exemplary embodiment of the present disclosure, the objects corresponding to the target object positions are determined as candidate objects. In step S930, the object information of each candidate object is matched against the text information, and the target object is determined according to the matching result. In an exemplary embodiment of the present disclosure, the foregoing steps screen the multiple objects according to the gaze position to obtain candidate objects; the object information of the candidate objects is then matched against the text information to determine the target object among them. Specifically, first, the object categories of the candidate objects matching the entity participles in the text information are determined as target object categories; then, candidate objects are further selected according to the target object categories; finally, the target object is determined from these candidates according to the corresponding object topological relationships and the description participles in the text information. The object topological relationships may be determined from the object categories and positions of all objects, with the relationships corresponding to the candidates identified within them, or determined directly from the object categories and positions of the candidate objects. The detailed process of determining the target object from the object information and text information of the candidate objects is as described in the method embodiment of FIG. 7 above and is not repeated here.
In the instruction information acquisition method of the exemplary embodiments of the present disclosure, the feature information of three modalities, namely the sight line information, the voice information, and the feature information of the target image, is fused to determine the instruction information, which further improves the accuracy of the instruction information determined from the voice information and the feature information of the target image.
Continuing to refer to FIG. 6, in step S620, the instruction information is generated according to the object information of the target object. Specifically, FIG. 10 shows a schematic flowchart of another method for generating instruction information; as shown in FIG. 10, the flow includes at least steps S1010 and S1020, detailed as follows. In step S1010, user intent information is determined according to the text information. In an exemplary embodiment of the present disclosure, the user's intent information may be recognized from the text information: word segmentation is performed on the text information to obtain one or more keywords, and the user intent information corresponding to the keywords is determined according to a first preset mapping relationship. Specifically, the one or more keywords of the text information are matched against the keywords in the first preset mapping relationship, and the user intent information associated with the matching keywords in the mapping is obtained. The verb participles and/or adjective participles in the text information may also be matched against the first preset mapping relationship to improve the efficiency of obtaining user intent information. The first preset mapping relationship includes association relationships between keywords and user intent information; one keyword may correspond to multiple pieces of user intent information, and one piece of user intent information may correspond to multiple keywords. For example, for the keywords "buy", "want", "really like", the corresponding user intent information may be "obtain a purchase link"; for the keyword "what", it may be "query detail information, obtain a purchase link"; for the keywords "so dark", "hard to see", it may be "adjust the brightness of the image, adjust the contrast of the image"; and so on. In step S1020, the instruction information is generated according to the object information of the target object and the user intent information.

In an exemplary embodiment of the present disclosure, the sub-target image corresponding to the target object is obtained according to the object information of the target object, and the instruction information is generated according to the object category of the target object, the sub-target image of the target object, and the user intent information. In an exemplary embodiment of the present disclosure, an object acquisition path related to the object information of the target object is obtained and displayed according to the user intent information in the instruction information: the object acquisition path may be queried according to the object category and/or sub-target image of the target object and displayed on the display screen of the mobile terminal. For example, if the user intent information is to obtain a purchase link, the object category and/or sub-target image of the target object may be submitted to a purchase platform and the purchase link returned by the platform obtained. In addition, object detail information related to the object information of the target object may be obtained and displayed according to the user intent information in the instruction information, or both the object acquisition path and the object detail information may be obtained and displayed. With the instruction information acquisition method of this exemplary embodiment, in cases where the user cannot clearly express a demand through the voice information or the target image alone, the scheme fuses the feature information of the target image with the text information corresponding to the voice information to obtain the instruction information, and then recommends information of interest to the user according to the instruction information. This exemplary embodiment determines user instruction information more accurately, provides the user with more precise recommendation information, and improves the interaction experience between the user and the mobile terminal.
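The first preset mapping relationship is essentially a many-to-many keyword-to-intent table. A minimal sketch, populating it with the example entries from the text above (translated), might look like this.

```python
# keyword -> user intent information; entries taken from the examples above
FIRST_PRESET_MAPPING = {
    "buy":     ["obtain purchase link"],
    "want":    ["obtain purchase link"],
    "what":    ["query detail information", "obtain purchase link"],
    "so dark": ["adjust image brightness", "adjust image contrast"],
}

def intents_for(keywords):
    """Collect the intent information of every keyword found in the mapping."""
    intents = []
    for word in keywords:
        intents.extend(FIRST_PRESET_MAPPING.get(word, []))
    return intents
```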
In an exemplary embodiment of the present disclosure, when the feature information of the target image is the image parameter information of the target image, the user intent information is determined according to the text information, and parameter adjustment information is generated according to the user intent information and the image parameter information. In this case the instruction information may be the parameter adjustment information: the mobile terminal can adjust the parameters of the target image according to the parameter adjustment information and display the adjusted target image on the display screen. For example, if the image parameter information of the target image is "brightness value 65", the text information corresponding to the user's voice information is "the shot is so dark", and the user intent information recognized from the text information is "increase the brightness of the image", then the instruction information generated by the method of the foregoing embodiments may be "adjust the brightness of the target image, raising its brightness value to 65+N", where N is a positive integer whose value can be set according to the actual scenario; the present disclosure does not specifically limit this.
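A minimal sketch of this brightness case follows; the `step` argument stands in for the unspecified N and is an assumed tuning constant, as is the dictionary shape of the inputs and output.

```python
def parameter_adjustment(image_params, user_intent, step=20):
    """image_params: e.g. {"brightness": 65}. Returns the parameter
    adjustment instruction, or None if the intent is not a brightness request."""
    if user_intent == "increase image brightness":
        return {"action": "adjust_brightness",
                "target_value": image_params["brightness"] + step}  # 65 + N
    return None
```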
In an exemplary embodiment of the present disclosure, when the feature information of the target image includes both the object information of each object in the target image and the image parameter information of the target image, parameter adjustment information may be generated according to the text information and the image parameter information of the target image; the target object may be determined according to the object information of each object in the target image and the text information; and the instruction information may be generated according to the parameter adjustment information and the object information of the target object. The methods for generating the parameter adjustment information and for determining the target object have been described in detail in the foregoing embodiments and are not repeated here. The sub-target image corresponding to the target object may be obtained according to the object position of the target object, the parameters of that sub-target image adjusted according to the parameter adjustment information, and the parameter-adjusted target image or parameter-adjusted sub-target image displayed. In addition, according to the parameter-adjusted sub-target image of the target object, the object acquisition path and object detail information of the target object may be obtained and displayed.
In an exemplary embodiment of the present disclosure, when there are multiple target images, candidate instruction information corresponding to each target image may be determined according to the methods of the foregoing embodiments, and the instruction information then determined from the per-image candidate instruction information. Specifically, FIG. 11 shows a schematic flowchart of yet another method for generating instruction information; as shown in FIG. 11, the flow includes at least steps S1110 to S1130, detailed as follows. In step S1110, the candidate instruction information corresponding to each target image is determined according to that image's feature information and the text information. In an exemplary embodiment of the present disclosure, the multiple target images come from a captured target video; the feature information of each target image is obtained, and the voice information in the target video is obtained and its text information recognized. The multiple target images may correspond to one piece of voice information or to several, and the multiple candidate instruction information items are determined from each target image's feature information together with the text information of its associated voice information. In step S1120, when the candidate instruction information of all target images is the same, the candidate instruction information is configured as the instruction information. In an exemplary embodiment of the present disclosure, the per-image candidates are matched against one another; if they match completely, or the matching degree between every pair of candidates exceeds a matching threshold, any one of them may be configured as the instruction information. The matching threshold may be set according to the actual situation, for example 99% or 99.5%; the present disclosure does not specifically limit it. In step S1130, when the candidate instruction information differs across target images, the instruction information is determined according to the confidence level corresponding to each candidate.

In an exemplary embodiment of the present disclosure, the confidence level of a candidate may be the confidence level of the user intent information in that candidate, the confidence level of the target object, the product of the two, and so on; the present disclosure does not specifically limit this. The confidence level of the user intent information may be the matching degree between the keywords of the text information and the keywords of the first preset mapping relationship; the confidence level of the target object may be the confidence level of its object category or object position, or the matching degree between its object information and the text information. In addition, in the instruction information acquisition method provided by the exemplary embodiments of the present disclosure, the voice information, target image or target video, and sight line information may also be obtained through a smart assistant, which may be an application running on the mobile terminal. For convenience of operation, a quick-start function for the smart assistant may be preset; for example, when the mobile terminal's screen is off, the smart assistant can be entered by pressing the power button three times. Other shortcuts may also be used to enter the smart assistant; the present disclosure does not specifically limit this. The instruction information acquisition method of this embodiment can start the smart assistant on the mobile terminal through a shortcut, simplifying the otherwise tedious start-up steps and making the start-up of the smart assistant more intelligent, rapid, convenient, and accurate.
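Steps S1120 and S1130 reduce to a simple agree-or-arbitrate rule over the per-image candidates; the sketch below uses exact equality for the agreement test (rather than a matching-degree threshold) and falls back to the confidence levels, both assumed simplifications.

```python
def merge_candidates(candidates):
    """candidates: list of (instruction, confidence) pairs, one per target image."""
    instructions = [ins for ins, _ in candidates]
    # step S1120: if every per-image candidate agrees, any one of them is the result
    if all(ins == instructions[0] for ins in instructions):
        return instructions[0]
    # step S1130: otherwise the candidate with the highest confidence wins
    return max(candidates, key=lambda pair: pair[1])[0]
```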
The instruction information acquisition method of this exemplary embodiment is described in detail below with reference to a specific scenario. FIG. 12 shows a schematic flowchart of instruction information acquisition according to a specific embodiment of the present disclosure; as shown in FIG. 12: in step S1201, the target image is acquired, object extraction is performed on the target image, and the object information of each object is obtained, where the object information includes object category and object position; in step S1203, the object topological relationship is determined according to the object categories and object positions; in step S1205, the voice information associated with the target image is acquired and its corresponding text information determined, where the text information includes entity participles and description participles; in step S1207, word segmentation is performed on the text information to obtain one or more keywords, including entity participles and description participles; in step S1209, the object category matching an entity participle is determined as the target object category; in step S1211, candidate objects are determined among the objects according to the target object category; in step S1213, the target object is determined from the candidate objects according to the corresponding object topological relationships and the description participles; in step S1215, the user intent information is determined according to the text information; and in step S1217, the instruction information is generated according to the object information of the target object and the user intent information.
The instruction information acquisition method of this exemplary embodiment is described in detail below with reference to another specific scenario. FIG. 13 shows a schematic flowchart of instruction information acquisition according to a specific embodiment of the present disclosure; as shown in FIG. 13: in step S1301, the target image is acquired, object extraction is performed on the target image, and the object information of each object is obtained, where the object information includes object category and object position; in step S1303, the voice information associated with the target image is acquired and its corresponding text information determined, where the text information includes entity participles and description participles; in step S1305, word segmentation is performed on the text information to obtain one or more keywords, including entity participles and description participles; in step S1307, the sight line information associated with the target image is acquired and the corresponding gaze position determined; in step S1309, the object positions matching the gaze position are determined as target object positions; in step S1311, preliminary candidate objects are determined among the objects according to the target object positions; in step S1313, the topological relationships of these candidates are determined according to their object categories and positions; in step S1315, the object category matching an entity participle is determined as the target object category; in step S1317, candidate objects are further selected from them according to the target object category; in step S1319, the target object is determined from the candidate objects according to the corresponding object topological relationships and the description participles; in step S1321, the user intent information is determined according to the text information; and in step S1323, the instruction information is generated according to the object information of the target object and the user intent information.
The instruction information acquisition method of this exemplary embodiment is described in detail below with reference to yet another specific scenario. FIG. 14 shows a schematic flowchart of instruction information acquisition according to a specific embodiment of the present disclosure; as shown in FIG. 14: in step S1401, the target image is acquired, object extraction is performed on the target image, and the object information of each object is obtained, where the object information includes object category and object position; in step S1403, the object topological relationship is determined according to the object categories and object positions; in step S1405, the voice information associated with the target image is acquired and its corresponding text information determined, where the text information includes entity participles and description participles; in step S1407, word segmentation is performed on the text information to obtain one or more keywords, including entity participles and description participles; in step S1409, the sight line information associated with the target image is acquired and the corresponding gaze position determined; in step S1411, the object category matching an entity participle is determined as the target object category; in step S1413, candidate objects are determined among the objects according to the target object category; in step S1415, candidate target objects are determined from the candidate objects according to the corresponding object topological relationships and the description participles; in step S1417, the object position of each candidate target object is matched against the gaze position; in step S1419, when the object position of a candidate target object matches the gaze position, that candidate target object is determined as the target object; in step S1421, the user intent information is determined according to the text information; and in step S1423, the instruction information is generated according to the object information of the target object and the user intent information.
For example, with the target image shown in FIG. 15, the target image 1500 is recognized by an object detection algorithm and found to contain four objects, whose object categories are "flower", "flowerpot 1", "flowerpot 2", and "water dispenser". The text information corresponding to the user's voice information is "I want to buy the flowerpot on the left", from which the user intent information "obtain a purchase path" is recognized; the user's gaze position 1501 on the target image 1500 is also obtained. After the object categories and object positions of the objects in the target image 1500 are acquired, the object topological relationship determined from them is: "at the far left of the image is a potted flower", "below the flower is flowerpot 1", "at the far right of the image is the water dispenser", "to the left of the water dispenser is flowerpot 2".
First, word segmentation of the text information yields the noun participle "flowerpot" and the description participles "left" and "that". Then, the noun participle is matched against the object categories to determine the target object category "flowerpot", which yields the candidate objects "flowerpot 1" and "flowerpot 2"; matching the object topological relationships of the candidates against the description participles leaves "flowerpot 1" and "flowerpot 2" as candidate target objects. Next, the object positions of the candidate target objects "flowerpot 1" and "flowerpot 2" are matched against the gaze position, and the target object the user pays most attention to is determined to be "flowerpot 1". Finally, the instruction information "search for the same item as flowerpot 1" is generated, and the mobile terminal sends the sub-target image corresponding to "flowerpot 1" to the corresponding shopping website to obtain a shopping link for the same item as "flowerpot 1".
The apparatus embodiments of the present disclosure are described below; they may be used to perform the instruction information acquisition method of the present disclosure described above. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the instruction information acquisition method described above. FIG. 16 schematically shows a block diagram of an instruction information acquisition apparatus according to an embodiment of the present disclosure.
Referring to FIG. 16, an instruction information acquisition apparatus 1600 according to an embodiment of the present disclosure includes an image information extraction module 1601, a text information acquisition module 1602, and an instruction information generation module 1603. Specifically, the image information extraction module 1601 is configured to acquire a target image and perform information extraction on the target image to obtain the feature information of the target image; the text information acquisition module 1602 is configured to acquire voice information and recognize the text information corresponding to the voice information, where the voice information is information associated with the target image; and the instruction information generation module 1603 is configured to generate instruction information according to the text information and the feature information of the target image.
在本公开的示例性实施例中,图像信息提取模块1601,还可以用于对目标图像进行对象提取,获取目标图像中各对象的对象信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于分别将各对象信息与文本信息进行匹配,根据匹配结果确定目标对象;根据目标对象的对象信息生成指令信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据对象类别和对象位置确定对象拓扑关系;将与实体分词匹配的对象类别确定为目标对象类别;根据目标对象类别在各对象中确定候选对象;根据候选对象对应的对象拓扑关系以及描述分词,从候选对象中确定目标对象。其中,对象信息包括对象类别和对象位置,文本信息包括实体分词以及描述分词。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于获取视线信息并确定视线信息对应的注视位置;根据候选对象对应的对象拓扑关系以及描述分词,从候选对象中确定候选目标对象;将候选目标对象的对象位置与注视位置进行匹配;在候选目标对象的对象位置与注视位置相匹配时,将候选目标对象确定为目标对象。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于将与注视位置匹配的对象位置确定为目标对象位置;根据目标对象位置在各对象中确定备选对象;将各备选对象的对象信息与文本信息进行匹配,根据匹配结果确定目标对象。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据文本信息确定用户意图信息;并根据目标对象的对象信息以及用户意图信息生成指令信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于对文本信息进行分词处理,以获得一个或多个关键词;根据第一预设映射关系确定关键词对应的用户意图信息,第一预设映射关系包括关键词与用户意图信息的关联关系。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于获取各对象的对象位置、各对象的第一预测类别,以及第一预测类别对应的第一置信度;根据对象位置获取各对象的特征向量,并根据第二预设映射关系确定各对象的第二预测类别以及第二预测类别对应的第二置信度;根据第一预测类别和第二预测类别,以及第一预测类别对应的第一置信度和第二预测类别对应的第二置信度确定各对象的对象类别;其中,第二预设映射关系包括特征向量与第二预测类别的关联关系。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于根据对象位置对目标图像进行裁剪,以得到与各对象对应的子目标图像;对子目标图像进行特征提取,以得到各对象的特征向量。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于判断第一预测类别与第二预测类别是否相同;在第一预测类别与第二预测类别相同时,将第一预测类别配置为各对象的对象类别;在第一预测类别与第二预测类别不同时,判断第一置信度是否大于第二置信度,根据判断结果确定各对象的对象类别。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于在第一置信度大于第二置信度时,将第一预测类别配置为各对象的对象类别;在第一置信度小于等于第二置信度时,将第二预测类别配置为各对象的对象类别。在本公开的示例性实施例中,图像信息提取模块1601,还可以用于对目标图像进行信息提取,以得到目标图像的图像参数信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据文本信息确定用户意图信息;并根据用户意图信息和图像参数信息生成参数调整信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于根据文本信息和目标图像的图像参数信息生成参数调整信息;以及根据目标图像中各对象的对象信息和文本信息确定目标对象;根据参数调整信息和目标对象的对象信息生成指令信息。在本公开的示例性实施例中,指令信息生成模块1603,还可以用于分别根据各目标图像的特征信息与文本信息确定各目标图像对应的备选指令信息;在各目标图像对应的备选指令信息相同时,将备选指令信息配置为指令信息;在各目标图像对应的备选指令信息不同时,根据各备选指令信息对应的置信度确定指令信息。其 中,目标图像包括多个。在本公开的示例性实施例中,指令信息获取装置还可以包括信息显示模块(图中未示出),该信息显示模块用于根据指令信息中的用户意图信息获取并显示与目标对象的对象信息相关的对象获取路径;和/或根据指令信息中的用户意图信息获取并显示与目标对象的对象信息相关的对象详情信息。在本公开的示例性实施例中,信息显示模块还可以用于根据参数调整信息对目标图像进行参数调整,并显示参数调整后的目标图像。In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may also be configured to perform object extraction on the target image to obtain object information of each object in the target image. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may also be configured to match each object information with text information respectively, determine the target object according to the matching result, and generate instruction information according to the object information of the target object. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 can also be used to determine the object topological relationship according to the object category and the object position; determine the object category that matches the entity word segmentation as the target object category; A candidate object is determined from each object; a target object is determined from the candidate objects according to the object topological relationship corresponding to the candidate object and the description word segmentation. The object information includes object category and object location, and the text information includes entity word segmentation and description word segmentation. 
In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: acquire line-of-sight information and determine a gaze position corresponding to the line-of-sight information; determine a candidate target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment; match the object position of the candidate target object with the gaze position; and, when the object position of the candidate target object matches the gaze position, determine the candidate target object as the target object. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: determine the object position matching the gaze position as a target object position; determine alternative objects among the objects according to the target object position; and match the object information of each alternative object with the text information, determining the target object according to the matching result. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to determine user intent information according to the text information, and to generate the instruction information according to the object information of the target object and the user intent information. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: perform word segmentation on the text information to obtain one or more keywords; and determine the user intent information corresponding to the keywords according to a first preset mapping relationship, the first preset mapping relationship including associations between keywords and user intent information. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to: acquire the object position of each object, a first predicted category of each object, and a first confidence corresponding to the first predicted category; acquire a feature vector of each object according to the object position, and determine a second predicted category of each object and a second confidence corresponding to the second predicted category according to a second preset mapping relationship; and determine the object category of each object according to the first predicted category and the second predicted category, as well as the first confidence corresponding to the first predicted category and the second confidence corresponding to the second predicted category; the second preset mapping relationship includes associations between feature vectors and second predicted categories. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to crop the target image according to the object positions to obtain sub-target images corresponding to the objects, and to perform feature extraction on the sub-target images to obtain the feature vectors of the objects.
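The keyword-to-intent lookup described above can be illustrated as follows; the segmentation rule and the contents of the first preset mapping relationship are invented for the example and stand in for whatever segmenter and mapping a concrete implementation would use.

```python
# Illustrative only: word segmentation plus a "first preset mapping
# relationship" from keywords to user intent information. The mapping
# entries here are made up for the sketch.

FIRST_PRESET_MAPPING = {
    "buy": "intent:purchase",
    "price": "intent:purchase",
    "what": "intent:query_details",
    "brighten": "intent:adjust_brightness",
}

def determine_user_intent(text_info):
    # A real system would use a proper word segmenter; a whitespace
    # split stands in for word segmentation here.
    keywords = text_info.lower().split()
    for kw in keywords:
        if kw in FIRST_PRESET_MAPPING:
            return FIRST_PRESET_MAPPING[kw]
    return "intent:unknown"

print(determine_user_intent("what is this plant"))  # intent:query_details
```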
In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to: determine whether the first predicted category is the same as the second predicted category; when the first predicted category is the same as the second predicted category, configure the first predicted category as the object category of the object; and when the first predicted category differs from the second predicted category, determine whether the first confidence is greater than the second confidence and determine the object category of the object according to the determination result. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to configure the first predicted category as the object category of the object when the first confidence is greater than the second confidence, and to configure the second predicted category as the object category of the object when the first confidence is less than or equal to the second confidence. In an exemplary embodiment of the present disclosure, the image information extraction module 1601 may further be configured to perform information extraction on the target image to obtain image parameter information of the target image. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to determine user intent information according to the text information, and to generate parameter adjustment information according to the user intent information and the image parameter information. In an exemplary embodiment of the present disclosure, the instruction information generation module 1603 may further be configured to: generate parameter adjustment information according to the text information and the image parameter information of the target image; determine the target object according to the object information of each object in the target image and the text information; and generate the instruction information according to the parameter adjustment information and the object information of the target object. In an exemplary embodiment of the present disclosure, where there are a plurality of target images, the instruction information generation module 1603 may further be configured to: determine candidate instruction information corresponding to each target image according to the feature information of that target image and the text information; when the candidate instruction information corresponding to the target images is the same, configure the candidate instruction information as the instruction information; and when the candidate instruction information corresponding to the target images differs, determine the instruction information according to the confidences corresponding to the pieces of candidate instruction information. In an exemplary embodiment of the present disclosure, the instruction information acquisition apparatus may further include an information display module (not shown in the figure), configured to acquire and display, according to the user intent information in the instruction information, an object acquisition path related to the object information of the target object, and/or to acquire and display, according to the user intent information in the instruction information, object detail information related to the object information of the target object.
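The two-branch category decision above (a first predicted category from detection, a second from the feature-vector mapping) reduces to a simple comparison. The sketch below assumes plain strings for categories and floats for confidences; it is one reading of the rule, not the disclosed implementation.

```python
# Hypothetical sketch of the decision rule: keep the category when both
# branches agree; otherwise keep the branch with the higher confidence,
# with the second branch winning on a tie (first_conf <= second_conf).

def fuse_category(first_cat, first_conf, second_cat, second_conf):
    if first_cat == second_cat:
        return first_cat
    return first_cat if first_conf > second_conf else second_cat

assert fuse_category("cup", 0.9, "cup", 0.4) == "cup"   # agreement
assert fuse_category("cup", 0.7, "mug", 0.8) == "mug"   # second wins
assert fuse_category("cup", 0.6, "mug", 0.6) == "mug"   # tie -> second
```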
In an exemplary embodiment of the present disclosure, the information display module may further be configured to perform parameter adjustment on the target image according to the parameter adjustment information, and to display the parameter-adjusted target image.
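Where a plurality of target images are processed, the arbitration described above (identical candidate instruction information passes through; otherwise the highest-confidence candidate wins) might look like the following; the (instruction, confidence) tuple structure is an assumption made for the sketch.

```python
# Illustrative arbitration over per-image candidate instruction
# information, each candidate carried here as (instruction, confidence).

def arbitrate(candidates):
    instructions = {inst for inst, _ in candidates}
    if len(instructions) == 1:
        # All target images produced the same candidate instruction.
        return candidates[0][0]
    # Otherwise choose the candidate with the highest confidence.
    return max(candidates, key=lambda c: c[1])[0]

print(arbitrate([("open:map", 0.8), ("open:map", 0.9)]))     # open:map
print(arbitrate([("open:map", 0.6), ("zoom:photo", 0.75)]))  # zoom:photo
```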
The specific details of each module in the above instruction information acquisition apparatus have been described in detail in the embodiments of the instruction information acquisition method; for details not disclosed here, reference may be made to those embodiments, and they are therefore not repeated.
Exemplary embodiments of the present disclosure further provide a computer-readable storage medium storing a program product capable of implementing the method described above in this specification. In some possible implementations, aspects of the present disclosure may also be implemented in the form of a program product including program code; when the program product runs on a mobile terminal, the program code causes the mobile terminal to perform the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section above, for example, any one or more of the steps in FIG. 3 to FIG. 14.
Exemplary embodiments of the present disclosure further provide a program product for implementing the above method, which may take the form of a portable compact disc read-only memory (CD-ROM) including program code, and which may run on a terminal such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
Program code contained on a readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
Program code for carrying out the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet by using an Internet service provider).
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations thereof that follow its general principles and include common general knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the claims.

Claims (21)

  1. An instruction information acquisition method, comprising:
    acquiring a target image and performing information extraction on the target image to obtain feature information of the target image;
    acquiring voice information and recognizing text information corresponding to the voice information, wherein the voice information is information associated with the target image; and
    generating instruction information according to the text information and the feature information of the target image.
  2. The instruction information acquisition method according to claim 1, wherein performing information extraction on the target image to obtain the feature information of the target image comprises:
    performing object extraction on the target image to acquire object information of each object in the target image.
  3. The instruction information acquisition method according to claim 2, wherein generating the instruction information according to the text information and the feature information of the target image comprises:
    matching each piece of the object information with the text information, and determining a target object according to the matching result; and
    generating the instruction information according to the object information of the target object.
  4. The instruction information acquisition method according to claim 3, wherein the object information comprises an object category and an object position, and the text information comprises an entity segment and a description segment; and
    matching each piece of the object information with the text information and determining the target object according to the matching result comprises:
    determining an object topological relationship according to the object categories and the object positions;
    determining the object category matching the entity segment as a target object category;
    determining candidate objects among the objects according to the target object category; and
    determining the target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment.
  5. The instruction information acquisition method according to claim 4, wherein determining the target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment comprises:
    acquiring line-of-sight information and determining a gaze position corresponding to the line-of-sight information;
    determining a candidate target object from the candidate objects according to the object topological relationships corresponding to the candidate objects and the description segment;
    matching the object position of the candidate target object with the gaze position; and
    when the object position of the candidate target object matches the gaze position, determining the candidate target object as the target object.
  6. The instruction information acquisition method according to claim 3, wherein matching each piece of the object information with the text information and determining the target object according to the matching result comprises:
    determining the object position matching a gaze position as a target object position;
    determining alternative objects among the objects according to the target object position; and
    matching the object information of each alternative object with the text information, and determining the target object according to the matching result.
  7. The instruction information acquisition method according to claim 3, wherein generating the instruction information according to the object information of the target object comprises:
    determining user intent information according to the text information; and
    generating the instruction information according to the object information of the target object and the user intent information.
  8. The instruction information acquisition method according to claim 7, wherein determining the user intent information according to the text information comprises:
    performing word segmentation on the text information to obtain one or more keywords; and
    determining the user intent information corresponding to the keywords according to a first preset mapping relationship, wherein the first preset mapping relationship comprises associations between the keywords and the user intent information.
  9. The instruction information acquisition method according to claim 7, further comprising:
    acquiring and displaying, according to the user intent information in the instruction information, an object acquisition path related to the object information of the target object; and/or
    acquiring and displaying, according to the user intent information in the instruction information, object detail information related to the object information of the target object.
  10. The instruction information acquisition method according to claim 2, wherein acquiring the object information of each object in the target image comprises:
    acquiring the object position of each object, a first predicted category of each object, and a first confidence corresponding to the first predicted category;
    acquiring a feature vector of each object according to the object position, and determining a second predicted category of each object and a second confidence corresponding to the second predicted category according to a second preset mapping relationship; and
    determining the object category of each object according to the first predicted category and the second predicted category, as well as the first confidence corresponding to the first predicted category and the second confidence corresponding to the second predicted category;
    wherein the second preset mapping relationship comprises associations between the feature vectors and the second predicted categories.
  11. The instruction information acquisition method according to claim 10, wherein acquiring the feature vector of each object according to the object position comprises:
    cropping the target image according to the object positions to obtain sub-target images corresponding to the objects; and
    performing feature extraction on the sub-target images to obtain the feature vectors of the objects.
  12. The instruction information acquisition method according to claim 11, wherein determining the object category of each object according to the first predicted category and the second predicted category, as well as the first confidence corresponding to the first predicted category and the second confidence corresponding to the second predicted category, comprises:
    determining whether the first predicted category is the same as the second predicted category;
    when the first predicted category is the same as the second predicted category, configuring the first predicted category or the second predicted category as the object category of the object; and
    when the first predicted category is different from the second predicted category, determining whether the first confidence is greater than the second confidence, and determining the object category of the object according to the determination result.
  13. The instruction information acquisition method according to claim 12, wherein determining the object category of each object according to the determination result comprises:
    when the first confidence is greater than the second confidence, configuring the first predicted category as the object category of the object; and
    when the first confidence is less than or equal to the second confidence, configuring the second predicted category as the object category of the object.
  14. The instruction information acquisition method according to claim 1, wherein performing feature information extraction on the target image comprises:
    performing information extraction on the target image to obtain image parameter information of the target image.
  15. The instruction information acquisition method according to claim 14, wherein generating the instruction information according to the text information and the feature information of the target image comprises:
    determining user intent information according to the text information; and
    generating parameter adjustment information according to the user intent information and the image parameter information.
  16. The instruction information acquisition method according to claim 15, further comprising:
    performing parameter adjustment on the target image according to the parameter adjustment information, and displaying the parameter-adjusted target image.
  17. The instruction information acquisition method according to claim 1, wherein generating the instruction information according to the text information and the feature information of the target image comprises:
    generating parameter adjustment information according to the text information and image parameter information of the target image;
    determining a target object according to object information of each object in the target image and the text information; and
    generating the instruction information according to the parameter adjustment information and the object information of the target object.
  18. The instruction information acquisition method according to claim 1, wherein there are a plurality of the target images; and
    generating the instruction information according to the text information and the feature information of the target images comprises:
    determining candidate instruction information corresponding to each of the target images according to the feature information of that target image and the text information;
    when the candidate instruction information corresponding to the target images is the same, configuring the candidate instruction information as the instruction information; and
    when the candidate instruction information corresponding to the target images is different, determining the instruction information according to the confidences corresponding to the pieces of candidate instruction information.
  19. An instruction information acquisition apparatus, comprising:
    an image information extraction module, configured to acquire a target image and perform information extraction on the target image to obtain feature information of the target image;
    a text information acquisition module, configured to acquire voice information and recognize text information corresponding to the voice information, wherein the voice information is information associated with the target image; and
    an instruction information generation module, configured to generate instruction information according to the text information and the feature information of the target image.
  20. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the instruction information acquisition method according to any one of claims 1 to 18.
  21. An electronic device, comprising:
    one or more processors; and
    a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the instruction information acquisition method according to any one of claims 1 to 18.
PCT/CN2022/077138 2021-03-18 2022-02-21 Instruction information acquisition method and apparatus, readable storage medium, and electronic device WO2022193911A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110292701.7A CN113031813A (en) 2021-03-18 2021-03-18 Instruction information acquisition method and device, readable storage medium and electronic equipment
CN202110292701.7 2021-03-18

Publications (1)

Publication Number Publication Date
WO2022193911A1

Family

ID=76471590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077138 WO2022193911A1 (en) 2021-03-18 2022-02-21 Instruction information acquisition method and apparatus, readable storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN113031813A (en)
WO (1) WO2022193911A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113031813A (en) * 2021-03-18 2021-06-25 Oppo广东移动通信有限公司 Instruction information acquisition method and device, readable storage medium and electronic equipment
CN115482807A (en) * 2022-08-11 2022-12-16 天津大学 Detection method and system for voice interaction of intelligent terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389005A (en) * 2017-08-05 2019-02-26 富泰华工业(深圳)有限公司 Intelligent robot and man-machine interaction method
KR20190056174A (en) * 2017-11-16 2019-05-24 서울시립대학교 산학협력단 Robot system and control method thereof
CN110489746A (en) * 2019-07-31 2019-11-22 深圳市优必选科技股份有限公司 A kind of information extracting method, information extracting device and intelligent terminal
CN110730115A (en) * 2019-09-11 2020-01-24 北京小米移动软件有限公司 Voice control method and device, terminal and storage medium
CN111400523A (en) * 2018-12-14 2020-07-10 北京三星通信技术研究有限公司 Image positioning method, device, equipment and storage medium based on interactive input
CN113031813A (en) * 2021-03-18 2021-06-25 Oppo广东移动通信有限公司 Instruction information acquisition method and device, readable storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113031813A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
WO2017185630A1 (en) Emotion recognition-based information recommendation method and apparatus, and electronic device
WO2022193911A1 (en) Instruction information acquisition method and apparatus, readable storage medium, and electronic device
CN109189879B (en) Electronic book display method and device
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
WO2020019591A1 (en) Method and device used for generating information
WO2016187888A1 (en) Keyword notification method and device based on character recognition, and computer program product
JP6986187B2 (en) Person identification methods, devices, electronic devices, storage media, and programs
CN110740389B (en) Video positioning method, video positioning device, computer readable medium and electronic equipment
WO2022041830A1 (en) Pedestrian re-identification method and device
US20170337222A1 (en) Image searching method and apparatus, an apparatus and non-volatile computer storage medium
CN111491187B (en) Video recommendation method, device, equipment and storage medium
CN111209970A (en) Video classification method and device, storage medium and server
CN113515942A (en) Text processing method and device, computer equipment and storage medium
KR20200109239A (en) Image processing method, device, server and storage medium
CN109582825B (en) Method and apparatus for generating information
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN110516083B (en) Album management method, storage medium and electronic device
WO2022001600A1 (en) Information analysis method, apparatus, and device, and storage medium
CN113392687A (en) Video title generation method and device, computer equipment and storage medium
CN111984803B (en) Multimedia resource processing method and device, computer equipment and storage medium
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium
CN110837557B (en) Abstract generation method, device, equipment and medium
US20140136196A1 (en) System and method for posting message by audio signal
US20180039626A1 (en) System and method for tagging multimedia content elements based on facial representations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22770266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22770266

Country of ref document: EP

Kind code of ref document: A1