CN116301337A - Display device and visual question-answering method - Google Patents

Display device and visual question-answering method

Info

Publication number
CN116301337A
Authority
CN
China
Prior art keywords
question
scene
category
image
answer
Prior art date
Legal status
Pending
Application number
CN202310094906.3A
Other languages
Chinese (zh)
Inventor
柳杰
Current Assignee
Hisense Electronic Technology Wuhan Co ltd
Original Assignee
Hisense Electronic Technology Wuhan Co ltd
Priority date
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co ltd filed Critical Hisense Electronic Technology Wuhan Co ltd
Priority to CN202310094906.3A priority Critical patent/CN116301337A/en
Publication of CN116301337A publication Critical patent/CN116301337A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/84 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a display device and a visual question-answering method, wherein the display device comprises: a display; a controller configured to: receiving a visual question-answering instruction, wherein the visual question-answering instruction comprises a screen capturing image and a user question; responding to the visual question-answering instruction, predicting a question category of the user question, and predicting an image scene label of the screen capturing image; determining a question-answer scene according to the matching result of the image scene tag and the question category; and generating a question-answering result corresponding to the visual question-answering instruction according to the answer strategy corresponding to the question-answering scene. The method and the device improve visual question-answering experience.

Description

Display device and visual question-answering method
Technical Field
The application relates to the technical field of visual question answering, and in particular to a display device and a visual question-answering method.
Background
Visual Question Answering (VQA) is a learning task involving computer vision and natural language processing: the input may include a picture and a natural-language question, and the output may include a natural-language answer. In the related art, visual question-answering methods parse the input picture and natural-language question to obtain image features and text features, fuse the image features and text features to obtain fusion features, and generate a natural-language answer based on the fusion features. However, such methods do not consider the relevance between the image features and the text features, and the accuracy of the resulting question-answering result is poor.
Disclosure of Invention
To solve the above technical problems, the present application provides a display device and a visual question-answering method.
In a first aspect, the present application provides a display device comprising:
a display;
a controller coupled to the display, the controller configured to:
receiving a visual question-answering instruction, wherein the visual question-answering instruction comprises a screen capturing image and a user question;
responding to the visual question-answering instruction, predicting a question category of the user question, and predicting an image scene label of the screen capturing image;
determining a question-answer scene according to the matching result of the image scene tag and the question category;
and generating a question-answering result corresponding to the visual question-answering instruction according to the answer strategy corresponding to the question-answering scene.
In some embodiments, the determining the question-answer scene according to the matching result of the image scene tag and the question category includes:
responding to the matching of the question category and the image scene label, and determining a question-answer scene according to the question category;
and responding to the mismatching of the question category and the image scene label, and determining the question-answering scene as a user-defined scene.
In some embodiments, the determining a question-answer scenario according to the question category includes:
determining the question-answer scene as a film and television recommendation scene according to the question category being an entity category;
and determining the question-answering scene as a preset algorithm scene corresponding to the question category according to the fact that the question category is not the entity category.
In some embodiments, the generating the question-answering result corresponding to the visual question-answering instruction according to the answer policy corresponding to the question-answering scene includes:
acquiring introduction information of a target entity and film and television recommendation information associated with the target entity according to the question and answer scene as a film and television recommendation type scene, and generating a question and answer result containing the introduction information and the film and television recommendation information, wherein the target entity is determined according to the entity type and the entity identified in the screen capturing image, and the image scene label comprises a label corresponding to the target entity;
and calling an algorithm interface corresponding to the question category according to the question-answering scene as a preset algorithm category scene to acquire the prediction information of the entity corresponding to the keyword in the user question, and generating a question-answering result containing the prediction information.
In some embodiments, the predicting the image scene tag of the screenshot image comprises:
detecting an entity in the screen capturing image, and generating a first label according to the detected entity;
generating description information of the screen capturing image, and generating a second label according to a word segmentation result of the description information;
respectively fusing the labels belonging to the same scene in the first label and the second label to obtain a fused label;
and fusing the labels belonging to the same scene in the fused labels to obtain the image scene label.
In some embodiments, fusing the labels belonging to the same scene in the fused labels to obtain an image scene label includes:
and carrying out weighted superposition on the prediction probabilities of the labels belonging to the same scene in the fusion labels, and screening out image scene labels from the fusion labels according to the weighted superposed prediction probabilities.
In some embodiments, the predicting the question category of the user question includes:
acquiring a prediction category with the highest probability of the user question through a naive Bayes classifier;
responding to the predicted category as an entity category, and determining the question category of the user question as the entity category;
and in response to the prediction category not being an entity category, determining that the question category of the user question comprises the prediction category and a target category, wherein the target category is the entity category with the highest probability for the user question.
In some embodiments, the determining the question-answer scene according to the matching result of the image scene tag and the question category includes:
matching the entity category in the question category with the image scene tag;
determining a question-answer scene according to the prediction category in response to the entity category being matched with the image scene tag;
and determining the question-answer scene as a user-defined scene in response to the fact that the entity category is not matched with the image scene label.
In a second aspect, the present application provides a visual question-answering method, the method comprising:
receiving a visual question-answering instruction, wherein the visual question-answering instruction comprises a screen capturing image and a user question;
responding to the visual question-answering instruction, predicting a question category of the user question, and predicting an image scene label of the screen capturing image;
determining a question-answer scene according to the matching result of the image scene tag and the question category;
and generating a question-answering result corresponding to the visual question-answering instruction according to the answer strategy corresponding to the question-answering scene.
In some embodiments, the determining the question-answer scene according to the matching result of the image scene tag and the question category includes:
responding to the matching of the question category and the image scene label, and determining a question-answer scene according to the question category;
and responding to the mismatching of the question category and the image scene label, and determining the question-answering scene as a user-defined scene.
The display device and the visual question-answering method have the beneficial effects that:
according to the method and the device for determining the answer strategy, after the question category of the user question and the image scene label of the screen capturing image are predicted, the answer scene is determined according to the matching result by matching the image scene label with the question category, the answer strategy is determined based on the answer scene, the answer result is further generated, the association degree of the user question and the screen capturing image is represented by the matching result of the image scene label and the question category, the answer strategy is determined according to the association degree of the user question and the screen capturing image, the accuracy of the answer result is improved, and the visual answer experience is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those of ordinary skill in the art may derive other drawings from them.
A schematic diagram of an operational scenario between a display device and a control apparatus according to some embodiments is schematically shown in fig. 1;
a hardware configuration block diagram of the control apparatus 100 according to some embodiments is exemplarily shown in fig. 2;
a hardware configuration block diagram of a display device 200 according to some embodiments is exemplarily shown in fig. 3;
a software configuration schematic of a display device 200 according to some embodiments is schematically shown in fig. 4;
a flow diagram of a visual question-answering method according to some embodiments is schematically shown in fig. 5;
a software architecture diagram of a visual question-answering method according to some embodiments is schematically shown in fig. 6;
a data processing flow diagram of a question classification module according to some embodiments is illustrated in fig. 7;
a data processing flow diagram of an image scene tag generation module according to some embodiments is illustrated in fig. 8;
a data processing flow diagram of the YOLOv7 network model according to some embodiments is schematically shown in fig. 9;
a flow diagram of an image description generation method according to some embodiments is schematically shown in fig. 10;
a data processing flow diagram of a question and answer result generation module according to some embodiments is schematically shown in fig. 11;
A flow diagram of a visual question-answering method based on the VQA model according to some embodiments is illustrated in fig. 12.
Detailed Description
To facilitate understanding of the technical solutions of the present application, some concepts related to the present application are described first.
For purposes of clarity and implementation of the present application, exemplary implementations of the present application are described below clearly and completely with reference to the accompanying drawings in which those exemplary implementations are illustrated. It is apparent that the described exemplary implementations are only some, not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," second, "" third and the like in the description and in the claims and in the above-described figures are used for distinguishing between similar or similar objects or entities and not necessarily for limiting a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the display device includes infrared protocol communication or bluetooth protocol communication, and other short-range communication modes, and the display device 200 is controlled by a wireless or wired mode. The user may control the display device 200 by inputting user instructions through keys on a remote control, voice input, control panel input, etc.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device 200 may also be controlled in manners other than by the control apparatus 100 and the smart device 300. For example, the user's voice command may be received directly through a module configured inside the display device 200 for acquiring voice commands, or may be received through a voice control device configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be allowed to establish communication connections via a local area network (LAN), a wireless local area network (WLAN), and other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction of a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and function as an interaction between the user and the display device 200.
Fig. 3 shows a hardware configuration block diagram of the display device 200 in accordance with an exemplary embodiment.
In some embodiments, display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, memory, a power supply, a user interface.
In some embodiments, the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to nth interfaces for input/output.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is used for receiving image signals output by the controller and displaying video content, image content, a menu manipulation interface, and a UI interface operated by the user.
In some embodiments, the display 260 may be a liquid crystal display, an OLED display, or a projection device with a projection screen.
In some embodiments, communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
In some embodiments, the user interface may be configured to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
In some embodiments, the external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, or the like. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
In some embodiments, the modem 210 receives broadcast television signals via wired or wireless reception and demodulates audio-video signals and EPG data signals from among a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink, an icon, or other operable control. The operations related to the selected object are: displaying an operation of connecting to a hyperlink page, a document, an image, or the like, or executing an operation of a program corresponding to the icon.
In some embodiments, the controller includes at least one of a central processing unit (CPU), a video processor, an audio processor, a graphics processing unit (GPU), RAM (Random Access Memory), ROM (Read-Only Memory), first to nth interfaces for input/output, a communication bus (Bus), and the like.
The CPU processor is used for executing operating system and application program instructions stored in the memory and, according to various interactive instructions received from the outside, executing various application programs, data, and contents, so as to finally display and play various audio and video contents. The CPU processor may include a plurality of processors, such as one main processor and one or more sub-processors.
In some embodiments, a graphics processor is used to generate various graphical objects, such as: icons, operation menus, user input instruction display graphics, and the like. The graphic processor comprises an arithmetic unit, which is used for receiving various interactive instructions input by a user to operate and displaying various objects according to display attributes; the device also comprises a renderer for rendering various objects obtained based on the arithmetic unit, wherein the rendered objects are used for being displayed on a display.
In some embodiments, the video processor is configured to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image composition according to the standard codec protocol of the input signal, to obtain a signal that can be directly displayed or played on the display device 200.
In some embodiments, the video processor includes a demultiplexing module, a video decoding module, an image synthesis module, a frame rate conversion module, a display formatting module, and the like. The demultiplexing module is used for demultiplexing the input audio and video data stream. And the video decoding module is used for processing the demultiplexed video signal, including decoding, scaling and the like. And an image synthesis module, such as an image synthesizer, for performing superposition mixing processing on the graphic generator and the video image after the scaling processing according to the GUI signal input by the user or generated by the graphic generator, so as to generate an image signal for display. And the frame rate conversion module is used for converting the frame rate of the input video. And the display formatting module is used for converting the received frame rate into a video output signal and changing the video output signal to be in accordance with a display format, such as outputting RGB data signals.
In some embodiments, the audio processor is configured to receive an external audio signal, decompress and decode the audio signal according to a standard codec protocol of an input signal, and perform noise reduction, digital-to-analog conversion, and amplification processing to obtain a sound signal that can be played in a speaker.
In some embodiments, a user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
In some embodiments, a "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a form acceptable to the user. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), abbreviated as UI, which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
Referring to FIG. 4, in some embodiments, the system is divided into four layers, from top to bottom: an application layer (referred to as the "application layer"), an application framework layer (referred to as the "framework layer"), an Android runtime and system library layer (referred to as the "system runtime layer"), and a kernel layer.
In some embodiments, at least one application program is running in the application program layer, and these application programs may be a Window (Window) program of an operating system, a system setting program, a clock program, or the like; or may be an application developed by a third party developer. In particular implementations, the application packages in the application layer are not limited to the above examples.
The framework layer provides an application programming interface (application programming interface, API) and programming framework for the application. The application framework layer includes a number of predefined functions. The application framework layer corresponds to a processing center that decides to let the applications in the application layer act. Through the API interface, the application program can access the resources in the system and acquire the services of the system in the execution.
As shown in fig. 4, the application framework layer in the embodiment of the present application includes a manager (Manager), a Content Provider (Content Provider), and the like, where the manager includes at least one of the following modules: an Activity Manager (Activity Manager) is used to interact with all activities running in the system; a Location Manager (Location Manager) is used to provide system services or applications with access to the system location services; a Package Manager (Package Manager) is used to retrieve various information about the application packages currently installed on the device; a Notification Manager (Notification Manager) is used to control the display and clearing of notification messages; a Window Manager (Window Manager) is used to manage icons, windows, toolbars, wallpaper, and desktop widgets on the user interface.
In some embodiments, the activity manager is used to manage the lifecycle of the individual applications as well as common navigation and rollback functions, such as controlling the exit, opening, and back operations of applications. The window manager is used to manage all window programs, for example obtaining the size of the display screen, judging whether a status bar exists, locking the screen, capturing the screen, and controlling changes of the display window (for example, shrinking the display window, dithering display, distorting display, etc.).
In some embodiments, the system runtime layer provides support for the upper framework layer: when the framework layer is in use, the Android operating system runs the C/C++ libraries contained in the system runtime layer to implement the functions required by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer contains at least one of the following drivers: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), and power supply drive, etc.
The hardware or software architecture in some embodiments may be based on the description in the foregoing embodiments, and in some embodiments may be based on other similar hardware or software architectures, which may implement the technical solutions of the present application.
In some embodiments, the user may make a visual question and answer on the display device. The user can input a visual question-answering instruction on the display device, and the display device processes the screen capturing image corresponding to the visual question-answering instruction and the user question sentence to obtain and display a question-answering result.
To improve the accuracy of visual question answering, an embodiment of the present application provides a visual question-answering method. When the relevance between the screen capturing image and the user question is high, the question-answering result is generated based on the answer strategy determined by the question-answer scene; when the relevance is low, the question-answering result is generated based on a preset answer strategy. Diversified answer strategies are thereby provided for different degrees of relevance between the screen capturing image and the user question, improving answer accuracy.
Referring to fig. 5, a visual question-answering method according to an embodiment of the present application, as shown in fig. 5, may include the following steps:
step S501: and receiving a visual question-answering instruction, wherein the visual question-answering instruction comprises a screen capturing image and a user question.
In some embodiments, the user may enter a user question after taking a screenshot on the display device, and the display device generates a visual question-answering instruction from the screen capturing image and the user question. For example, on a media asset playing page of the display device, the user may take a screenshot and then enter the user question: "Who is this actor?"
In some embodiments, after receiving the visual question-answering instruction, the display device may acquire a screen capturing image of the display device and a user question, and start a visual question-answering processing procedure.
Step S502: and responding to the visual question-answering instruction, predicting a question category of the user question, and predicting an image scene label of the screen capturing image.
In some embodiments, the display device, upon receiving the visual question and answer instruction, may analyze the user question and the screen capture image, respectively.
The analysis of the user question may include predicting the question category of the user question, where the question category may be one of a plurality of preset categories such as an animal category, a plant category, a face recognition category, a movie question-and-answer category, and a calorie category. The question category may be obtained based on a keyword in the user question and determined according to a preset correspondence between keywords and question categories.
For example, for the user question "Who is this actor?", the question category can be derived as the face recognition category based on the keyword "actor"; for the user question "What is this plant?", the question category can be derived as the plant category based on the keyword "plant". The correspondence between "actor" and the face recognition category and between "plant" and the plant category may be stored in advance in a question type table in a MySQL database. The question type table contains a plurality of preset correspondences between keywords and question categories, and the question category of a user question can be obtained by querying this table.
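A minimal sketch of this table-driven lookup, assuming the question type table has already been read from MySQL into a Python dictionary; the keywords and category names below are illustrative, not taken from the patent:

```python
# Hypothetical in-memory copy of the MySQL question type table:
# keyword appearing in the question -> preset question category.
QUESTION_TYPE_TABLE = {
    "actor": "face_recognition",
    "plant": "plant",
    "animal": "animal",
    "calorie": "calorie",
}

def predict_question_category(segmented_words):
    """Return the first preset question category whose keyword appears in the question."""
    for word in segmented_words:
        if word in QUESTION_TYPE_TABLE:
            return QUESTION_TYPE_TABLE[word]
    return None  # no preset category matched; later treated as a user-defined scene
```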
Analysis of the screenshot image may include predicting the image scene tags of the screenshot image; the image scene tags may be drawn from a plurality of preset image scene tags, such as an animal scene tag, a plant scene tag, and a person scene tag. The number of preset image scene tags may equal the number of question categories with a one-to-one correspondence; it may be greater than the number of question categories, with one question category corresponding to several preset image scene tags; or it may be smaller, with several question categories corresponding to one preset image scene tag.
For example, the preset image scene tag may be determined based on an entity identified in the screenshot image, and the image scene tag of the screenshot image may be determined according to a preset correspondence of the entity and the preset image scene tag.
For example, if the entities detected in the screen capturing image by the target recognition algorithm include "person", the image scene tag can be derived as the person scene tag based on the entity "person"; if the detected entities include "flower", the image scene tag can be derived as the plant scene tag based on the entity "flower". The correspondence between "person" and the person scene tag and between "flower" and the plant scene tag may be stored in advance in an image scene tag table in the MySQL database, and the image scene tags of the screen capturing image can be obtained by querying this table.
Step S503: and determining a question and answer scene according to the matching result of the image scene tag and the question category.
In some embodiments, the user question has one question category and the screen capturing image has one image scene tag. If the question category corresponds to the image scene tag, the image scene tag is determined to match the question category; if it does not correspond, they are determined not to match.
In some embodiments, the user question has one question category and the screen capturing image has a plurality of image scene tags. If the question category corresponds to at least one image scene tag of the screen capturing image, the image scene tags are determined to match the question category; if the question category corresponds to none of the image scene tags, they are determined not to match.
In some embodiments, a corresponding relation table of each image scene tag and question category may be pre-stored in the MySQL database, and by querying the corresponding relation table, it may be determined whether the image scene tag of the screenshot image matches the question category of the user question.
In some embodiments, whether the image scene tags of the screenshot image match the question category of the user question may also be determined based on the entities corresponding to the image scene tags and the entity corresponding to the question category. If the entity corresponding to at least one image scene tag is the same as the entity corresponding to the question category, the image scene tags are determined to match the question category; otherwise they do not match. The shared entity is the target entity.
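A sketch of this entity-based matching; the tag-to-entity and category-to-entity mappings stand in for the MySQL correspondence tables and are illustrative:

```python
# Illustrative stand-ins for the MySQL correspondence tables.
TAG_TO_ENTITY = {"animal_scene": "animal", "plant_scene": "plant", "person_scene": "person"}
CATEGORY_TO_ENTITY = {"animal": "animal", "plant": "plant", "face_recognition": "person"}

def match_tags_to_category(image_scene_tags, question_category):
    """Return (matched, target_entity). The tags match the category when at least one
    scene tag corresponds to the same entity as the question category; that shared
    entity is the target entity."""
    question_entity = CATEGORY_TO_ENTITY.get(question_category)
    if question_entity is None:
        return False, None
    for tag in image_scene_tags:
        if TAG_TO_ENTITY.get(tag) == question_entity:
            return True, question_entity
    return False, None
```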
In some embodiments, if the image scene tag matches the question category, it may be determined that the question scene corresponding to the visual question instruction is the question scene corresponding to the question category.
In some embodiments, if the image scene tags do not match the question category, it may be determined that the question-answer scene corresponding to the visual question-answering instruction is not a question-answer scene corresponding to the question category. This can happen when the question category of the user question is a preset question category and the image scene tags of the screen capturing image are preset image scene tags, but the two do not correspond; when the question category of the user question does not belong to the preset question categories; or when the image scene tags of the screen capturing image are not preset image scene tags. For such scenes, the reply may be generated by a preset reply policy. For example, the preset reply policy may be to output a question-answering result through a BLIP-VQA (Bootstrapping Language-Image Pre-training) visual question-answering model, and if the visual question-answering model does not output a question-answering result, the image description information of the screen capturing image may be determined as the question-answering result.
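A sketch of such a preset reply policy using the Hugging Face BLIP VQA model. The checkpoint name and the fallback-to-caption behavior follow the description above, but the exact model and interface used by the patent are not specified, so treat these as assumptions:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Assumed public checkpoint; the patent does not name a specific model file.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def preset_reply(screenshot: Image.Image, user_question: str, image_caption: str = "") -> str:
    """Answer with BLIP-VQA; fall back to the image description if no answer is produced."""
    inputs = processor(screenshot, user_question, return_tensors="pt")
    output_ids = vqa_model.generate(**inputs)
    answer = processor.decode(output_ids[0], skip_special_tokens=True).strip()
    return answer if answer else image_caption
```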
Step S504: and generating a question-answering result corresponding to the visual question-answering instruction according to the answer strategy corresponding to the question-answering scene.
In some embodiments, after obtaining the question-answer scene, if the question-answer scene is a question-answer scene corresponding to a question category, a reply strategy corresponding to the question category may be obtained, and a question-answer result corresponding to the visual question-answer instruction is generated according to the reply strategy.
In some embodiments, after obtaining the question-answer scene, if the question-answer scene is a user-defined scene, the question-answer result corresponding to the visual question-answer instruction may be generated according to a preset answer policy.
For example, different question-answer scenes may correspond to different answer strategies, and the types of answer content may differ. For the face recognition scene corresponding to the face recognition category, the answer strategy may include the recognition result of the screenshot image and a movie recommendation; for the calorie calculation scene corresponding to the calorie category, the answer strategy may include invoking a calorie-calculation algorithm interface to obtain the calorie content of the queried entity in the screenshot image, where the calorie-calculation algorithm interface may be provided by a third party.
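One way to organize these per-scene answer strategies is a dispatch table, reusing the preset_reply sketch above for the unmatched case; the scene names and handler bodies below are placeholders for illustration only:

```python
def reply_face_recognition(target_entity, screenshot, question):
    # Recognize the person in the screenshot, then return an answer containing the
    # person's introduction and associated movie recommendations (details omitted).
    raise NotImplementedError

def reply_calorie(target_entity, screenshot, question):
    # Call a third-party calorie-calculation interface for the queried food entity.
    raise NotImplementedError

REPLY_STRATEGIES = {                     # question-answer scene -> answer strategy
    "face_recognition_scene": reply_face_recognition,
    "calorie_scene": reply_calorie,
}

def generate_answer(scene, target_entity, screenshot, question):
    handler = REPLY_STRATEGIES.get(scene)
    if handler is None:                  # user-defined scene: fall back to the preset policy
        return preset_reply(screenshot, question)
    return handler(target_entity, screenshot, question)
```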
In some embodiments, after the display device receives the visual question-answering instruction in step S501, the visual question-answering instruction may also be sent to the server, and the server executes steps S502-S504 to obtain a question-answering result and feeds back the question-answering result to the display device.
As can be seen from fig. 5, in the embodiment of the present application, the question category is matched against the image scene tags. If they match, the relevance between the screen capturing image and the user question is relatively high; the question-answer scene can be determined from the matching result, and the answer strategy corresponding to that scene is selected for the reply. If they do not match, the relevance between the screen capturing image and the user question is relatively low, and the question-answering result can be obtained through the preset reply policy. This subdivides the question-answer scenes of visual question answering, which helps improve the accuracy of visual question answering and the visual question-answering experience.
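Putting steps S501 to S504 together, the controller-side flow could be sketched as follows, reusing the helper sketches above; all names are illustrative and the scene naming is simplified relative to the scene subdivision described later:

```python
def handle_visual_qa_instruction(screenshot, user_question, segment, predict_image_scene_tags):
    """End-to-end sketch of steps S501-S504. `segment` and `predict_image_scene_tags`
    stand in for the LAC-based question analysis and the detection/caption-based
    image analysis described in the following sections."""
    words = segment(user_question)                                    # S502: question analysis
    question_category = predict_question_category(words)
    image_scene_tags = predict_image_scene_tags(screenshot)           # S502: image analysis
    matched, target_entity = match_tags_to_category(image_scene_tags, question_category)
    scene = f"{question_category}_scene" if matched else "user_defined_scene"    # S503
    return generate_answer(scene, target_entity, screenshot, user_question)      # S504
```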
To further describe the visual question-answering method provided in the embodiments of the present application, the following will be described with reference to a software architecture of the visual question-answering method. Referring to fig. 6, in some embodiments, a software architecture of a visual question-answering method includes a question classification module, an image scene tag generation module, and a question-answering result generation module, where the question classification module is used for predicting a question category of a user question, the image scene tag generation module is used for predicting an image scene tag of a screen shot image, and the question-answering result generation module is used for performing matching analysis on the image scene tag and the question category and generating a question-answering result according to the matching result.
In some embodiments, the data processing flow of the question classification module may refer to fig. 7, and as shown in fig. 7, the processing of the user question by the question classification module includes steps S601 to S607.
Step S601: a user question is entered.
Step S602: text segmentation and part-of-speech classification
In some embodiments, taking the user question "What animal is in the picture?" as an example, the LAC (Lexical Analysis of Chinese) text word segmenter can be used to perform text word segmentation and part-of-speech classification on the user question, giving the following part-of-speech classification result: "picture (n), in (f), there is (v), what (r), animal (n)".
Step S603: part-of-speech substitution.
In some embodiments, after the part-of-speech classification result is obtained, keyword matching may be performed on words of a preset part of speech in the user question against a keyword list stored in advance in the MySQL database. If a word of the preset part of speech in the user question matches a keyword in the keyword list, part-of-speech replacement is performed on that word in the part-of-speech classification result, i.e., the word is replaced with the keyword from the keyword list. After the part-of-speech replacement, a question in the first format is obtained. For example, the preset part of speech is nouns, and keywords corresponding to a plurality of nouns are set in the keyword list (e.g., the noun "animal" corresponds to the keyword "animal"); the first-format question obtained by part-of-speech replacement of "What animal is in the picture?" is then "what animal is in the picture". A sketch of this step follows.
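The sketch below covers steps S602 and S603 using Baidu's LAC toolkit (pip install lac); the noun keyword list is illustrative, and the exact replacement rules in the patent may differ:

```python
from LAC import LAC   # Baidu Chinese lexical analysis toolkit

lac = LAC(mode="lac")

# Illustrative keyword list as it might be loaded from the MySQL keyword table:
# noun appearing in the question -> canonical keyword used by the classification templates.
NOUN_KEYWORDS = {"动物": "animal", "植物": "plant", "食物": "food"}

def to_first_format(question: str) -> str:
    """Segment the question, classify parts of speech, and replace preset-part-of-speech
    words (here: nouns) that match the keyword list."""
    words, tags = lac.run(question)        # e.g. ["图", "中", "有", "什么", "动物"], ["n", "f", "v", "r", "n"]
    replaced = [NOUN_KEYWORDS.get(w, w) if t == "n" else w for w, t in zip(words, tags)]
    return " ".join(replaced)              # first-format question
```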
Step S604: question classification template matching
In some embodiments, after the first-format question is obtained, keyword matching is performed between the first-format question and the question samples in the question classification template, and the question sample containing the keyword is determined as the question sample corresponding to the user question; the question sample is a question in the second format. For example, the question sample containing the keyword "animal" is "What animal is this", so it can be determined from the keyword matching result that the question sample corresponding to "What animal is in the picture" is "What animal is this".
Referring to table 1, a matching list of question classification templates is shown:
TABLE 1
No. | Question sample | MySQL replacement word | Question category
1 | What animal is this | Animal category: animal | Animal
2 | What plants are in the picture | Plant category: plant | Plant
3 | What brand of car is this | Vehicle category: vehicle | Vehicle
4 | What food is this | Food category: food | Food
5 | What is the calorie content of this food | Calorie category: calorie | Calorie
6 | What works has this actor appeared in | Movie question-and-answer: KBQA | Movie question-and-answer
7 | Who is this person | Face recognition: who | Face recognition
8 | How many dogs are in the picture | Counting: number | Counting
In Table 1, the MySQL replacement words are the keywords matched by the question classification templates, and each question sample corresponds to a question category. By way of example, the question categories may include: the animal category, plant category, vehicle category, food category, calorie category, movie question-and-answer category, face recognition category, counting category, and so on.
In some embodiments, as shown in table 1, the question categories are all single categories.
In some embodiments, a question category may be a single category or a compound category. For example, the question categories of question samples 1-4, 6, and 7 are single categories; the question category of question sample 5 is a compound category: the food category and the calorie category; and the question category of question sample 8 is a compound category: the animal category and the counting category.
Step S605: word segmentation vectorization.
In some embodiments, after obtaining the question sample corresponding to the user question, the question sample may be word-segmented and vectorized.
Step S606: and (5) classifying by a classifier.
In some embodiments, after the word segmentation vector is obtained, the word segmentation vector is input into a naive bayes classifier, and classification is performed by the naive bayes classifier to obtain a question classification result of the question sample.
The NBC (Naive Bayes Classifier) is a classification method based on Bayes' theorem with the assumption that the feature conditions are mutually independent. For a given training data set, the joint probability distribution from input to output is learned under this independence assumption, and the output with the maximum posterior probability can be obtained from the learned model. In some embodiments, the feature conditions are the word-segmentation vector, and the probability that a sample belongs to class y_i is:
P(y_i | x) = [ P(y_i) ∏_{j=1}^{d} P(x_j | y_i) ] / [ ∑_{k=1}^{M} P(y_k) ∏_{j=1}^{d} P(x_j | y_k) ]
In some embodiments, the scene category set has M elements, so y_i is the i-th of the M question categories in Table 1, x represents the question sample after question classification template matching (the input, e.g., "What animal is this"), x_j represents any word in the input sample (e.g., "this", "is", "what", "animal"), and d represents the total number of words in the question sample (d = 4 for "What animal is this"). The training data set used to train the naive Bayes classifier mainly consists of question samples of the M classes, so the output classes and probabilities cover only these M classes. When keywords such as animal, plant, food, or calorie appear in a question, the classifier judges that the question sample belongs to the corresponding question category and outputs a question classification result containing the question category and its probability, thereby classifying the user question.
Step S607: and outputting a question classification result.
In some embodiments, the question categories are all single categories, and after word segmentation vectors corresponding to the question examples are input into the naive Bayesian classifier, the question category with the highest probability output by the naive Bayesian classifier is determined as the question category of the question examples.
In some embodiments, the question categories include single categories and compound categories. After the word-segmentation vector corresponding to the question sample is input into the naive Bayes classifier, if the predicted category with the highest probability output by the classifier is an entity category, such as the animal category, that predicted category is determined to be the question category of the question sample; if the predicted category with the highest probability is not an entity category, such as the calorie category, then the highest-probability predicted category together with the highest-probability entity category (such as the food category) is determined to be the question category of the question sample.
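A compact sketch of steps S605 to S607 using scikit-learn's MultinomialNB in place of whatever naive Bayes implementation the patent uses; the training samples, categories, and entity-category set are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training data: space-joined word segments of question samples and their categories.
samples = ["这 是 什么 动物", "图 中 有 什么 植物", "这 是 什么 食物",
           "这个 食物 的 热量 是 多少", "这 是 谁"]
labels  = ["animal", "plant", "food", "calorie", "face_recognition"]
ENTITY_CATEGORIES = {"animal", "plant", "vehicle", "food", "face_recognition"}

# Keep single-character Chinese tokens (the default token pattern drops them).
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
clf = MultinomialNB().fit(vectorizer.fit_transform(samples), labels)

def classify_question(segmented_sample: str):
    """Return a single category, or a compound category when the top prediction is non-entity."""
    probs = clf.predict_proba(vectorizer.transform([segmented_sample]))[0]
    ranked = sorted(zip(clf.classes_, probs), key=lambda cp: cp[1], reverse=True)
    top = ranked[0][0]
    if top in ENTITY_CATEGORIES:
        return [top]
    top_entity = next(c for c, _ in ranked if c in ENTITY_CATEGORIES)
    return [top, top_entity]               # e.g. ["calorie", "food"]
```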
In some embodiments, the data processing flow of the image scene tag generation module may refer to fig. 8, and as shown in fig. 8, the processing of the screen capturing image by the image scene tag generation module includes steps S701-S707.
Step S701: a screen capture image is input.
Step S702: and (5) detecting a target.
In some embodiments, target detection may be implemented using the YOLOv7 network model. The YOLOv7 network model mainly consists of three parts; as shown in fig. 9, it includes an INPUT (image input) module, a Backbone network module, and a HEAD network. An RGB image of size 640 x 640 x 3 is input into the Backbone network; based on the three output layers of the Backbone, the HEAD layer outputs feature maps at three different scales, and the three tasks of image detection (classification, foreground/background classification, and bounding-box regression) are predicted through RepVGG blocks and convolution layers. The final output for a target in the image contains the probability that a target is present, the class with the highest confidence among the N classes, the center coordinates (x, y) of the target's bounding box, and the box width w_x and height w_y.
In some embodiments, after the category of the target output by the YOLOv7 network is obtained, a first label corresponding to the category may be obtained.
In some embodiments, target detection may also be implemented using other network models, such as the YOLOv3 network model.
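However the detector is implemented, turning its output into first labels could look like the following sketch; run_detector is a placeholder for the actual YOLOv7 (or other) inference call and is not a real library API:

```python
def first_labels(image, run_detector, top_k=5):
    """run_detector(image) is assumed to return (class_name, confidence, box) tuples.
    Keep the highest confidence per detected class and return the top-k classes as
    the first labels with their probabilities."""
    best = {}
    for class_name, confidence, _box in run_detector(image):
        best[class_name] = max(best.get(class_name, 0.0), confidence)
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(ranked)                    # e.g. {"dog": 0.92, "flower": 0.41}
```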
Step S703: and (5) label fusion.
In some embodiments, the top-5 categories output by the YOLOv7 network for the screenshot image are x_1, x_2, x_3, x_4, x_5 (if fewer than five categories are detected, for example only two, then x_3, x_4, x_5 are set to 0), with probabilities P(x_1), P(x_2), ..., P(x_5) (each category takes its highest probability). The scene category to which each of x_1, ..., x_5 belongs (vehicle, animal, plant, food, person, movie question-and-answer, and other scenes) is queried from the MySQL database, and labels belonging to the same scene are superimposed (assuming x_1, x_2 belong to the animal scene class, x_3, x_4 to the plant scene class, and x_5 to the food scene class), then:
P_Y-animal = P(x_1) + P(x_2), P_Y-plant = P(x_3) + P(x_4), P_Y-food = P(x_5)
similarly, labels of other scene categories are fused.
Step S704: and (5) image description.
In some embodiments, the image description may be obtained through the BLIP network model; see fig. 10. BLIP uses a Visual Transformer as the image encoder (Image Encoder): the input screen capturing image is divided into patches, the patches are encoded into an embedding sequence, and an additional [CLS] token is used to represent the global image feature. The image-grounded text decoder then decodes from the embedding sequence, with the encoder output embedding serving as a multimodal representation of the image-text pair, thereby producing the image description of the screen capturing image.
After the image description of the screen capturing image is obtained, LAC word segmentation and part-of-speech classification can be applied to the description, the noun words x_B1, ..., x_Bn are extracted (n is the number of nouns in the sentence), and the category of each word segment is queried from the MySQL database to obtain the second label corresponding to that word segment.
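A sketch of step S704 and the noun extraction using the Hugging Face BLIP captioning model together with LAC; the checkpoint name, the placeholder confidence of 1.0 for caption-derived labels, and the noun-to-label dictionary are assumptions, not details from the patent:

```python
from PIL import Image
from LAC import LAC
from transformers import BlipProcessor, BlipForConditionalGeneration

caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
lac = LAC(mode="lac")

def second_labels(screenshot: Image.Image, noun_to_label: dict) -> dict:
    """Caption the screenshot, extract nouns from the description, and map them to second labels."""
    inputs = caption_processor(screenshot, return_tensors="pt")
    caption = caption_processor.decode(caption_model.generate(**inputs)[0], skip_special_tokens=True)
    words, tags = lac.run(caption)
    # noun_to_label plays the role of the MySQL word-to-label query; 1.0 is a placeholder probability.
    return {noun_to_label[w]: 1.0 for w, t in zip(words, tags) if t == "n" and w in noun_to_label}
```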
Step S705: and (5) label fusion.
In some embodiments, the method of fusing the second tag is the same as the method of fusing the first tag.
For example, if x_B1, x_B2 belong to the animal class, then:
P_B-animal = P(x_B1) + P(x_B2)
similarly, labels of other scene categories are fused.
Step S706: and (5) label fusion.
In some embodiments, the prediction probabilities of labels belonging to the same scene in the fused first labels and the fused second labels may be weighted and superimposed, and the image scene tags may be screened from the fused labels according to the weighted, superimposed prediction probabilities.
For example, the probability for the animal scene tag is:
P_animal = 0.7 × P_Y-animal + 0.3 × P_B-animal
Other scene categories are calculated in the same way. The probability threshold is set to 0.2: when P_scene-category ≥ 0.2, the label of that scene category is retained; otherwise, it is deleted. The image scene tags are then obtained from the finally retained scene category labels (for example, if the retained labels are the animal scene tag and the plant scene tag, the animal scene tag and the plant scene tag are output).
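A sketch of the fusion in steps S703, S705, and S706: probabilities of labels belonging to the same scene are summed per source, the two sources are combined with the 0.7/0.3 weights, and scenes below the 0.2 threshold are dropped; the entity-to-scene mapping is illustrative:

```python
ENTITY_TO_SCENE = {"dog": "animal", "cat": "animal", "flower": "plant", "rice": "food"}  # illustrative

def fuse_by_scene(labels):
    """Sum the probabilities of labels that belong to the same scene (steps S703 / S705)."""
    fused = {}
    for entity, prob in labels.items():
        scene = ENTITY_TO_SCENE.get(entity)
        if scene is not None:
            fused[scene] = fused.get(scene, 0.0) + prob
    return fused

def image_scene_tags(yolo_labels, caption_labels, threshold=0.2):
    """Weighted superposition of the two fused label sets, then thresholding (step S706)."""
    p_yolo, p_cap = fuse_by_scene(yolo_labels), fuse_by_scene(caption_labels)
    weighted = {s: 0.7 * p_yolo.get(s, 0.0) + 0.3 * p_cap.get(s, 0.0)
                for s in set(p_yolo) | set(p_cap)}
    return [s for s, p in weighted.items() if p >= threshold]
```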
In some embodiments, the question and answer result generation module may determine whether the question category of the user question matches the image scene tag of the screenshot image according to the correspondence between the question category and the image scene tag stored in the MySQL database. The data processing flow of the question-answer result generation module may refer to fig. 11, including step S801 to step S810.
Step S801: judging whether the question category matches the image scene label.
In some embodiments, if the entity corresponding to the image scene tag includes an entity corresponding to a question category, determining that the question category is matched with the image scene tag, e.g., the entity corresponding to the image scene tag is an animal, and the entity corresponding to the question category is an animal, then the question category is matched with the image scene tag; if the entity corresponding to the image scene tag does not contain the entity corresponding to the question category, determining that the question category and the image scene tag are not matched, for example, the entity corresponding to the image scene tag is a plant, the entity corresponding to the question category is a person, and the question category and the image scene tag are not matched.
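One possible reading of this matching rule, sketched in Python: the match holds when every entity implied by the question category also appears among the entities implied by the image scene labels. This is an interpretation for illustration, not the only possible one.

```python
def question_matches_scene(question_entities, scene_entities):
    """Step S801 as a set check: match if the entities implied by the question
    category are all contained in the entities implied by the image scene labels."""
    return set(question_entities) <= set(scene_entities)

print(question_matches_scene({"animal"}, {"animal", "plant"}))  # True  -> matched
print(question_matches_scene({"person"}, {"plant"}))            # False -> not matched
```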
Step S802: if the question category is matched with the image scene label, determining a question and answer scene according to the question category.
In some embodiments, if the question category of the user question matches the image scene tag of the screen capture image, a question-answer scene may be determined based on the matching result of the question category and the image scene tag. Illustratively, the animal category in the question category matches the animal scene tag in the image scene tag, and the plant category in the question category matches the plant scene tag in the image scene tag.
In some embodiments, the question-answer scenes may be classified into a person movie recommendation class scene, other movie recommendation class scenes, a preset algorithm class scene, and a user-defined scene.
The character movie recommendation class scene indicates that the question category of the user question and the image scene label of the screen capturing image both correspond to persons; examples are the face recognition class scene and the movie question-and-answer class scene.
Other film and television recommendation class scenes indicate that the question category of the user question and the image scene label of the screen capturing image correspond to entities other than persons, and that the question category includes only entity categories and no non-entity categories; examples are animal class scenes, plant class scenes, vehicle class scenes and food class scenes.
The preset algorithm class scene indicates that the question category of the user question and the image scene label of the screen capturing image correspond to entities other than persons, and that the question category includes a non-entity category; examples are the calorie calculation class scene and the counting class scene. If the question categories are all single categories, the non-entity category in the question categories also corresponds to an entity; for example, the calorie category corresponds to the entity food, and when the entities corresponding to the image scene label include food, the question-answer scene is determined to be a calorie calculation class scene. If the question categories include both a single category and a composite category, the entity category in the composite category is used for matching against the entities corresponding to the image scene label; for example, if the question categories are the calorie category and the food category and the entities corresponding to the image scene label include food, the question-answer scene is determined to be a calorie calculation class scene.
The user-defined scene indicates that the question category of the user question does not match the image scene label of the screen capturing image.
In some embodiments, the person video recommendation class scene and the other video recommendation class scene both belong to a video recommendation class scene in which a video recommendation may be made for a user.
In some embodiments, different reply policies may be employed to output the question and answer results of the visual questions and answers for different question and answer scenarios.
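Putting the four scene types described above together, a simplified decision function might look like the following; the category sets are illustrative examples, not an exhaustive configuration of the system.

```python
PERSON_CATEGORIES = {"face recognition", "movie question-and-answer"}
ENTITY_CATEGORIES = {"animal", "plant", "vehicle", "food"}
ALGORITHM_CATEGORIES = {"calorie", "counting"}

def classify_question_answer_scene(question_categories, matched):
    """Map a user question to one of the four question-answer scene types.
    `question_categories` is a set of predicted question categories and
    `matched` says whether they matched the image scene labels."""
    if not matched:
        return "user-defined scene"
    if question_categories & PERSON_CATEGORIES:
        return "character movie recommendation scene"
    if question_categories & ALGORITHM_CATEGORIES:
        return "preset algorithm scene"
    if question_categories & ENTITY_CATEGORIES:
        return "other movie recommendation scene"
    return "user-defined scene"

print(classify_question_answer_scene({"face recognition"}, True))  # character movie recommendation scene
print(classify_question_answer_scene({"calorie", "food"}, True))   # preset algorithm scene
print(classify_question_answer_scene({"animal"}, False))           # user-defined scene
```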
Step S803: and judging whether the question-answer scene is a character movie recommendation scene.
In some embodiments, when the question category of the user question matches the image scene label of the screen capturing image and the question category is an entity category (such as a face recognition category or an animal category), the question-answer scene is determined to be a movie recommendation scene. If the question category is a face recognition category or a movie question-and-answer category, the question-answer scene can further be determined to be a character movie recommendation scene.
Step S804: and if the question-answering scene is a character movie recommendation scene, determining characters to be subjected to movie recommendation.
In some embodiments, when the question-answer scene is a movie recommendation scene, the answer policy may be to output entity introduction information and movie recommendation information, where if the entity is a person, the entity introduction information may be person introduction information, and if the entity is a non-person entity such as an animal, a plant, etc., the entity introduction information may be introduction information of a corresponding category of entity.
In some embodiments, the person to be film-recommended may be determined from the person tags in the image scene tags. If there are multiple character labels, the character to be subjected to movie recommendation can be further determined according to the user question, for example, the position of the character to be subjected to movie recommendation in the screen capturing image is determined according to character feature keywords in the user question, the character label corresponding to the position is obtained, and the character feature keywords can include keywords representing the position of the character or keywords representing feature information of the character such as clothes color, gender and the like.
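As an illustration of picking one character among several character labels using a position keyword from the user question, a minimal sketch follows; the label fields and keyword handling are hypothetical simplifications of the character-feature-keyword matching described above.

```python
def select_person_label(person_labels, user_question):
    """Pick the character to recommend for when several person labels exist.
    Each label carries the x coordinate of its bounding-box center (cx)."""
    if not person_labels:
        return None
    if len(person_labels) == 1:
        return person_labels[0]
    by_x = sorted(person_labels, key=lambda p: p["cx"])
    if "left" in user_question:
        return by_x[0]
    if "right" in user_question:
        return by_x[-1]
    return by_x[0]  # fall back to the left-most person

people = [{"tag": "person_a", "cx": 120}, {"tag": "person_b", "cx": 520}]
print(select_person_label(people, "who is the person on the right"))  # {'tag': 'person_b', 'cx': 520}
```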
Step S805: and generating film recommendation information through a film recommendation module.
In some embodiments, the server may be provided with a movie recommendation module, and the movie recommendation module may acquire character introduction information and movie recommendation information associated with a character to be subjected to movie recommendation based on a movie knowledge graph, where the movie recommendation information may include movie work information of a character director or a participating person corresponding to the character tag.
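A toy, in-memory stand-in for the knowledge-graph lookup performed by the movie recommendation module might look like this; the graph contents and field names are invented for illustration.

```python
# Hypothetical mini knowledge graph: person tag -> introduction and associated works.
MOVIE_GRAPH = {
    "actor_x": {
        "introduction": "Actor X is a film actor.",
        "works": ["Movie A (lead role)", "Movie B (supporting role)"],
    },
}

def recommend_for_person(person_tag):
    """Return (person introduction, associated movie works) for the person to be
    recommended, or (None, []) when the person is not in the graph."""
    node = MOVIE_GRAPH.get(person_tag)
    if node is None:
        return None, []
    return node["introduction"], node["works"]

print(recommend_for_person("actor_x"))
# ('Actor X is a film actor.', ['Movie A (lead role)', 'Movie B (supporting role)'])
```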
Step S806: and generating a question and answer result containing the film and television recommendation information.
In some embodiments, if the question-answer scene is a movie recommendation scene, the question-answer result outputs the introduction information of the target entity and the movie recommendation information.
In some embodiments, if the character introduction information and the movie recommendation information are obtained according to the movie recommendation module, a question-answer result containing the character introduction information and the movie recommendation information can be generated, and the question-answer result is displayed on the display device.
Step S807: if the question-answer scene is not the character movie recommendation scene, judging whether the question-answer scene is other movie recommendation scenes.
In some embodiments, when the question category of the user question matches the image scene label of the screen capturing image and the question category is an entity category (such as a face recognition category or an animal category), the question-answer scene is determined to be a movie recommendation scene. If the question category is an animal category, plant category, vehicle category or food category, the question-answer scene can further be determined to be one of the other film and television recommendation scenes.
Step S808: and if the question-answer scene is other film and television recommendation scenes, determining an entity to be subjected to film and television recommendation.
In some embodiments, the entity to be film and television recommended may be determined according to an entity tag in the image scene tag. If there are multiple entity tags, the entity to be recommended for the movie may be further determined based on the entity feature keywords in the user question, where the entity feature keywords may include keywords indicating the location of the entity or keywords indicating feature information of the entity, such as color, size, etc.
In some embodiments, for an entity to be film-recommended, corresponding entity introduction information may be obtained through a third party interface, and film-recommended information associated with the entity to be film-recommended may be obtained through a film-recommended module, so as to generate a question-answer result including the entity introduction information and the film-recommended information.
Step S809: and if the question-answering scene is not the other film and television recommendation scene, calling an algorithm interface corresponding to the question category to generate a question-answering result.
In some embodiments, when the question category of the user question matches the image scene label of the screen capturing image and the question category is not an entity category, the question-answer scene is determined to be the preset algorithm class scene corresponding to the question category; for example, the question-answer scene is determined to be a calorie calculation class scene when the question category is the calorie category, and a counting class scene when the question category is the counting category.
When the question-answering scene is a preset algorithm type scene, the reply strategy can be used for outputting entity introduction information and entity prediction information corresponding to a user question, wherein the entity introduction information can be obtained based on an entity tag in a screen capturing image, and the entity prediction information can be obtained based on a preset algorithm associated with a question sample corresponding to the user question.
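For the calorie calculation scene, the preset algorithm might combine the food entities recognized in the screenshot with a nutrition table to produce the entity prediction information; the sketch below uses invented calorie values and is not the actual algorithm interface.

```python
# Toy calorie table standing in for whatever data source the calorie algorithm
# interface actually queries; the values are illustrative only.
CALORIES_PER_100G = {"apple": 52, "banana": 89, "cake": 350}

def calorie_prediction(food_entities):
    """Example of a preset-algorithm reply: combine the entities found in the
    screenshot with a per-entity estimate to form the prediction information."""
    known = {f: CALORIES_PER_100G[f] for f in food_entities if f in CALORIES_PER_100G}
    total = sum(known.values())
    return {"per_item_kcal_per_100g": known, "total_kcal_per_100g": total}

print(calorie_prediction(["apple", "cake"]))
# {'per_item_kcal_per_100g': {'apple': 52, 'cake': 350}, 'total_kcal_per_100g': 402}
```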
Step S810: if the question category is not matched with the image scene label, outputting a question and answer result or outputting a general question and answer result containing image description information through a VQA model.
In some embodiments, when the question-answer scene is a user-defined scene, the reply policy may be to output a question-answer result through the VQA model and to output description information of the screen capturing image. Fig. 12 shows a flowchart of the visual question-answering method of the VQA model: the BLIP model uses a Visual Transformer as the Image Encoder, divides the input screen capturing image into patches, encodes the patches into an embedding sequence, and uses an additional [CLS] token to represent the global image feature. The image-grounded text decoder decodes the embedding sequence, and the answer decoder then decodes the embedding sequence output by the question decoder, thereby realizing the visual question and answer.
In some embodiments, when the question-answer scene is a user-defined scene, the reply policy may be to output a question-answer result through the VQA model first. If the VQA model does not output a question-answer result meeting expectations, the description information of the screen capturing image may be output as the question-answer result; for example, if the question-answer result output by the VQA model is a default result such as "I don't understand your question, please describe it more specifically", or if the VQA model outputs no question-answer result at all, the description information of the screen capturing image is output as the question-answer result of the user question.
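The fallback behaviour for the user-defined scene can be sketched as a simple guard around the VQA output; the default-answer strings are placeholders, not the actual model responses.

```python
DEFAULT_ANSWERS = {"", "I don't understand your question, please describe it more specifically"}

def answer_user_defined_scene(vqa_answer, image_description):
    """Prefer the VQA model's answer, but fall back to the image description
    when the answer is missing, empty or one of the known default replies."""
    if vqa_answer is None or vqa_answer.strip() in DEFAULT_ANSWERS:
        return image_description
    return vqa_answer

print(answer_user_defined_scene("", "A dog sitting on the grass."))       # falls back to the description
print(answer_user_defined_scene("A brown dog.", "A dog on the grass."))   # uses the VQA answer
```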
According to the above embodiments, after the question category of the user question and the image scene label of the screen capturing image are predicted, the image scene label is matched against the question category, the question-answer scene is determined according to the matching result, the answer strategy is determined based on the question-answer scene, and the question-answer result is then generated. The matching result of the image scene label and the question category represents the degree of association between the user question and the screen capturing image; determining the answer strategy according to this degree of association improves the accuracy of the question-answer result and improves the visual question-answering experience.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A display device, characterized by comprising:
a display;
a controller coupled to the display, the controller configured to:
receiving a visual question-answering instruction, wherein the visual question-answering instruction comprises a screen capturing image and a user question;
responding to the visual question-answering instruction, predicting a question category of the user question, and predicting an image scene label of the screen capturing image;
determining a question-answer scene according to the matching result of the image scene tag and the question category;
and generating a question-answering result corresponding to the visual question-answering instruction according to the answer strategy corresponding to the question-answering scene.
2. The display device according to claim 1, wherein the determining a question-answer scene according to a matching result of the image scene tag and the question category includes:
responding to the matching of the question category and the image scene label, and determining a question-answer scene according to the question category;
and responding to the mismatching of the question category and the image scene label, and determining the question-answering scene as a user-defined scene.
3. The display device of claim 2, wherein the determining a question-answer scenario from the question category comprises:
Determining the question-answer scene as a film and television recommendation scene according to the question category as an entity category;
and determining the question-answering scene as a preset algorithm scene corresponding to the question category according to the fact that the question category is not the entity category.
4. A display device according to claim 3, wherein the generating the question-answer result corresponding to the visual question-answer instruction according to the answer policy corresponding to the question-answer scene comprises:
acquiring introduction information of a target entity and film and television recommendation information associated with the target entity according to the question and answer scene as a film and television recommendation type scene, and generating a question and answer result containing the introduction information and the film and television recommendation information, wherein the target entity is determined according to the entity type and the entity identified in the screen capturing image, and the image scene label comprises a label corresponding to the target entity;
and calling an algorithm interface corresponding to the question category according to the question-answering scene as a preset algorithm category scene to acquire the prediction information of the entity corresponding to the keyword in the user question, and generating a question-answering result containing the prediction information.
5. The display device of claim 1, wherein the predicting the image scene tag of the screenshot image comprises:
Detecting an entity in the screen capturing image, and generating a first label according to the detected entity;
generating description information of the screen capturing image, and generating a second label according to a word segmentation result of the description information;
respectively fusing the labels belonging to the same scene in the first label and the second label to obtain a fused label;
and fusing the labels belonging to the same scene in the fused labels to obtain the image scene label.
6. The display device according to claim 5, wherein the fusing the labels belonging to the same scene in the fused labels to obtain the image scene label includes:
and carrying out weighted superposition on the prediction probabilities of the labels belonging to the same scene in the fusion labels, and screening out image scene labels from the fusion labels according to the weighted superposed prediction probabilities.
7. The display device of claim 1, wherein the predicting question categories of the user questions comprises:
acquiring a prediction category with the highest probability of the user question through a naive Bayes classifier;
responding to the predicted category as an entity category, and determining the question category of the user question as the entity category;
And determining that the question category of the user question comprises the prediction category and a target category, wherein the target category is the entity category with the highest probability of the user question.
8. The display device according to claim 7, wherein the determining a question-answer scene according to a matching result of the image scene tag and the question category includes:
matching the entity category in the question category with the image scene tag;
determining a question-answer scene according to the prediction category in response to the entity category being matched with the image scene tag;
and determining the question-answer scene as a user-defined scene in response to the fact that the entity category is not matched with the image scene label.
9. A method of visual question answering, comprising:
receiving a visual question-answering instruction, wherein the visual question-answering instruction comprises a screen capturing image and a user question;
responding to the visual question-answering instruction, predicting a question category of the user question, and predicting an image scene label of the screen capturing image;
determining a question-answer scene according to the matching result of the image scene tag and the question category;
And generating a question-answering result corresponding to the visual question-answering instruction according to the answer strategy corresponding to the question-answering scene.
10. The visual question-answering method according to claim 9, wherein the determining a question-answering scene according to the matching result of the image scene tag and the question category comprises:
responding to the matching of the question category and the image scene label, and determining a question-answer scene according to the question category;
and responding to the mismatching of the question category and the image scene label, and determining the question-answering scene as a user-defined scene.
CN202310094906.3A 2023-01-18 2023-01-18 Display device and visual question-answering method Pending CN116301337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310094906.3A CN116301337A (en) 2023-01-18 2023-01-18 Display device and visual question-answering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310094906.3A CN116301337A (en) 2023-01-18 2023-01-18 Display device and visual question-answering method

Publications (1)

Publication Number Publication Date
CN116301337A true CN116301337A (en) 2023-06-23

Family

ID=86827834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310094906.3A Pending CN116301337A (en) 2023-01-18 2023-01-18 Display device and visual question-answering method

Country Status (1)

Country Link
CN (1) CN116301337A (en)

Similar Documents

Publication Publication Date Title
EP3690644B1 (en) Electronic device and operation method therefor
US10970900B2 (en) Electronic apparatus and controlling method thereof
US20180150905A1 (en) Electronic apparatus and method for summarizing content thereof
US20220377416A1 (en) Video playback device and control method thereof
CN111984763B (en) Question answering processing method and intelligent device
CN113139856B (en) Movie and television member package recommendation method and device
CN112182196A (en) Service equipment applied to multi-turn conversation and multi-turn conversation method
CN114390217A (en) Video synthesis method and device, computer equipment and storage medium
CN114186137A (en) Server and media asset mixing recommendation method
CN113051435B (en) Server and medium resource dotting method
CN112584213A (en) Display device and display method of image recognition result
CN115273848A (en) Display device and control method thereof
CN113490057B (en) Display device and media asset recommendation method
CN116301337A (en) Display device and visual question-answering method
CN111950288B (en) Entity labeling method in named entity recognition and intelligent device
CN113722542A (en) Video recommendation method and display device
CN115309487A (en) Display method, display device, electronic equipment and readable storage medium
CN114117126A (en) Video recommendation method and display device
CN113794915B (en) Server, display device, poetry and singing generation method and medium play method
CN112866760B (en) Content display method, display equipment and server
US20220198141A1 (en) System and method for identifying and displaying information related to an off screen plot element or character in a media stream
CN113658598B (en) Voice interaction method of display equipment and display equipment
CN114296581A (en) Display device and control triggering method
CN116151272A (en) Terminal equipment and semantic intention recognition method
CN116977717A (en) Image multi-label identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination