CN116204726B - Data processing method, device and equipment based on a multi-modal model

Info

Publication number: CN116204726B
Authority: CN (China)
Prior art keywords: target, question, feature, initial, additional
Legal status: Active
Application number: CN202310493430.0A
Other languages: Chinese (zh)
Other versions: CN116204726A
Inventor: 陈展
Assignee: Hangzhou Hikvision Digital Technology Co Ltd
Application filed by Hangzhou Hikvision Digital Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a data processing method, device and equipment based on a multi-modal model, wherein the method comprises the following steps: determining, based on an original input question, initial question features that match a configured question template; acquiring, based on the initial question features, a plurality of additional question features associated with the initial question features; generating a plurality of target questions based on the initial question features and the plurality of additional question features, each target question comprising the initial question features and at least one additional question feature; for each target question, inputting the target question into a trained multi-modal target model, and outputting, by the multi-modal target model, a candidate answer corresponding to the target question; and determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question. This technical solution saves computing resources of the server, reduces time overhead, and allows the target answer corresponding to the original input question to be located rapidly.

Description

Data processing method, device and equipment based on a multi-modal model
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a data processing method, apparatus, and device based on a multi-modal model.
Background
With the rapid development of network technology, the amount of network data keeps increasing. To retrieve the data a user needs from a large amount of network data, the user has to provide keywords, and a server retrieves the required data from the network data based on those keywords. However, because the volume of network data is large, the server has to consume a large amount of computing resources and a large amount of time to retrieve the data the user needs, which leads to heavy resource consumption on the server and a long wait before the user obtains the required data. For example, a real-world scene may deploy a large number of cameras that capture a large number of images, and the server may consume a large amount of computing resources and take a large amount of time to analyze the data the user needs from these images.
Disclosure of Invention
The application provides a data processing method based on a multi-modal model, which comprises the following steps:
determining, based on an original input question, initial question features that match a configured question template;
acquiring, based on the initial question features, a plurality of additional question features associated with the initial question features;
generating a plurality of target questions based on the initial question features and the plurality of additional question features, each target question comprising the initial question features and at least one additional question feature;
for each target question, inputting the target question into a trained multi-modal target model, and outputting, by the multi-modal target model, a candidate answer corresponding to the target question;
and determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question.
The application provides a data processing device based on a multi-modal model, the device comprising:
a determining module, configured to determine, based on the original input question, initial question features that match the question template;
an acquisition module, configured to acquire a plurality of additional question features associated with the initial question features;
a generation module, configured to generate a plurality of target questions based on the initial question features and the plurality of additional question features, each target question comprising the initial question features and at least one additional question feature;
the acquisition module being further configured to, for each target question, input the target question into a trained multi-modal target model and have the multi-modal target model output a candidate answer corresponding to the target question;
the determining module being further configured to determine the target answer corresponding to the original input question based on the candidate answer corresponding to each target question, where the target answer is the output result for the original input question.
The application provides an electronic device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute the machine executable instructions to implement the multi-modal model-based data processing method of the above example.
According to the above technical solution, in the embodiments of the present application, a plurality of target questions can be generated based on the initial question features and the plurality of additional question features, a candidate answer corresponding to each target question is obtained through the multi-modal target model, and the target answer corresponding to the original input question is determined based on the candidate answer corresponding to each target question. Because the target answer is obtained by means of the multi-modal target model, and the plurality of target questions make the target answer accurate and reliable, the target answer corresponding to the original input question can be queried from a large amount of data while consuming only a small amount of computing resources, and the time needed to obtain the target answer is short. This saves computing resources of the server, reduces time overhead, and allows the target answer corresponding to the original input question to be located rapidly. The solution also provides anomaly logic analysis and reasoning capability and enables automatic anomaly diagnosis: the analysis capability of the multi-modal target model is used to give an accurate and reliable target answer while comprehensively considering information in multiple dimensions such as images, video and voice.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments of the present application or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application, and a person of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 is a flow diagram of a multi-modal model-based data processing method in one embodiment;
FIG. 2 is a flow diagram of a multi-modal model-based data processing method in one embodiment;
FIG. 3 is a schematic diagram of a multi-modal model-based self-reasoning method in one embodiment;
FIG. 4 is a schematic diagram of a data processing apparatus based on a multimodal model in one embodiment;
FIG. 5 is a hardware configuration diagram of an electronic device in one embodiment.
Detailed Description
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations including one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various kinds of information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. In addition, depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The embodiment of the application provides a data processing method based on a multi-mode model, which can be applied to any device, such as a server, and the like, and referring to fig. 1, the method can include:
step 101, determining initial question features matched with the configured question templates based on the original input questions.
Step 102, acquiring a plurality of additional question features associated with the initial question feature based on the initial question feature.
Step 103, generating a plurality of target questions based on the initial question feature and the plurality of additional question features; wherein each target question includes the initial question feature and at least one additional question feature.
Step 104, for each target question, inputting the target question into the trained multi-modal target model, and outputting, by the multi-modal target model, a candidate answer corresponding to the target question.
Step 105, determining a target answer corresponding to the original input question based on the candidate answer corresponding to each target question, where the target answer may be a final output result for the original input question.
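To make the flow of steps 101 to 105 easier to follow, here is a minimal Python sketch of the pipeline. The helper callables for feature extraction, additional-feature acquisition, target-question construction and the multi-modal model call are hypothetical placeholders supplied by the caller; they are not interfaces defined in this application.

    from collections import Counter
    from typing import Callable, Iterable

    def answer_original_question(
        original_question: str,
        extract_initial_features: Callable[[str], dict],           # step 101 (hypothetical helper)
        get_additional_features: Callable[[dict], Iterable[str]],  # step 102 (hypothetical helper)
        build_target_question: Callable[[dict, str], str],         # step 103 (hypothetical helper)
        run_multimodal_model: Callable[[str], str],                 # step 104 (hypothetical helper)
    ) -> str:
        # Step 101: initial question features that match the configured question template.
        initial_features = extract_initial_features(original_question)
        # Step 102: additional question features associated with the initial features.
        additional_features = list(get_additional_features(initial_features))
        # Step 103: one target question per selected additional feature; each target
        # question contains the initial features and at least one additional feature.
        target_questions = [build_target_question(initial_features, extra)
                            for extra in additional_features]
        # Step 104: one candidate answer per target question from the multi-modal model.
        candidate_answers = [run_multimodal_model(q) for q in target_questions]
        # Step 105: take the most frequent candidate answer as the target answer.
        target_answer, _ = Counter(candidate_answers).most_common(1)[0]
        return target_answer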
Illustratively, determining the initial question features that match the configured question template based on the original input question may include, but is not limited to: if the question template includes a feature type, determining whether a key value matching the feature type exists in the original input question; if so, an initial question feature may be determined based on the key value. If not, the original input question may be analyzed to obtain an initial question feature matching the feature type; alternatively, a prompt message for the feature type may be sent to the user, the user may provide a new original input question based on the prompt message, and if a key value matching the feature type exists in the new original input question, the initial question feature may be determined based on the key value.
Illustratively, obtaining the plurality of additional question features associated with the initial question features based on the initial question features may include, but is not limited to: inputting the initial question features into a trained feature analysis target model, and outputting, by the feature analysis target model, a plurality of additional question features associated with the initial question features. Alternatively, the initial question features are input into a multi-modal target model, and the multi-modal target model outputs a plurality of additional question features associated with the initial question features. Alternatively, a plurality of additional question features associated with the initial question features are obtained by analyzing the initial question features.
Illustratively, the feature analysis target model may be a feature analysis target model that matches the target scene type; the problem template may be a problem template that matches the target scene type; the target scene type may be a behavioral anomaly type, a vehicular anomaly type, or an industrial anomaly type.
Illustratively, generating the plurality of target questions based on the initial question feature and the plurality of additional question features may include, but is not limited to: selecting candidate additional problem features from the plurality of additional problem features; after the initial question feature and the candidate additional question feature are combined, context information is added to the initial question feature and/or the candidate additional question feature, and a target question corresponding to the candidate additional question feature is obtained.
Illustratively, determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question may include, but is not limited to: filtering repeated candidate answers out of all candidate answers based on the candidate answer corresponding to each target question to obtain at least one remaining candidate answer, and determining the number of occurrences corresponding to each remaining candidate answer; determining the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question; or, if the occurrence ratio corresponding to the largest number of occurrences is larger than a preset threshold, determining the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question.
For example, after determining a target answer corresponding to the original input question based on the candidate answer corresponding to each target question, target data corresponding to the target answer may be output, where the target data may include the target answer, or the target data may include the target answer and additional question features corresponding to the target answer; wherein, the target answers may include, but are not limited to, answers in at least one dimension of images, text, sound, and video; the target answer may correspond to a plurality of candidate answers, and the additional question feature corresponding to the target answer may include an additional question feature corresponding to each candidate answer corresponding to the target answer.
For example, the above execution sequence is only an example given for convenience of description; in practical applications, the execution sequence of the steps may be changed, and this is not limited. Moreover, in other embodiments, the steps of the corresponding methods need not be performed in the order shown and described herein, and the methods may include more or fewer steps than described herein. Furthermore, individual steps described in this specification may, in other embodiments, be split into multiple steps, and multiple steps described in this specification may, in other embodiments, be combined into a single step.
According to the above technical solution, in the embodiments of the present application, a plurality of target questions can be generated based on the initial question features and the plurality of additional question features, a candidate answer corresponding to each target question is obtained through the multi-modal target model, and the target answer corresponding to the original input question is determined based on the candidate answer corresponding to each target question. Because the target answer is obtained by means of the multi-modal target model, and the plurality of target questions make the target answer accurate and reliable, the target answer corresponding to the original input question can be queried from a large amount of data while consuming only a small amount of computing resources, and the time needed to obtain the target answer is short. This saves computing resources of the server, reduces time overhead, and allows the target answer corresponding to the original input question to be located rapidly. The solution also provides anomaly logic analysis and reasoning capability and enables automatic anomaly diagnosis: the analysis capability of the multi-modal target model is used to give an accurate and reliable target answer while comprehensively considering information in multiple dimensions such as images, video and voice.
The above technical solutions of the embodiments of the present application are described below with reference to specific application scenarios.
Since the amount of network data (such as text data, video data, audio data, and image data) is large, the server needs to consume a large amount of computing resources and a large amount of time to retrieve the data the user needs, which leads to heavy resource consumption on the server and a long wait before the user obtains the required data. For example, a real-world scene may deploy a large number of cameras that capture a large number of images, and analyzing the data the user needs from these images consumes a large amount of computing resources and takes a large amount of time.
In view of this, the embodiments of the present application provide a data processing method based on a multi-modal model: a target answer is obtained by means of a multi-modal target model, an accurate and reliable target answer is obtained through a plurality of target questions, the target answer corresponding to the original input question can be queried from a large amount of data while consuming only a small amount of computing resources, and the time needed to obtain the target answer is short. This saves computing resources of the server, reduces time overhead, and allows the target answer corresponding to the original input question to be located rapidly.
For example, a question template may be preconfigured, and the question template may include at least one feature type, indicating that the original input question should provide a key value matching that feature type. The question template may be a question template matched with a certain scene type; for example, the scene type may include, but is not limited to, a behavior anomaly type, a vehicle anomaly type, an industrial anomaly type, etc. Thus, a problem template a1 matched with the behavior anomaly type, a problem template a2 matched with the vehicle anomaly type, a problem template a3 matched with the industrial anomaly type, etc. may be preconfigured. Of course, these are only examples of scene types, and no limitation is imposed.
For example, the problem template a1 may include, but is not limited to, at least one of: a time type, a location type, and a behavior event type. The behavior event type may be a specific behavior; in the following, this behavior is denoted as behavior X, which is not limiting.
The problem template a2 may include, but is not limited to, at least one of: a time type, a location type, and a behavior event type. Here the behavior event type may be a specific vehicle behavior, such as a "running a red light" behavior or a "speeding" behavior, and indicates the type of event in which that vehicle behavior occurs; this is not limiting.
The problem template a3 may include, but is not limited to, at least one of: a sound type, a vibration type, an acceleration type, an angular velocity type, a temperature type, and a humidity type; this is not limiting.
Of course, the above are just a few examples of problem templates, and the problem templates are not limited.
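As an aid to understanding, the following is a minimal sketch of how the problem templates a1, a2 and a3 described above might be represented in memory, keyed by scene type. The dictionary keys and feature-type names are illustrative assumptions, not part of this application.

    # Illustrative only: one possible representation of problem templates a1/a2/a3,
    # keyed by scene type; the application does not prescribe any particular format.
    QUESTION_TEMPLATES = {
        "behavior_anomaly": ["time", "location", "behavior_event"],             # template a1
        "vehicle_anomaly": ["time", "location", "vehicle_behavior_event"],      # template a2
        "industrial_anomaly": ["sound", "vibration", "acceleration",
                               "angular_velocity", "temperature", "humidity"],  # template a3
    }

    def feature_types_for(target_scene_type: str) -> list[str]:
        # The feature types for which the original input question should provide key values.
        return QUESTION_TEMPLATES[target_scene_type]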
For example, for each scene type, a feature analysis target model corresponding to the scene type may be trained, for example, a feature analysis target model b1 corresponding to a behavior anomaly type, a feature analysis target model b2 corresponding to a vehicle anomaly type, and a feature analysis target model b3 corresponding to an industrial anomaly type may be trained, and of course, the same feature analysis target model may be trained for all scene types.
For training the feature analysis target model b1, sample data related to the "behavior anomaly type" may be collected. The sample data include information related to the time type, information related to the location type, and information related to behavior X, and may also include other types of information. The feature analysis target model b1 may then be trained based on a large amount of such sample data; the training process is not limited here.
For training the feature analysis target model b2, sample data related to the "vehicle anomaly type" may be collected. The sample data include information related to the time type, information related to the location type, and information related to the specific vehicle behavior, and may also include other types of information. The feature analysis target model b2 may then be trained based on a large amount of such sample data; the training process is not limited here.
For training the feature analysis target model b3, sample data related to the "industrial anomaly type" may be collected. The sample data include information related to the sound type, the vibration type, the acceleration type, the angular velocity type, the temperature type, and the humidity type, and may also include other types of information. The feature analysis target model b3 may then be trained based on a large amount of such sample data; the training process is not limited here.
Of course, the above are just a few examples of the target model for feature analysis, which is not limiting.
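The application does not limit the training process; purely as an illustration of how the sample data described above might be organized for feature analysis target model b1, one supervised pairing of initial-feature information with additional-feature labels could look like the following. The field names and values are assumptions.

    # Illustrative sample-data layout for training feature analysis target model b1
    # (behavior anomaly type); the supervised input/label pairing is an assumption.
    training_samples_b1 = [
        {
            "input": {"time": "around 3 p.m. on March 21, 2022",
                      "location": "Innovation Road intersection",
                      "behavior_event": "behavior X"},
            "labels": ["pedestrian speed", "pedestrian clothing", "pedestrian expression"],
        },
        # ... further samples of the same shape, plus any other types of information
    ]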
In the above application scenario, an embodiment of the present application proposes a data processing method based on a multi-modal model. The method may be applied to a server, and as shown in FIG. 2, the method may include:
Step 201, obtaining an original input question. For example, the original input question may be received from a user, or may be obtained in other ways; the source of the original input question is not limited.
For example, for an application scenario of the behavior anomaly type, an example of the original input question is "help me search for the pedestrians who appeared at the Innovation Road intersection around 3 p.m. on March 21, 2022; who is most likely the subject of behavior X at that intersection at that point in time".
For example, for an application scenario of the vehicle anomaly type, an example of the original input question is "help me search for the vehicles that appeared at the Innovation Road intersection around 3 p.m. on March 21, 2022; which is most likely the one that ran the red light at that intersection at that point in time".
Of course, the above are just two examples of the original input problem, and the original input problem is not limited thereto.
Step 202, determining initial question features matched with the question templates based on the original input questions.
For example, a target scene type (i.e., the scene type of the current application scene) may be determined, and a question template matching the target scene type may be determined, with initial question features matching the question template being determined based on the original input questions. For example, if the target scene type is a behavioral exception type, an initial question feature matching the question template a1 is determined based on the original input question. If the target scene type is a vehicle anomaly type, determining initial problem features matched with the problem template a2 based on the original input problem. If the target scene type is an industrial anomaly type, determining initial problem features matched with the problem template a3 based on the original input problem.
For example, if the problem template matching the target scene type is problem template a1, and problem template a1 includes the time type, the location type, and the behavior event type, then the key value "around 3 p.m. on March 21, 2022" matching the time type is extracted from the original input question, the key value "Innovation Road intersection" matching the location type is extracted from the original input question, and the key value "behavior X" matching the behavior event type is extracted from the original input question. These key values are used as the initial question features, i.e., the initial question features include "around 3 p.m. on March 21, 2022", "Innovation Road intersection", and "behavior X".
For another example, if the problem template matching the target scene type is problem template a2, and problem template a2 includes the time type, the location type, and the behavior event type, then the key value "around 3 p.m. on March 21, 2022" matching the time type is extracted from the original input question, the key value "Innovation Road intersection" matching the location type is extracted from the original input question, and the key value "red-light-running behavior" matching the behavior event type is extracted from the original input question; these key values may be used as the initial question features.
In summary, it can be seen that, based on the problem template matching with the target scene type, the initial problem feature can be extracted from the original input problem, and the extraction manner of the initial problem feature is not limited.
In one possible implementation, to determine the initial problem characteristics, the following may be used:
mode 1: for each feature type (such as time type, location type, behavior event type, etc.) in the question template, determining whether a key value matching the feature type exists in the original input question; if yes, determining an initial problem feature based on the key value, and taking the key value as the initial problem feature. If not, analyzing the original input problem to obtain the initial problem feature matched with the feature type.
For example, if there is a key value in the original input question that matches the feature type, the key value may be replaced with a variable using a key-value map, so that the key value is used as the initial question feature.
For example, if no key value matching the feature type exists in the original input question, the original input question may be analyzed to obtain a key value matching the feature type, and that key value is used as the initial question feature; the analysis process of the original input question is not limited in this embodiment.
Obviously, when the question template corresponds to a plurality of feature types, a key value matched by each feature type needs to be determined, that is, each key value is taken as an initial question feature, and each feature type corresponds to one initial question feature.
Mode 2: determining whether a key value matched with each feature type exists in the original input problem aiming at each feature type in the problem template; if yes, determining an initial problem feature based on the key value, and taking the key value as the initial problem feature. If not, prompt information aiming at the feature type is sent to the user (the prompt information indicates that a key value matched with the feature type is lacking in the original input questions), the user provides a new original input question based on the prompt information (namely, a key value matched with the feature type is added in the previous original input questions to obtain the new original input questions), and if the key value matched with the feature type exists in the new original input questions, the initial question features can be determined based on the key value.
Mode 3: determining whether a key value matched with each feature type exists in the original input problem aiming at each feature type in the problem template; if yes, determining an initial problem feature based on the key value, and taking the key value as the initial problem feature. If not, analyzing the original input problem, and if the key value matched with the feature type is analyzed, taking the key value as the initial problem feature matched with the feature type. If the key value matched with the feature type is not analyzed, prompt information aiming at the feature type is sent to the user, the user provides a new original input problem based on the prompt information, and if the key value matched with the feature type exists in the new original input problem, the initial problem feature is determined based on the key value.
Of course, the above three methods are merely examples, and the method of acquiring the initial problem feature is not limited.
In summary, in step 202, initial question features that match the question template may be determined based on the original input question; in the following, the initial question features are denoted as feature A, feature B, and feature C.
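The application does not limit how key values are extracted, so the following is only a rough sketch of one possible implementation of step 202 for problem template a1. The regular expressions assume the particular English phrasing of the example question above; both the patterns and the feature-type names are assumptions.

    import re

    # Assumed patterns for template a1's feature types; they only fit a question phrased
    # like the example ("... at the Innovation Road intersection around 3 p.m. on March 21, 2022 ...").
    PATTERNS = {
        "time": re.compile(r"around .+? on [A-Za-z]+ \d{1,2}, \d{4}"),
        "location": re.compile(r"at the ([\w ]+ intersection)"),
        "behavior_event": re.compile(r"subject of ([\w ]+?) at"),
    }

    def extract_initial_features(original_question: str, feature_types: list[str]) -> dict:
        features = {}
        for feature_type in feature_types:
            match = PATTERNS[feature_type].search(original_question)
            if match:
                # A matching key value exists in the original input question:
                # use it directly as the initial question feature (mode 1 above).
                features[feature_type] = match.group(match.lastindex or 0)
            else:
                # Otherwise the question would be analysed further, or the user would be
                # prompted to supply the missing key value (modes 1-3 above, not shown).
                features[feature_type] = None
        return features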
Step 203, obtaining a plurality of additional question features associated with the initial question feature based on the initial question feature.
For example, after the initial question features are obtained, a plurality of additional question features may be obtained based on the initial question features. The additional question features are features added on top of the initial question features; they are variable and relate to the initial question features. For example, for an application scenario of the behavior anomaly type, the initial question features include "behavior X", and the additional question features are features related to "behavior X", such as a pedestrian speed feature, a pedestrian clothing feature, and a pedestrian expression feature. For an application scenario of the vehicle anomaly type, the initial question features include the red-light-running behavior, and the additional question features are features related to that behavior, such as a vehicle speed feature, a license plate recognition feature, and a vehicle color feature.
In one possible implementation, a number of additional problem features may be obtained as follows:
mode 1: the initial problem feature is input to a trained feature analysis target model, and a plurality of additional problem features associated with the initial problem feature are output by the feature analysis target model.
Illustratively, the feature analysis target model is a feature analysis target model that matches a target scene type, which may be a behavioral anomaly type, a vehicular anomaly type, an industrial anomaly type. For example, if the target scene type is a behavioral abnormality type, the initial problem feature is input to the feature analysis target model b1, the initial problem feature is processed by the feature analysis target model b1, a plurality of additional problem features associated with the initial problem feature are obtained, and a plurality of additional problem features are output. If the target scene type is the abnormal type of the vehicle, the initial problem feature is input into the feature analysis target model b2, the feature analysis target model b2 processes the initial problem feature to obtain a plurality of additional problem features related to the initial problem feature, and the plurality of additional problem features are output. If the target scene type is the industrial anomaly type, the initial problem feature is input into the feature analysis target model b3, the feature analysis target model b3 processes the initial problem feature to obtain a plurality of additional problem features related to the initial problem feature, and the plurality of additional problem features are output.
For example, when the feature analysis target model is trained based on sample data, the sample data may include information about the initial question features (such as information about the time type, the location type, and behavior X) and may further include other types of information; these other types of information may serve as label information for the initial question features, that is, information reflecting the additional question features. By continuously learning and optimizing the feature analysis target model, the model can reflect the relationship between initial question features and additional question features, so that after the initial question features are input into the trained feature analysis target model, a plurality of additional question features associated with the initial question features are obtained.
Mode 2: the initial problem feature is input to a multi-modal target model (i.e., a multi-modal target large model) from which a plurality of additional problem features associated with the initial problem feature are output.
The multi-modal target model is obtained by continuous learning and optimization based on big data, has strong processing capacity, can be suitable for data processing of all scene types, such as behavioral exception types, vehicle exception types, industrial exception types and the like, and does not limit the sources of the multi-modal target model. After the initial question feature is input into the multi-mode target model, the multi-mode target model processes the initial question feature to obtain a plurality of additional question features associated with the initial question feature, and outputs a plurality of additional question features.
Mode 3: a plurality of additional issue features associated with the initial issue feature are analyzed based on the initial issue feature.
For example, raw data corresponding to the initial question features may be obtained, and a plurality of additional question features associated with the initial question features may be analyzed based on that raw data. For example, if the initial question features include "around 3 p.m. on March 21, 2022", "Innovation Road intersection", and "behavior X", the images generated at the Innovation Road intersection around 3 p.m. on March 21, 2022 can be acquired, and the additional question features associated with behavior X, such as a pedestrian speed feature, a pedestrian clothing feature, and a pedestrian expression feature, are analyzed from these images. Of course, this is merely an example of obtaining a plurality of additional question features based on raw data analysis, and is not limiting, as long as a plurality of additional question features associated with the initial question features can be obtained through analysis.
In summary, a plurality of additional problem features associated with the initial problem feature may be obtained, and of course, modes 1, 2 and 3 are merely examples, and the manner of obtaining the additional problem features is not limited.
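As a rough sketch of step 203 under mode 1 (a scene-specific feature analysis target model), the code below simply dispatches to the model that matches the target scene type. The model objects and their call interface are hypothetical placeholders; modes 2 and 3 would replace the call with the multi-modal target model or with raw-data analysis.

    from typing import Callable, Iterable

    def get_additional_features(
        initial_features: dict,
        target_scene_type: str,
        feature_analysis_models: dict[str, Callable[[dict], Iterable[str]]],  # e.g. {"behavior_anomaly": b1, ...}
    ) -> list[str]:
        # Mode 1: pick the feature analysis target model matching the target scene type
        # (b1 for behavior anomalies, b2 for vehicle anomalies, b3 for industrial anomalies)
        # and let it map the initial question features to associated additional features,
        # e.g. pedestrian speed / clothing / expression for behavior X.
        model = feature_analysis_models[target_scene_type]
        return list(model(initial_features))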
Step 204, generating a plurality of target questions based on the initial question feature and the plurality of additional question features; wherein each target issue may include the initial issue feature and at least one additional issue feature.
For example, at least one additional question feature may be selected from the plurality of additional question features as a candidate additional question feature, and a target question may be generated based on the initial question features and the candidate additional question feature. If the plurality of additional question features includes additional question feature c1, additional question feature c2, and additional question feature c3, then target question d1 may be generated based on the initial question features and additional question feature c1 (i.e., additional question feature c1 is the candidate additional question feature), target question d2 may be generated based on the initial question features and additional question feature c2, target question d3 may be generated based on the initial question features and additional question feature c3, target question d4 may be generated based on the initial question features, additional question feature c1, and additional question feature c3, and so on.
Illustratively, generating the target question based on the initial question feature and the candidate additional question feature may include, but is not limited to: and combining the initial question feature and the candidate additional question feature, and adding context information to the initial question feature and/or the candidate additional question feature after combining the initial question feature and the candidate additional question feature to obtain a target question corresponding to the candidate additional question feature. For example, the context information may be added to the initial question feature, i.e., before the initial question feature, and/or after the initial question feature, without limitation to the content of the context information. For another example, context information may be added for the candidate additional problem feature. For another example, context information may be added for the initial question feature and context information may be added for the candidate additional question feature.
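The following is a minimal sketch of step 204 for the behavior-anomaly example, assuming the initial question features are stored under the keys time, location and behavior_event; the exact wording of the combined question and of the added context is illustrative only.

    def build_target_question(initial_features: dict, additional_feature: str) -> str:
        # Combine the initial question features into a base question.
        base = ("Help me search for the pedestrians who appeared at {location} around {time}; "
                "who is most likely the subject of {behavior_event}").format(**initial_features)
        # Add context for the candidate additional question feature.
        return f"{base}? Pay particular attention to the {additional_feature} of each pedestrian."

    # One target question per selected additional question feature:
    # target_questions = [build_target_question(initial_features, extra) for extra in additional_features]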
Step 205, for each target question, the target question is input to a trained multi-modal target model (multi-modal target big model), and a candidate answer corresponding to the target question is output by the multi-modal target model.
The multi-mode target model is obtained by continuous learning and optimization based on big data, has strong processing capacity, can be suitable for data processing of all scene types, and is not limited in source. After the target question is input into the multi-mode target model, the multi-mode target model processes the target question to obtain a candidate answer corresponding to the target question, and outputs the candidate answer.
In summary, the candidate answer corresponding to each target question may be obtained based on the multi-modal target model, such as the candidate answer e1 corresponding to the target question d1, the candidate answer e2 corresponding to the target question d2, the candidate answer e3 corresponding to the target question d3, the candidate answer e4 corresponding to the target question d4, and so on.
For each candidate answer, the candidate answer may be an image-dimensional candidate answer, a text-dimensional candidate answer, a sound-dimensional candidate answer, a video-dimensional candidate answer, or a plurality of-dimensional candidate answers, such as a text-and-video-combination-dimensional candidate answer.
Step 206, determining a target answer corresponding to the original input question based on the candidate answer corresponding to each target question, wherein the target answer may be a final output result for the original input question.
For example, based on the candidate answer corresponding to each target question, repeated candidate answers may be filtered out of all candidate answers to obtain at least one remaining candidate answer, and the number of occurrences corresponding to each remaining candidate answer may be determined. Then, the target answer is selected from all the remaining candidate answers based on the number of occurrences corresponding to each remaining candidate answer. For example, the remaining candidate answer with the largest number of occurrences is determined as the target answer corresponding to the original input question; or, if the occurrence ratio corresponding to the largest number of occurrences is larger than a preset threshold, the remaining candidate answer with the largest number of occurrences is determined as the target answer corresponding to the original input question.
For example, assume that there are 6 candidate answers in total: candidate answer e1 is mmm, candidate answer e2 is mmm, candidate answer e3 is mmm, candidate answer e4 is mmm, candidate answer e5 is nnn, and candidate answer e6 is nnn. After filtering out the repeated candidate answers (i.e., candidate answer e2, candidate answer e3, candidate answer e4, and candidate answer e6), the remaining candidate answers are candidate answer e1 and candidate answer e5. Obviously, the number of occurrences corresponding to candidate answer e1 is 4 and the number of occurrences corresponding to candidate answer e5 is 2, that is, "mmm" occurs 4 times and "nnn" occurs 2 times.
Candidate answer e1, which has the largest number of occurrences, can be used as the target answer corresponding to the original input question, i.e., the target answer is mmm. Alternatively, it can be judged whether the occurrence ratio corresponding to the largest number of occurrences (i.e., 4/6) is larger than a preset threshold; if so, candidate answer e1 with the largest number of occurrences is used as the target answer corresponding to the original input question, and if not, no target answer corresponding to the original input question is obtained.
In summary, the candidate answer with the largest number of occurrences is the most frequently occurring answer; it represents a consensus found across all candidate answers and therefore has higher confidence, that is, it is correct and reliable. To ensure the accuracy of the final output, this candidate answer is used as the target answer corresponding to the original input question, and the target answer can be the final output result for the original input question.
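A minimal sketch of steps 205 and 206, assuming the candidate answers returned by the multi-modal target model can be compared for equality; the 0.5 threshold is an illustrative value for the preset occurrence-ratio threshold.

    from collections import Counter

    def select_target_answer(candidate_answers: list[str], ratio_threshold: float | None = 0.5):
        counts = Counter(candidate_answers)                 # duplicate answers collapse into counts
        best_answer, best_count = counts.most_common(1)[0]  # remaining answer with most occurrences
        if ratio_threshold is not None and best_count / len(candidate_answers) <= ratio_threshold:
            return None                                     # no sufficiently consistent target answer
        return best_answer

    # With the example above, "mmm" occurs 4 times out of 6 (ratio 4/6), so it is selected:
    print(select_target_answer(["mmm", "mmm", "mmm", "mmm", "nnn", "nnn"]))  # -> mmm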
Step 207, outputting target data corresponding to the target answer, where the target data may include the target answer, or the target data may include the target answer and additional question features corresponding to the target answer.
Illustratively, the target answers may include, but are not limited to, answers in at least one dimension of images, text, sound, and video. For example, the target answer may be a target answer in an image dimension, a target answer in a text dimension, a target answer in a sound dimension, a target answer in a video dimension, or a target answer in multiple dimensions, such as a target answer in a combination of text and video dimensions.
The target answer may correspond to a plurality of candidate answers, and the additional question features corresponding to the target answer include the additional question feature corresponding to each of those candidate answers. For example, when the target answer is mmm, the target answer corresponds to candidate answer e1, candidate answer e2, candidate answer e3, and candidate answer e4; thus, the additional question features corresponding to the target answer may include: the additional question feature corresponding to candidate answer e1 (candidate answer e1 is obtained from target question d1, which is based on the initial question features and additional question feature c1, so the additional question feature corresponding to candidate answer e1 is additional question feature c1), the additional question feature corresponding to candidate answer e2, the additional question feature corresponding to candidate answer e3, and the additional question feature corresponding to candidate answer e4.
At this point, the target answer to the original input question has been output, and the data processing based on the multi-modal model is complete.
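As an illustration of the target data described in step 207, the container below groups the target answer with the additional question features of the candidate answers that produced it; the field names and example values are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class TargetData:
        # The target answer may be in one or more dimensions (image, text, sound, video);
        # a string is used here purely for illustration.
        target_answer: str
        # One entry per candidate answer corresponding to the target answer, e.g. the
        # additional question feature c1 that produced candidate answer e1, and so on.
        additional_features: list[str] = field(default_factory=list)

    example = TargetData(
        target_answer="mmm",
        additional_features=["additional question feature c1", "additional question feature c2",
                             "additional question feature c3", "additional question feature c4"],
    )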
In one possible implementation, to provide the model with self-reasoning capabilities, the complex problem can be broken down into a simpler multi-step procedure and the method verified in each step in a number of ways. For example, referring to fig. 3, a schematic diagram of a self-reasoning method based on a multi-modal model may include the processes of inputting questions, generating a question template, analyzing the logic of the questions, multi-modal model, and technical verification.
First, in the question input process, an original input question may be obtained. An example of the original input question is "help me search for the pedestrians who appeared at the Innovation Road intersection around 3 p.m. on March 21, 2022; who is most likely the subject of behavior X at that intersection at that point in time".
Second, in the question-template generation process, initial question features that match the question template may be determined based on the original input question. For example, the original input question is converted into a mathematical-logic form, and key values are replaced with variables using a key-value map to obtain the initial question features and a modified question. For example, the original input question is modified into the following form: help me search for the pedestrians who appeared at B around A; who is most likely C. A: around 3 p.m. on March 21, 2022; B: the Innovation Road intersection; C: behavior X.
Third, in the problem logic analysis process, a plurality of additional problem features associated with the initial problem feature may be obtained and a plurality of target problems may be generated based on the initial problem feature and the plurality of additional problem features.
For example, based on multiple verification and cross-checking approaches, different methods are used to generate an analysis solution, prompts are provided to the multi-modal target model, and additional context is generated for the question. The logically modified question can be: at time A and location B, with behavior C, the pedestrian may have feature K, where K is variable, i.e., K is an additional question feature. Therefore, a plurality of target questions can be generated by varying the additional question feature K, and each target question may include the initial question features and an additional question feature.
Fourth, in the multi-modal model process, each target question may be input to the multi-modal target model, and the candidate answer corresponding to each target question may be output by the multi-modal target model. For example, using logic to modify a question (i.e., a target question) to a multi-modal target model to analyze and output a candidate answer, the candidate answer uses the capability of the multi-modal target model, i.e., the multi-modal target model analyzes various dimensions of images, videos, texts, voices, etc., compares the output results, finds consensus among the various dimensions of information, and can also provide higher confidence, so as to obtain a candidate answer corresponding to the target question, i.e., the candidate answer is correct and reliable.
For example, the multi-modal target model automatically analyzes the persons exhibiting behavior X who match the time and place, determines identity information based on biometric features, uses video structuring features such as clothing color and whether a backpack is carried, and adds big-data analysis, so that the most likely target object is obtained through automatic reasoning and analysis, and the current possible position of the target object is rapidly located using spatial position information.
Fifth, in the technical verification process, the target answer corresponding to the original input question may be determined based on the candidate answer corresponding to each target question. For example, to ensure consensus among the outputs of the various phrasings (i.e., the multiple target questions), the candidate answer corresponding to each target question is determined, and the most frequently occurring candidate answer is determined as the target answer, so as to ensure the accuracy of the final output. A highest-confidence list is output in a continuous loop, the most likely target object is finally output, and the additional question features of the output target object are then determined. The whole self-reasoning process is thus completed, and the target answer corresponding to the original input question is obtained.
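To tie the five processes of FIG. 3 together, the following is a rough sketch of the self-reasoning loop, assuming the same hypothetical helper callables as in the earlier sketches; the number of rounds and the consensus threshold are illustrative values.

    from collections import Counter
    from typing import Callable, Iterable

    def self_reasoning_loop(
        initial_features: dict,
        get_additional_features: Callable[[dict], Iterable[str]],  # question logic analysis
        build_target_question: Callable[[dict, str], str],         # question with extra context
        run_multimodal_model: Callable[[str], str],                 # multi-modal model process
        ratio_threshold: float = 0.5,
        max_rounds: int = 3,
    ):
        candidate_answers = []
        for _ in range(max_rounds):
            # Re-phrase the question with the (variable) additional feature K and collect answers.
            for extra in get_additional_features(initial_features):
                candidate_answers.append(
                    run_multimodal_model(build_target_question(initial_features, extra)))
            if not candidate_answers:
                continue
            # Technical verification: keep the answer only if consensus is strong enough.
            best_answer, best_count = Counter(candidate_answers).most_common(1)[0]
            if best_count / len(candidate_answers) > ratio_threshold:
                return best_answer
        return None  # no sufficiently consistent answer after max_rounds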
According to the above technical solution, in the embodiments of the present application, the target answer is obtained by means of the multi-modal target model, and an accurate and reliable target answer is obtained through a plurality of target questions. The target answer corresponding to the original input question can be queried from a large amount of data while consuming only a small amount of computing resources, and the time needed to obtain the target answer is short, which saves computing resources of the server, reduces time overhead, and allows the target answer corresponding to the original input question to be located rapidly. The solution provides anomaly logic analysis and reasoning capability and enables automatic anomaly diagnosis: the analysis capability of the multi-modal target model is used to give an accurate and reliable target answer while comprehensively considering information in multiple dimensions such as images, video and voice. Through the capability of the multi-modal target model, a user can automatically reason about and rapidly locate a target object by inputting a simple description, which improves the performance of the multi-modal target model on logical-reasoning problems and allows it to reason automatically and fulfil the optimization requirement. Anomalies can be located rapidly, and anomaly logic analysis and reasoning are given at the same time, thereby realizing automatic anomaly diagnosis.
Based on the same application concept as the above method, an embodiment of the present application provides a data processing device based on a multi-modal model. Referring to FIG. 4, which shows a schematic structural diagram of the device, the device may include:
A determining module 41 for determining initial question features matching the question template based on the original input questions; an acquisition module 42 for acquiring a plurality of additional problem features associated with the initial problem feature; a generating module 43, configured to generate a plurality of target questions based on the initial question feature and the plurality of additional question features, each target question including the initial question feature and at least one additional question feature; the obtaining module 42 is further configured to input, for each target question, the target question to a trained multi-modal target model, and output, by the multi-modal target model, a candidate answer corresponding to the target question; the determining module 41 is further configured to determine a target answer corresponding to the original input question based on the candidate answer corresponding to each target question, where the target answer is an output result of the original input question.
Illustratively, the determining module 41 is specifically configured to, when determining an initial question feature that matches a question template based on an original input question: if the question template comprises a feature type, determining whether a key value matched with the feature type exists in the original input question; if yes, determining the initial problem feature based on the key value; if not, analyzing the original input problem to obtain an initial problem feature matched with the feature type; or sending prompt information aiming at the feature type to a user, providing a new original input problem by the user based on the prompt information, and determining the initial problem feature based on the key value if the key value matched with the feature type exists in the new original input problem.
Illustratively, the acquisition module 42 is specifically configured to, when acquiring a plurality of additional issue features associated with the initial issue feature: inputting the initial problem feature into a trained feature analysis target model, outputting a plurality of additional problem features associated with the initial problem feature from the feature analysis target model; alternatively, the initial question feature is input to the multi-modal object model, and a plurality of additional question features associated with the initial question feature are output by the multi-modal object model; alternatively, a plurality of additional issue features associated with the initial issue feature are analyzed based on the initial issue feature.
Illustratively, the feature analysis target model is a feature analysis target model that matches a target scene type; the problem template is a problem template matched with the target scene type; the target scene type is a behavior anomaly type, a vehicle anomaly type or an industrial anomaly type.
Illustratively, the generating module 43 is specifically configured to, when generating a plurality of target questions based on the initial question feature and the plurality of additional question features: selecting candidate additional problem features from the plurality of additional problem features; and after combining the initial question feature and the candidate additional question feature, adding context information for the initial question feature and/or the candidate additional question feature to obtain a target question corresponding to the candidate additional question feature.
Illustratively, the determining module 41 is specifically configured to, when determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question: filter repeated candidate answers out of all candidate answers based on the candidate answer corresponding to each target question to obtain at least one remaining candidate answer, and determine the number of occurrences corresponding to each remaining candidate answer; determine the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question; or, if the occurrence ratio corresponding to the largest number of occurrences is larger than a preset threshold, determine the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question.
Illustratively, the determining module 41 is further configured to, after determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question: output target data corresponding to the target answer, where the target data includes the target answer, or the target data includes the target answer and the additional question features corresponding to the target answer; the target answer includes an answer in at least one of the image, text, sound, and video dimensions; the target answer corresponds to a plurality of candidate answers, and the additional question features corresponding to the target answer include the additional question features corresponding to each of those candidate answers.
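Finally, assembling the output target data can be illustrated as follows; the dictionary layout and the mapping from candidate answers to their additional question features are hypothetical and only indicate one possible arrangement:

from typing import Dict, List

def build_target_data(target_answer: str,
                      answer_to_features: Dict[str, List[str]],
                      include_features: bool = True) -> dict:
    """Assemble the target data for the target answer, optionally attaching the
    additional question features of every candidate answer equal to it."""
    data = {"target_answer": target_answer}
    if include_features:
        data["additional_question_features"] = answer_to_features.get(target_answer, [])
    return data

print(build_target_data("yes", {"yes": ["indoors", "near flammable material"]}))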
Based on the same application concept as the above method, an embodiment of the present application further provides an electronic device. As shown in fig. 5, the electronic device includes a processor 51 and a machine-readable storage medium 52, the machine-readable storage medium 52 storing machine-executable instructions executable by the processor 51; the processor 51 is configured to execute the machine-executable instructions to implement the data processing method based on a multi-modal model disclosed in the above examples of the present application.
Based on the same application concept as the above method, an embodiment of the present application further provides a machine-readable storage medium storing a plurality of computer instructions which, when executed by a processor, implement the data processing method based on a multi-modal model disclosed in the above examples of the present application.
The machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard disk drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), a similar storage medium, or a combination thereof.
The system, apparatus, module, or unit set forth in the above embodiments may be implemented by a computer entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing its functions into various units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Moreover, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (9)

1. A method of data processing based on a multimodal model, the method comprising:
determining initial question features that match the configured question template based on the original input questions; if the question template comprises a feature type and a key value matching the feature type exists in the original input question, determining the initial question feature based on the key value;
acquiring a plurality of additional question features associated with the initial question feature based on the initial question feature;
generating a plurality of target questions based on the initial question feature and the plurality of additional question features, each target question comprising the initial question feature and at least one additional question feature; wherein the generating a plurality of target questions based on the initial question feature and the plurality of additional question features comprises: selecting a candidate additional question feature from the plurality of additional question features; and, after combining the initial question feature with the candidate additional question feature, adding context information to the initial question feature and/or the candidate additional question feature to obtain a target question corresponding to the candidate additional question feature;
for each target question, inputting the target question into a trained multi-modal target model, and outputting, by the multi-modal target model, a candidate answer corresponding to the target question;
and determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question.
2. The method of claim 1, wherein the determining initial question features that match the configured question template based on the original input questions further comprises:
if the question template comprises a feature type and no key value matching the feature type exists in the original input question, analyzing the original input question to obtain an initial question feature matching the feature type; or sending prompt information for the feature type to a user so that the user provides a new original input question based on the prompt information, and, if a key value matching the feature type exists in the new original input question, determining the initial question feature based on the key value.
3. The method of claim 1, wherein the obtaining a plurality of additional question features associated with the initial question feature based on the initial question feature comprises:
inputting the initial question feature into a trained feature analysis target model, and outputting, by the feature analysis target model, a plurality of additional question features associated with the initial question feature; or,
inputting the initial question feature into the multi-modal target model, outputting, by the multi-modal target model, a plurality of additional question features associated with the initial question feature; or,
analyzing, based on the initial question feature, a plurality of additional question features associated with the initial question feature.
4. The method of claim 3, wherein:
the feature analysis target model is a feature analysis target model matched with a target scene type; the question template is a question template matched with the target scene type; and the target scene type is a behavior anomaly type, a vehicle anomaly type, or an industrial anomaly type.
5. The method of claim 1, wherein the determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question comprises:
filtering repeated candidate answers from all candidate answers based on the candidate answer corresponding to each target question to obtain at least one remaining candidate answer, and determining the number of occurrences corresponding to each remaining candidate answer;
determining the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question; or, if the occurrence proportion corresponding to the largest number of occurrences is greater than a preset threshold, determining the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question.
6. The method of claim 5, wherein after determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question, the method further comprises:
outputting target data corresponding to the target answer, wherein the target data comprises the target answer, or the target data comprises the target answer and additional question features corresponding to the target answer;
the target answer comprises an answer in at least one of the image, text, sound, and video dimensions;
the target answer corresponds to a plurality of candidate answers, and the additional question features corresponding to the target answer comprise additional question features corresponding to each candidate answer corresponding to the target answer.
7. A data processing apparatus based on a multi-modal model, the apparatus comprising:
a determining module, configured to determine an initial question feature matching the question template based on the original input question; wherein, if the question template comprises a feature type and a key value matching the feature type exists in the original input question, the initial question feature is determined based on the key value;
an acquisition module, configured to acquire a plurality of additional question features associated with the initial question feature;
a generating module, configured to generate a plurality of target questions based on the initial question feature and the plurality of additional question features, each target question comprising the initial question feature and at least one additional question feature; wherein the generating module is specifically configured to, when generating the plurality of target questions based on the initial question feature and the plurality of additional question features: select a candidate additional question feature from the plurality of additional question features; and, after combining the initial question feature with the candidate additional question feature, add context information to the initial question feature and/or the candidate additional question feature to obtain a target question corresponding to the candidate additional question feature;
the acquisition module is further configured to input, for each target question, the target question into a trained multi-modal target model, and to output, by the multi-modal target model, a candidate answer corresponding to the target question;
the determining module is further configured to determine a target answer corresponding to the original input question based on the candidate answer corresponding to each target question, where the target answer is an output result of the original input question.
8. The apparatus of claim 7, wherein:
the determining module is specifically configured to, when determining the initial question feature matching the question template based on the original input question: if the question template comprises a feature type and no key value matching the feature type exists in the original input question, analyze the original input question to obtain an initial question feature matching the feature type; or send prompt information for the feature type to a user so that the user provides a new original input question based on the prompt information, and, if a key value matching the feature type exists in the new original input question, determine the initial question feature based on the key value;
the acquisition module is specifically configured to, when acquiring the plurality of additional question features associated with the initial question feature: input the initial question feature into a trained feature analysis target model, and output, by the feature analysis target model, a plurality of additional question features associated with the initial question feature; or input the initial question feature into the multi-modal target model, and output, by the multi-modal target model, a plurality of additional question features associated with the initial question feature; or analyze, based on the initial question feature, a plurality of additional question features associated with the initial question feature;
the feature analysis target model is a feature analysis target model matched with a target scene type; the question template is a question template matched with the target scene type; the target scene type is a behavior anomaly type, a vehicle anomaly type, or an industrial anomaly type;
the determining module is specifically configured to, when determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question: filter repeated candidate answers from all candidate answers based on the candidate answer corresponding to each target question to obtain at least one remaining candidate answer, and determine the number of occurrences corresponding to each remaining candidate answer; determine the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question; or, if the occurrence proportion corresponding to the largest number of occurrences is greater than a preset threshold, determine the remaining candidate answer with the largest number of occurrences as the target answer corresponding to the original input question;
the determining module is further configured to, after determining the target answer corresponding to the original input question based on the candidate answer corresponding to each target question: output target data corresponding to the target answer, wherein the target data comprises the target answer, or the target data comprises the target answer and the additional question features corresponding to the target answer; the target answer comprises an answer in at least one of the image, text, sound, and video dimensions; and the target answer corresponds to a plurality of candidate answers, and the additional question features corresponding to the target answer comprise the additional question features corresponding to each of those candidate answers.
9. An electronic device, comprising: a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor; the processor is configured to execute machine executable instructions to implement the method of any of claims 1-6.
CN202310493430.0A 2023-04-28 2023-04-28 Data processing method, device and equipment based on multi-mode model Active CN116204726B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310493430.0A CN116204726B (en) 2023-04-28 2023-04-28 Data processing method, device and equipment based on multi-mode model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310493430.0A CN116204726B (en) 2023-04-28 2023-04-28 Data processing method, device and equipment based on multi-mode model

Publications (2)

Publication Number Publication Date
CN116204726A CN116204726A (en) 2023-06-02
CN116204726B true CN116204726B (en) 2023-07-25

Family

ID=86508050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310493430.0A Active CN116204726B (en) 2023-04-28 2023-04-28 Data processing method, device and equipment based on multi-mode model

Country Status (1)

Country Link
CN (1) CN116204726B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746518B (en) * 2024-02-19 2024-05-31 河海大学 Multi-mode feature fusion and classification method


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10607153B2 (en) * 2016-06-28 2020-03-31 International Business Machines Corporation LAT based answer generation using anchor entities and proximity
US20190004890A1 (en) * 2017-06-30 2019-01-03 Wipro Limited Method and system for handling one or more issues in a computing environment
CN111782838B (en) * 2020-06-30 2024-04-05 北京百度网讯科技有限公司 Image question-answering method, device, computer equipment and medium
CN113901191A (en) * 2021-06-16 2022-01-07 北京金山数字娱乐科技有限公司 Question-answer model training method and device
CN115033674A (en) * 2022-06-17 2022-09-09 中国平安人寿保险股份有限公司 Question-answer matching method, question-answer matching device, electronic equipment and storage medium
CN115761770A (en) * 2022-11-09 2023-03-07 京东科技控股股份有限公司 Entity recognition model training method, image recognition method, device and equipment
CN115757725A (en) * 2022-11-15 2023-03-07 中国平安财产保险股份有限公司 Question and answer processing method and device, computer equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309283A (en) * 2019-06-28 2019-10-08 阿里巴巴集团控股有限公司 A kind of answer of intelligent answer determines method and device
CN112989001A (en) * 2021-03-31 2021-06-18 建信金融科技有限责任公司 Question and answer processing method, device, medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Meaningful Answer Generation of E-Commerce Question-Answering; Shen Gao et al.; ACM; pp. 1-26 *
Knowledge Base Question Answering System Based on Multi-Feature Semantic Matching; Zhao Xiaohu; Zhao Chenglong; Journal of Computer Applications (07); pp. 1873-1878 *

Also Published As

Publication number Publication date
CN116204726A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN107358157B (en) Face living body detection method and device and electronic equipment
US20180218256A1 (en) Deep convolution neural network behavior generator
US20220080980A1 (en) Device for predicting speed of vehicle and method thereof
CN111931929A (en) Training method and device of multi-task model and storage medium
CN117079299B (en) Data processing method, device, electronic equipment and storage medium
CN110298240B (en) Automobile user identification method, device, system and storage medium
KR101617649B1 (en) Recommendation system and method for video interesting section
CN113095346A (en) Data labeling method and data labeling device
CN116204726B (en) Data processing method, device and equipment based on multi-mode model
US20200349425A1 (en) Training time reduction in automatic data augmentation
CN114549369B (en) Data restoration method and device, computer and readable storage medium
CN112804558B (en) Video splitting method, device and equipment
CN112308627B (en) Advertisement data access method based on block chain and artificial intelligence and big data center
JP2017010206A (en) Driving operation determination device
CN114996109B (en) User behavior recognition method, device, equipment and storage medium
US20220358658A1 (en) Semi Supervised Training from Coarse Labels of Image Segmentation
CN117252947A (en) Image processing method, image processing apparatus, computer, storage medium, and program product
CN115761599A (en) Video anomaly detection method and system
CN111444930B (en) Method and device for determining prediction effect of two-classification model
US20230205670A1 (en) Method and electronic checking system for checking performances of an application on a target equipment of a vehicle, related computer program and applications platform
CN116842384A (en) Multi-mode model training method and device, electronic equipment and readable storage medium
CN115578796A (en) Training method, device, equipment and medium for living body detection model
CN117828137A (en) Data query method, device, storage medium and program product
US20230101250A1 (en) Method for generating a graph structure for training a graph neural network
Urselmann et al. Semi-automatic Pipeline for Large-Scale Dataset Annotation Task: A DMD Application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant