CN114757287A - Automatic testing method based on multi-mode fusion of text and image - Google Patents

Automatic testing method based on multi-mode fusion of text and image

Info

Publication number
CN114757287A
CN114757287A CN202210412537.3A
Authority
CN
China
Prior art keywords
text
image
model
modal
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210412537.3A
Other languages
Chinese (zh)
Inventor
王荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ruoyi Technology Co ltd
Original Assignee
Nanjing Ruoyi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Ruoyi Technology Co ltd filed Critical Nanjing Ruoyi Technology Co ltd
Priority to CN202210412537.3A priority Critical patent/CN114757287A/en
Publication of CN114757287A publication Critical patent/CN114757287A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an automated testing method based on multi-modal fusion of text and images, comprising the following steps: acquiring image data of the entered interface through a camera; acquiring text data through a text detection model and a text recognition model; feeding the image data and the text data together into a multi-modal model for processing, where the multi-modal model contains a convolution layer and a max-pooling layer for processing the image data, after which image-modal features are extracted by a ResNet, and also contains a convolutional neural network that processes the text data to obtain text-modal features; and obtaining the label corresponding to the current image through the multi-modal model and judging whether the interface is correct. Because the multi-modal model fuses the text-modal features with the image-modal features, the automated test judges with higher accuracy whether the correct interface has been entered.

Description

Automatic testing method based on multi-mode fusion of text and image
Technical Field
The invention relates to the technical field of automated testing, and in particular to an automated testing method based on multi-modal fusion of text and images.
Background
Testing is an indispensable link in a well-developed system. As projects are increasingly maintained through rapid iteration, introducing automated testing in a reasonable way and at a reasonable time can effectively reduce manual maintenance costs. Automated testing is divided into internal testing and external testing; external testing is mainly judged by whether the interface achieves the expected result. When an automated test runs, the front-end interface is used to judge whether each path in the test steps operates in the specified state; if the interface is wrong, the test is automatically judged to have failed, and the problem point is located and fixed.
In the prior art, single-modal visual techniques judge the label corresponding to an interface from the image alone; because interface images contain a large number of similar interfaces, the accuracy of such single-modal visual techniques is low.
To this end, we propose an automated testing method based on multi-modal fusion of text and images to solve the above problems.
Disclosure of Invention
This section summarizes some aspects of embodiments of the invention and briefly introduces some preferred embodiments. Simplifications or omissions may be made in this section, in the abstract and in the title of the application to avoid obscuring their purpose; such simplifications or omissions are not intended to limit the scope of the invention.
The present invention is proposed in view of the above problems with existing automated testing methods.
Therefore, the object of the invention is to provide an automated testing method based on multi-modal fusion of text and images, so as to improve the accuracy of automated interface testing.
In order to solve the above technical problems, the invention provides the following technical solution:
An automated testing method based on multi-modal fusion of text and images, comprising the following steps:
Step one: acquire image data of the entered interface through a camera;
Step two: acquire text data through a text detection model and a text recognition model;
Step three: feed the image data and the text data together into a multi-modal model for processing, where the multi-modal model contains a convolution layer and a max-pooling layer for processing the image data, after which image-modal features are extracted by a ResNet, and also contains a convolutional neural network that processes the text data to obtain text-modal features;
Step four: obtain the label corresponding to the current image through the multi-modal model and judge whether the interface is correct; a minimal sketch of this end-to-end check follows.
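A minimal Python sketch of the end-to-end check is given below. It is a sketch under assumptions only: capture_screen, run_ocr and predict_label are hypothetical placeholders standing in for the camera capture, the text detection/recognition models and the trained multi-modal model, and none of these names come from the original description.

    def interface_check(expected_label, capture_screen, run_ocr, predict_label):
        """Hedged sketch of steps one to four: capture, recognise text, classify, compare."""
        image = capture_screen()             # step one: camera image of the entered interface
        texts = run_ocr(image)               # step two: text detection + text recognition
        label = predict_label(image, texts)  # step three: the multi-modal model predicts the label
        return label == expected_label       # step four: the test passes only if the labels match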
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: in step three, ResNet50 is selected for image feature extraction; the image-modal features extracted by ResNet50 and the text-modal features obtained from the convolutional neural network are fed into a Fusion Block module to obtain a fused feature layer; finally, the classification result predicted by the model is computed through a fully connected (Dense) layer and a Softmax function, the Softmax function converting the multi-class output values into relative probabilities that are easier to interpret and compare.
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: let the image-modal features extracted by the ResNet be Xi and the text-modal features obtained by the convolutional network be Xt. Xi and Xt are taken as the input of the Fusion Block module; the features of the image and text modalities are spliced together by means of a fully connected (Dense) layer and a concat operation, a tanh function is introduced, and the low-level text-modal features are then supplemented into the high-level image features using an add operation, thereby preserving the integrity of the original structural features of the image modality. The calculation formula is:
Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt))
The output of the Fusion Block module is:
Xoutput = add(Xtanh * Xi, Xi)
where Wi and Wt are the weights of the image and text modalities after the fully connected (Dense) layer, bi denotes the bias, and tanh is the activation function.
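The following is a minimal tf.keras sketch of the Fusion Block under stated assumptions: the feature widths (2048 for the ResNet image features, 256 for the text features) and the choice to project each branch to half the image width so that the concatenation matches Xi are assumptions, since the description fixes only the formulas above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def fusion_block(image_feat_dim=2048, text_feat_dim=256):
        """Sketch of Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt)) and Xoutput = add(Xtanh * Xi, Xi)."""
        x_i = layers.Input(shape=(image_feat_dim,), name="image_modal_features")  # Xi from the ResNet
        x_t = layers.Input(shape=(text_feat_dim,), name="text_modal_features")    # Xt from the text CNN
        # Wi·Xi + bi and Wt·Xt: each branch is projected so the concat matches Xi's width (assumed)
        proj_i = layers.Dense(image_feat_dim // 2, use_bias=True)(x_i)
        proj_t = layers.Dense(image_feat_dim // 2, use_bias=False)(x_t)
        x_tanh = layers.Activation("tanh")(layers.Concatenate()([proj_i, proj_t]))
        # add(Xtanh * Xi, Xi): gate the image features with the fused signal and keep a residual path
        x_out = layers.Add()([layers.Multiply()([x_tanh, x_i]), x_i])
        return tf.keras.Model(inputs=[x_i, x_t], outputs=x_out, name="fusion_block")

The returned fused feature layer would then feed the Dense and Softmax classification head described above.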
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention, step two is carried out as follows:
a. feed the image data obtained in step one into a text detection model to obtain the coordinate data of the text in the image;
b. cut out the text images according to the coordinate data and feed them into a text recognition model, thereby predicting all of the text data in a new image (see the sketch after this list).
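A hedged sketch of this two-stage pipeline follows; detection_model and recognition_model are placeholders for whatever text detection and recognition models are used, and their call signatures here are assumptions.

    def extract_texts(image, detection_model, recognition_model):
        """Step two sketch: detect text boxes, crop them, recognise each crop."""
        boxes = detection_model(image)             # a. coordinate data of every text region
        texts = []
        for (x1, y1, x2, y2) in boxes:             # b. cut out each text image ...
            crop = image[y1:y2, x1:x2]
            texts.append(recognition_model(crop))  # ... and predict its text content
        return texts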
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: in order to ensure the accuracy of the model, the multi-modal model is trained as follows: collect image and text data for each interface, label each group of image-text data with its corresponding label, and divide the data into a training set, a validation set and a test set in the ratio 8:1:1, as sketched below; the training set data is used for model training, the validation set data is used to verify the performance of the model during training so as to observe the training effect, and the test set data is used for the result evaluation of the final model.
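A minimal sketch of the 8:1:1 split, assuming scikit-learn is available and that samples stands for the collected (image, text, label) groups; the variable name and the library choice are assumptions.

    from sklearn.model_selection import train_test_split

    samples = list(range(100))  # stand-in for the collected (image, text, label) groups
    train, rest = train_test_split(samples, test_size=0.2, random_state=42, shuffle=True)
    val, test = train_test_split(rest, test_size=0.5, random_state=42, shuffle=True)
    # len(train) : len(val) : len(test) is now 80 : 10 : 10, i.e. the 8:1:1 ratio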
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and image, the method of the invention comprises: the cross entropy loss function is used as a loss function in the multi-modal model training, and the cross entropy can be regarded as the difficulty degree of the probability distribution p (x) represented by the probability distribution q (x) in the deep learning, and the expression is as follows:
H(p, q) = -∑x p(x) · log q(x)
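As a small numeric check of this expression (a standard cross-entropy computation, not taken from the original):

    import numpy as np

    p = np.array([0.0, 1.0, 0.0])         # true one-hot label distribution p(x)
    q = np.array([0.2, 0.7, 0.1])         # model's predicted distribution q(x), e.g. a softmax output
    h = -np.sum(p * np.log(q + 1e-12))    # H(p, q) = -sum over x of p(x) log q(x)
    print(h)                              # about 0.357; smaller when q puts more mass on the true class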
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: the prepared training set is fed cyclically into the multi-modal model by batch size for training; training ends after n iterations, and the trained model structure and weights are saved.
The invention has the following beneficial effects:
1. The multi-modal model fuses the text-modal features with the image-modal features, so that the automated test judges with higher accuracy whether the correct interface has been entered.
2. ResNet is selected as the backbone network for image-modal feature extraction, and its residual structure alleviates the degradation problem that networks are prone to during training.
3. A tanh activation function is introduced into the Fusion Block, giving the neural network nonlinearity while reducing, to a certain extent, the problem of vanishing gradients during back-propagation.
4. In the Fusion Block, the self-attention mechanism of a Transformer is used to help the text modality focus more on the features that have a greater influence on the result.
5. Compared with a method that relies only on the single image modality, the multi-modal network designed by the invention improves the recognition accuracy while increasing the model size by only 1.6 M.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without inventive effort. In the drawings:
Fig. 1 is a schematic flow chart of the automated testing method based on multi-modal fusion of text and images according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, the present invention may also be practiced in ways other than those specifically described here, as will be readily apparent to those of ordinary skill in the art, without departing from its spirit; the present invention is therefore not limited to the specific embodiments disclosed below.
Example 1
Referring to Fig. 1, a first embodiment of the present invention provides an automated testing method based on multi-modal fusion of text and images, comprising the following steps:
Step one: acquire image data of the entered interface through a camera;
Step two: acquire text data through a text detection model and a text recognition model, specifically as follows: feed the image data obtained in step one into the text detection model to obtain the coordinate data of the text in the image; then cut out the text images according to the coordinate data and feed them into the text recognition model so as to predict all of the text data in a new image;
Step three: feed the image data and the text data together into a multi-modal model for processing. The multi-modal model contains a convolution layer and a max-pooling layer for processing the image data: after passing through one convolution layer and one max-pooling layer, image-modal features are extracted by a ResNet, while a convolutional neural network processing the text data yields the text-modal features. Specifically, ResNet50 is selected for image feature extraction; the image-modal features extracted by ResNet50 and the text-modal features obtained from the convolutional neural network are fed into a Fusion Block module to obtain a fused feature layer, and finally the classification result predicted by the model is computed through a fully connected layer and a Softmax function. The Softmax function converts the multi-class output values into relative probabilities, which are easier to interpret and compare. Its purpose is to convert the classification results from the real-number range into probabilities between 0 and 1: the exponential maps real numbers to (0, +∞), so no value is negative, and normalization then converts the result into probabilities between 0 and 1.
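A small numeric illustration of the Softmax conversion (standard behaviour, not specific to the invention):

    import numpy as np

    logits = np.array([2.0, 1.0, -1.0])            # raw multi-class outputs, which may be negative
    probs = np.exp(logits) / np.exp(logits).sum()  # exp maps to (0, +inf); normalisation maps into (0, 1)
    print(probs, probs.sum())                      # roughly [0.705 0.259 0.035], summing to 1.0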
The core operation of the fully connected (Dense) layer is the matrix-vector product; in essence it linearly transforms one feature space into another. The purpose of the Dense layer is therefore to extract, through the nonlinear transformation inside Dense, the correlations among the previously extracted features and finally map them onto the output space.
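As a concrete illustration of this matrix-vector view of a Dense layer (the numbers are made up):

    import numpy as np

    W = np.array([[0.5, -0.2, 0.1],
                  [0.3,  0.8, -0.5]])  # weight matrix mapping a 3-dimensional feature space to a 2-dimensional one
    b = np.array([0.1, -0.1])          # bias
    x = np.array([1.0, 2.0, 3.0])      # input feature vector
    y = np.tanh(W @ x + b)             # linear transform into the new space, followed by a nonlinearity
    print(y)                           # roughly [0.46 0.29]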
Step four: obtain the label corresponding to the current image through the multi-modal model and judge whether the interface is correct.
It should be noted that ResNet is a CNN network structure which, depending on network depth, comes in variants such as ResNet18, ResNet50 and ResNet101. ResNet50 is selected for image feature extraction in the present invention, and only the output of the last ResNet block is taken, i.e. the final average-pooling layer and fully connected layer of the original network are discarded.
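A hedged tf.keras sketch of this truncation; the input size (224x224) and the weights argument are assumptions, the description only requiring that the original average-pooling and fully connected layers be dropped.

    import tensorflow as tf

    backbone = tf.keras.applications.ResNet50(include_top=False,          # drop the avg-pool + FC head
                                              weights=None,               # weight initialisation is an assumption
                                              input_shape=(224, 224, 3))  # assumed input size
    print(backbone.output_shape)  # (None, 7, 7, 2048): the last ResNet block's feature map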
In addition, in step three, the image-modal features extracted by the ResNet are denoted Xi and the text-modal features obtained by the convolutional network are denoted Xt, and Xi and Xt are taken as the input of the Fusion Block module. The self-attention of a Transformer is used to help the text modality pay more attention to the features that have a greater influence on the result; the image and text features are then spliced together by means of a fully connected (Dense) layer and a concat operation, a tanh function is introduced, and the low-level text-modal features are supplemented into the high-level image features using an add operation. For the two inputs, if the number of channels is the same and convolutions follow, add is equivalent to operating on the corresponding channels after concat splicing, so the integrity of the original structural features of the image modality is preserved. The calculation formula is:
Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt))
The output of the Fusion Block module is:
Xoutput = add(Xtanh * Xi, Xi)
where Wi and Wt are the weights of the image and text modalities after the fully connected (Dense) layer, bi denotes the bias, and tanh is the activation function.
In order to ensure the accuracy of the multi-modal model, it is trained as follows: collect image and text data for each interface, label each group of image-text data with its corresponding label, and divide the data into a training set, a validation set and a test set in the ratio 8:1:1. The training set data is used for model training; the validation set data is used to verify the performance of the model during training so as to observe the training effect; the test set data is used for the result evaluation of the final model. A cross-entropy loss function is used as the loss function in the multi-modal model training; in deep learning, the cross entropy can be regarded as the difficulty of representing the probability distribution p(x) by the probability distribution q(x), and its expression is:
H(p, q) = -∑x p(x) · log q(x)
Specifically, the prepared training set is fed cyclically into the multi-modal model by batch size for training; training is completed after n iterations, and the trained model structure and weights are saved, as sketched below.
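A minimal training-and-saving sketch under stated assumptions: the optimiser, batch size (32), epoch count (50) and file name are assumptions, while the cross-entropy loss and the roles of the three data splits follow the description; model is the assembled multi-modal Keras model and each *_data argument is an ([images, texts], labels) pair.

    def train_and_save(model, train_data, val_data, test_data, out_path="multimodal_model.h5"):
        """Train the multi-modal model, monitor it on the validation set, then evaluate and save it."""
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",  # the cross-entropy loss described above
                      metrics=["accuracy"])
        model.fit(train_data[0], train_data[1],
                  batch_size=32,                        # training set fed cyclically by batch size
                  epochs=50,                            # training ends after n iterations
                  validation_data=val_data)             # used to observe the training effect
        model.evaluate(test_data[0], test_data[1])      # result evaluation of the final model
        model.save(out_path)                            # store the trained model structure and weights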
It should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (7)

1. An automated testing method based on multi-modal fusion of text and images, characterized by comprising the following steps:
Step one: acquire image data of the entered interface through a camera;
Step two: acquire text data through a text detection model and a text recognition model;
Step three: feed the image data and the text data together into a multi-modal model for processing, where the multi-modal model contains a convolution layer and a max-pooling layer for processing the image data, after which image-modal features are extracted by a ResNet, and also contains a convolutional neural network that processes the text data to obtain text-modal features;
Step four: obtain the label corresponding to the current image through the multi-modal model and judge whether the interface is correct.
2. The automated testing method based on multi-modal fusion of text and images according to claim 1, characterized in that: in step three, ResNet50 is selected for image feature extraction; the image-modal features extracted by ResNet50 and the text-modal features obtained from the convolutional neural network are fed into a Fusion Block module to obtain a fused feature layer; finally, the classification result predicted by the model is computed through a fully connected (Dense) layer and a Softmax function, the Softmax function converting the multi-class output values into relative probabilities that are easier to interpret and compare.
3. The automated testing method based on multi-modal fusion of text and images according to claim 2, characterized in that: the image-modal features extracted by the ResNet are denoted Xi and the text-modal features obtained by the convolutional network are denoted Xt; Xi and Xt are taken as the input of the Fusion Block module; the features of the image and text modalities are spliced together by means of a fully connected (Dense) layer and a concat operation, a tanh function is introduced, and the low-level text-modal features are then supplemented into the high-level image features using an add operation, thereby preserving the integrity of the original structural features of the image modality; the calculation formula is:
Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt))
and the output of the Fusion Block module is:
Xoutput = add(Xtanh * Xi, Xi);
where Wi and Wt are the weights of the image and text modalities after the fully connected (Dense) layer, bi denotes the bias, and tanh is the activation function.
4. The automated testing method based on multi-modal fusion of text and images according to claim 3, characterized in that step two is carried out as follows:
a. feed the image data obtained in step one into a text detection model to obtain the coordinate data of the text in the image;
b. cut out the text images according to the coordinate data and feed them into a text recognition model, thereby predicting all of the text data in a new image.
5. The automated testing method based on multi-modal fusion of text and images according to any one of claims 1 to 4, characterized in that: in order to ensure the accuracy of the model, the multi-modal model is trained as follows: collect image and text data for each interface, label each group of image-text data with its corresponding label, and divide the data into a training set, a validation set and a test set in the ratio 8:1:1, wherein the training set data is used for model training, the validation set data is used to verify the performance of the model during training so as to observe the training effect, and the test set data is used for the result evaluation of the final model.
6. The automated testing method based on multi-modal fusion of text and images according to claim 5, characterized in that: a cross-entropy loss function is used as the loss function in the multi-modal model training; in deep learning, the cross entropy can be regarded as the difficulty of representing the probability distribution p(x) by the probability distribution q(x), and its expression is:
H(p, q) = -∑x p(x) · log q(x)
7. The automated testing method based on multi-modal fusion of text and images according to claim 6, characterized in that: the prepared training set is fed cyclically into the multi-modal model by batch size for training; training ends after n iterations, and the trained model structure and weights are saved.
CN202210412537.3A 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image Pending CN114757287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210412537.3A CN114757287A (en) 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210412537.3A CN114757287A (en) 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image

Publications (1)

Publication Number Publication Date
CN114757287A true CN114757287A (en) 2022-07-15

Family

ID=82330973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210412537.3A Pending CN114757287A (en) 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image

Country Status (1)

Country Link
CN (1) CN114757287A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973294A (en) * 2022-07-28 2022-08-30 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN114973294B (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN112069921A (en) Small sample visual target identification method based on self-supervision knowledge migration
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN109902202B (en) Video classification method and device
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN107291775B (en) Method and device for generating repairing linguistic data of error sample
CN112184508A (en) Student model training method and device for image processing
CN111653275B (en) Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
CN110135505B (en) Image classification method and device, computer equipment and computer readable storage medium
CN110598603A (en) Face recognition model acquisition method, device, equipment and medium
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN116610803B (en) Industrial chain excellent enterprise information management method and system based on big data
CN111428750A (en) Text recognition model training and text recognition method, device and medium
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN114757287A (en) Automatic testing method based on multi-mode fusion of text and image
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN111309921A (en) Text triple extraction method and extraction system
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
CN111242114B (en) Character recognition method and device
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN111259197A (en) Video description generation method based on pre-coding semantic features
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN114419514A (en) Data processing method and device, computer equipment and storage medium
CN113159071A (en) Cross-modal image-text association anomaly detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination