CN114757287A - Automatic testing method based on multi-mode fusion of text and image - Google Patents

Automatic testing method based on multi-mode fusion of text and image

Info

Publication number
CN114757287A
CN114757287A CN202210412537.3A
Authority
CN
China
Prior art keywords
text
image
model
modal
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210412537.3A
Other languages
Chinese (zh)
Inventor
王荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ruoyi Technology Co ltd
Original Assignee
Nanjing Ruoyi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Ruoyi Technology Co ltd filed Critical Nanjing Ruoyi Technology Co ltd
Priority to CN202210412537.3A priority Critical patent/CN114757287A/en
Publication of CN114757287A publication Critical patent/CN114757287A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an automated testing method based on multi-modal fusion of text and images, comprising the following steps: acquiring image data of the entered interface through a camera; acquiring text data through a text detection model and a text recognition model; feeding the image data and the text data together into a multi-modal model for processing, where the multi-modal model contains a convolution layer and a max-pooling layer for processing the image data, after which image-modal features are extracted by a ResNet, and also contains a convolutional neural network that processes the text data to obtain text-modal features; and obtaining the label corresponding to the current image through the multi-modal model and judging whether the interface is correct. Because the multi-modal model fuses the text-modal features with the image-modal features, the automated test judges with higher accuracy whether the correct interface has been entered.

Description

Automatic testing method based on multi-mode fusion of text and image
Technical Field
The invention relates to the technical field of automated testing, and in particular to an automated testing method based on multi-modal fusion of text and images.
Background
Testing is an indispensable link in a well-developed system. As projects are increasingly maintained through rapid iteration, introducing automated testing in a reasonable way and at a reasonable time can effectively reduce manual maintenance costs. Automated testing is divided into internal testing and external testing; external testing is mainly judged by whether the interface achieves the expected result. When an automated test runs, the front-end interface is used to judge whether each path in the test steps operates in the specified state; if the interface is wrong, the test is automatically judged to have failed, and the problem point is located and fixed.
In the prior art, single-modal visual techniques judge the label corresponding to an interface from the image alone; because interface images contain a large number of similar interfaces, the accuracy of such single-modal visual techniques is low.
To this end, we propose an automated testing method based on multi-modal fusion of text and images to solve the above problems.
Disclosure of Invention
This section summarizes some aspects of embodiments of the invention and briefly introduces some preferred embodiments. Simplifications or omissions may be made in this section, in the abstract and in the title of the application to avoid obscuring their purpose; such simplifications or omissions are not intended to limit the scope of the invention.
The present invention is proposed in view of the above problems with existing automated testing methods.
Therefore, the object of the invention is to provide an automated testing method based on multi-modal fusion of text and images, so as to improve the accuracy of automated interface testing.
In order to solve the above technical problems, the invention provides the following technical solution:
An automated testing method based on multi-modal fusion of text and images, comprising the following steps:
Step one: acquire image data of the entered interface through a camera;
Step two: acquire text data through a text detection model and a text recognition model;
Step three: feed the image data and the text data together into a multi-modal model for processing, where the multi-modal model contains a convolution layer and a max-pooling layer for processing the image data, after which image-modal features are extracted by a ResNet, and also contains a convolutional neural network that processes the text data to obtain text-modal features;
Step four: obtain the label corresponding to the current image through the multi-modal model and judge whether the interface is correct; a minimal sketch of this end-to-end check follows.
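A minimal Python sketch of the end-to-end check is given below. It is a sketch under assumptions only: capture_screen, run_ocr and predict_label are hypothetical placeholders standing in for the camera capture, the text detection/recognition models and the trained multi-modal model, and none of these names come from the original description.

    def interface_check(expected_label, capture_screen, run_ocr, predict_label):
        """Hedged sketch of steps one to four: capture, recognise text, classify, compare."""
        image = capture_screen()             # step one: camera image of the entered interface
        texts = run_ocr(image)               # step two: text detection + text recognition
        label = predict_label(image, texts)  # step three: the multi-modal model predicts the label
        return label == expected_label       # step four: the test passes only if the labels match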
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: in step three, ResNet50 is selected for image feature extraction; the image-modal features extracted by ResNet50 and the text-modal features obtained from the convolutional neural network are fed into a Fusion Block module to obtain a fused feature layer; finally, the classification result predicted by the model is computed through a fully connected (Dense) layer and a Softmax function, the Softmax function converting the multi-class output values into relative probabilities that are easier to interpret and compare.
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: let the image-modal features extracted by the ResNet be Xi and the text-modal features obtained by the convolutional network be Xt. Xi and Xt are taken as the input of the Fusion Block module; the features of the image and text modalities are spliced together by means of a fully connected (Dense) layer and a concat operation, a tanh function is introduced, and the low-level text-modal features are then supplemented into the high-level image features using an add operation, thereby preserving the integrity of the original structural features of the image modality. The calculation formula is:
Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt))
The output of the Fusion Block module is:
Xoutput = add(Xtanh * Xi, Xi)
where Wi and Wt are the weights of the image and text modalities after the fully connected (Dense) layer, bi denotes the bias, and tanh is the activation function.
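The following is a minimal tf.keras sketch of the Fusion Block under stated assumptions: the feature widths (2048 for the ResNet image features, 256 for the text features) and the choice to project each branch to half the image width so that the concatenation matches Xi are assumptions, since the description fixes only the formulas above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def fusion_block(image_feat_dim=2048, text_feat_dim=256):
        """Sketch of Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt)) and Xoutput = add(Xtanh * Xi, Xi)."""
        x_i = layers.Input(shape=(image_feat_dim,), name="image_modal_features")  # Xi from the ResNet
        x_t = layers.Input(shape=(text_feat_dim,), name="text_modal_features")    # Xt from the text CNN
        # Wi·Xi + bi and Wt·Xt: each branch is projected so the concat matches Xi's width (assumed)
        proj_i = layers.Dense(image_feat_dim // 2, use_bias=True)(x_i)
        proj_t = layers.Dense(image_feat_dim // 2, use_bias=False)(x_t)
        x_tanh = layers.Activation("tanh")(layers.Concatenate()([proj_i, proj_t]))
        # add(Xtanh * Xi, Xi): gate the image features with the fused signal and keep a residual path
        x_out = layers.Add()([layers.Multiply()([x_tanh, x_i]), x_i])
        return tf.keras.Model(inputs=[x_i, x_t], outputs=x_out, name="fusion_block")

The returned fused feature layer would then feed the Dense and Softmax classification head described above.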
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention, step two is carried out as follows:
a. feed the image data obtained in step one into a text detection model to obtain the coordinate data of the text in the image;
b. cut out the text images according to the coordinate data and feed them into a text recognition model, thereby predicting all of the text data in a new image (see the sketch after this list).
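A hedged sketch of this two-stage pipeline follows; detection_model and recognition_model are placeholders for whatever text detection and recognition models are used, and their call signatures here are assumptions.

    def extract_texts(image, detection_model, recognition_model):
        """Step two sketch: detect text boxes, crop them, recognise each crop."""
        boxes = detection_model(image)             # a. coordinate data of every text region
        texts = []
        for (x1, y1, x2, y2) in boxes:             # b. cut out each text image ...
            crop = image[y1:y2, x1:x2]
            texts.append(recognition_model(crop))  # ... and predict its text content
        return texts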
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: in order to ensure the accuracy of the model, the multi-modal model is trained as follows: collect image and text data for each interface, label each group of image-text data with its corresponding label, and divide the data into a training set, a validation set and a test set in the ratio 8:1:1, as sketched below; the training set data is used for model training, the validation set data is used to verify the performance of the model during training so as to observe the training effect, and the test set data is used for the result evaluation of the final model.
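A minimal sketch of the 8:1:1 split, assuming scikit-learn is available and that samples stands for the collected (image, text, label) groups; the variable name and the library choice are assumptions.

    from sklearn.model_selection import train_test_split

    samples = list(range(100))  # stand-in for the collected (image, text, label) groups
    train, rest = train_test_split(samples, test_size=0.2, random_state=42, shuffle=True)
    val, test = train_test_split(rest, test_size=0.5, random_state=42, shuffle=True)
    # len(train) : len(val) : len(test) is now 80 : 10 : 10, i.e. the 8:1:1 ratio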
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and image, the method of the invention comprises: the cross entropy loss function is used as a loss function in the multi-modal model training, and the cross entropy can be regarded as the difficulty degree of the probability distribution p (x) represented by the probability distribution q (x) in the deep learning, and the expression is as follows:
H(p, q) = -∑x p(x) · log q(x)
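As a small numeric check of this expression (a standard cross-entropy computation, not taken from the original):

    import numpy as np

    p = np.array([0.0, 1.0, 0.0])         # true one-hot label distribution p(x)
    q = np.array([0.2, 0.7, 0.1])         # model's predicted distribution q(x), e.g. a softmax output
    h = -np.sum(p * np.log(q + 1e-12))    # H(p, q) = -sum over x of p(x) log q(x)
    print(h)                              # about 0.357; smaller when q puts more mass on the true class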
As a preferred embodiment of the automated testing method based on multi-modal fusion of text and images of the invention: the prepared training set is fed cyclically into the multi-modal model by batch size for training; training ends after n iterations, and the trained model structure and weights are saved.
The invention has the following beneficial effects:
1. The multi-modal model fuses the text-modal features with the image-modal features, so that the automated test judges with higher accuracy whether the correct interface has been entered.
2. ResNet is selected as the backbone network for image-modal feature extraction, and its residual structure alleviates the degradation problem that networks are prone to during training.
3. A tanh activation function is introduced into the Fusion Block, giving the neural network nonlinearity while reducing, to a certain extent, the problem of vanishing gradients during back-propagation.
4. In the Fusion Block, the self-attention mechanism of a Transformer is used to help the text modality focus more on the features that have a greater influence on the result.
5. Compared with a method that relies only on the single image modality, the multi-modal network designed by the invention improves the recognition accuracy while increasing the model size by only 1.6 M.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without inventive effort. In the drawings:
Fig. 1 is a schematic flow chart of the automated testing method based on multi-modal fusion of text and images according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, the present invention may also be practiced in ways other than those specifically described here, as will be readily apparent to those of ordinary skill in the art, without departing from its spirit; the present invention is therefore not limited to the specific embodiments disclosed below.
Example 1
Referring to Fig. 1, a first embodiment of the present invention provides an automated testing method based on multi-modal fusion of text and images, comprising the following steps:
Step one: acquire image data of the entered interface through a camera;
Step two: acquire text data through a text detection model and a text recognition model, specifically as follows: feed the image data obtained in step one into the text detection model to obtain the coordinate data of the text in the image; then cut out the text images according to the coordinate data and feed them into the text recognition model so as to predict all of the text data in a new image;
Step three: feed the image data and the text data together into a multi-modal model for processing. The multi-modal model contains a convolution layer and a max-pooling layer for processing the image data: after passing through one convolution layer and one max-pooling layer, image-modal features are extracted by a ResNet, while a convolutional neural network processing the text data yields the text-modal features. Specifically, ResNet50 is selected for image feature extraction; the image-modal features extracted by ResNet50 and the text-modal features obtained from the convolutional neural network are fed into a Fusion Block module to obtain a fused feature layer, and finally the classification result predicted by the model is computed through a fully connected layer and a Softmax function. The Softmax function converts the multi-class output values into relative probabilities, which are easier to interpret and compare. Its purpose is to convert the classification results from the real-number range into probabilities between 0 and 1: the exponential maps real numbers to (0, +∞), so no value is negative, and normalization then converts the result into probabilities between 0 and 1.
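A small numeric illustration of the Softmax conversion (standard behaviour, not specific to the invention):

    import numpy as np

    logits = np.array([2.0, 1.0, -1.0])            # raw multi-class outputs, which may be negative
    probs = np.exp(logits) / np.exp(logits).sum()  # exp maps to (0, +inf); normalisation maps into (0, 1)
    print(probs, probs.sum())                      # roughly [0.705 0.259 0.035], summing to 1.0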
The core operation of the fully connected (Dense) layer is the matrix-vector product; in essence it linearly transforms one feature space into another. The purpose of the Dense layer is therefore to extract, through the nonlinear transformation inside Dense, the correlations among the previously extracted features and finally map them onto the output space.
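As a concrete illustration of this matrix-vector view of a Dense layer (the numbers are made up):

    import numpy as np

    W = np.array([[0.5, -0.2, 0.1],
                  [0.3,  0.8, -0.5]])  # weight matrix mapping a 3-dimensional feature space to a 2-dimensional one
    b = np.array([0.1, -0.1])          # bias
    x = np.array([1.0, 2.0, 3.0])      # input feature vector
    y = np.tanh(W @ x + b)             # linear transform into the new space, followed by a nonlinearity
    print(y)                           # roughly [0.46 0.29]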
Step four: obtain the label corresponding to the current image through the multi-modal model and judge whether the interface is correct.
It should be noted that ResNet is a CNN network structure which, depending on network depth, comes in variants such as ResNet18, ResNet50 and ResNet101. ResNet50 is selected for image feature extraction in the present invention, and only the output of the last ResNet block is taken, i.e. the final average-pooling layer and fully connected layer of the original network are discarded.
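A hedged tf.keras sketch of this truncation; the input size (224x224) and the weights argument are assumptions, the description only requiring that the original average-pooling and fully connected layers be dropped.

    import tensorflow as tf

    backbone = tf.keras.applications.ResNet50(include_top=False,          # drop the avg-pool + FC head
                                              weights=None,               # weight initialisation is an assumption
                                              input_shape=(224, 224, 3))  # assumed input size
    print(backbone.output_shape)  # (None, 7, 7, 2048): the last ResNet block's feature map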
In addition, in step three, the image-modal features extracted by the ResNet are denoted Xi and the text-modal features obtained by the convolutional network are denoted Xt, and Xi and Xt are taken as the input of the Fusion Block module. The self-attention of a Transformer is used to help the text modality pay more attention to the features that have a greater influence on the result; the image and text features are then spliced together by means of a fully connected (Dense) layer and a concat operation, a tanh function is introduced, and the low-level text-modal features are supplemented into the high-level image features using an add operation. For the two inputs, if the number of channels is the same and convolutions follow, add is equivalent to operating on the corresponding channels after concat splicing, so the integrity of the original structural features of the image modality is preserved. The calculation formula is:
Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt))
The output of the Fusion Block module is:
Xoutput = add(Xtanh * Xi, Xi)
where Wi and Wt are the weights of the image and text modalities after the fully connected (Dense) layer, bi denotes the bias, and tanh is the activation function.
In order to ensure the accuracy of the multi-modal model, it is trained as follows: collect image and text data for each interface, label each group of image-text data with its corresponding label, and divide the data into a training set, a validation set and a test set in the ratio 8:1:1. The training set data is used for model training; the validation set data is used to verify the performance of the model during training so as to observe the training effect; the test set data is used for the result evaluation of the final model. A cross-entropy loss function is used as the loss function in the multi-modal model training; in deep learning, the cross entropy can be regarded as the difficulty of representing the probability distribution p(x) by the probability distribution q(x), and its expression is:
H(p, q) = -∑x p(x) · log q(x)
Specifically, the prepared training set is fed cyclically into the multi-modal model by batch size for training; training is completed after n iterations, and the trained model structure and weights are saved, as sketched below.
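A minimal training-and-saving sketch under stated assumptions: the optimiser, batch size (32), epoch count (50) and file name are assumptions, while the cross-entropy loss and the roles of the three data splits follow the description; model is the assembled multi-modal Keras model and each *_data argument is an ([images, texts], labels) pair.

    def train_and_save(model, train_data, val_data, test_data, out_path="multimodal_model.h5"):
        """Train the multi-modal model, monitor it on the validation set, then evaluate and save it."""
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",  # the cross-entropy loss described above
                      metrics=["accuracy"])
        model.fit(train_data[0], train_data[1],
                  batch_size=32,                        # training set fed cyclically by batch size
                  epochs=50,                            # training ends after n iterations
                  validation_data=val_data)             # used to observe the training effect
        model.evaluate(test_data[0], test_data[1])      # result evaluation of the final model
        model.save(out_path)                            # store the trained model structure and weights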
It should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (7)

1. An automated testing method based on multi-modal fusion of text and images, characterized by comprising the following steps:
Step one: acquire image data of the entered interface through a camera;
Step two: acquire text data through a text detection model and a text recognition model;
Step three: feed the image data and the text data together into a multi-modal model for processing, where the multi-modal model contains a convolution layer and a max-pooling layer for processing the image data, after which image-modal features are extracted by a ResNet, and also contains a convolutional neural network that processes the text data to obtain text-modal features;
Step four: obtain the label corresponding to the current image through the multi-modal model and judge whether the interface is correct.
2. The automated testing method based on multi-modal fusion of text and images according to claim 1, characterized in that: in step three, ResNet50 is selected for image feature extraction; the image-modal features extracted by ResNet50 and the text-modal features obtained from the convolutional neural network are fed into a Fusion Block module to obtain a fused feature layer; finally, the classification result predicted by the model is computed through a fully connected (Dense) layer and a Softmax function, the Softmax function converting the multi-class output values into relative probabilities that are easier to interpret and compare.
3. The automated testing method based on multi-modal fusion of text and images according to claim 2, characterized in that: the image-modal features extracted by the ResNet are denoted Xi and the text-modal features obtained by the convolutional network are denoted Xt; Xi and Xt are taken as the input of the Fusion Block module; the features of the image and text modalities are spliced together by means of a fully connected (Dense) layer and a concat operation, a tanh function is introduced, and the low-level text-modal features are then supplemented into the high-level image features using an add operation, thereby preserving the integrity of the original structural features of the image modality; the calculation formula is:
Xtanh = tanh(concat(Wi·Xi + bi, Wt·Xt))
and the output of the Fusion Block module is:
Xoutput = add(Xtanh * Xi, Xi);
where Wi and Wt are the weights of the image and text modalities after the fully connected (Dense) layer, bi denotes the bias, and tanh is the activation function.
4. The automated testing method based on multi-modal fusion of text and images according to claim 3, characterized in that step two is carried out as follows:
a. feed the image data obtained in step one into a text detection model to obtain the coordinate data of the text in the image;
b. cut out the text images according to the coordinate data and feed them into a text recognition model, thereby predicting all of the text data in a new image.
5. The automated testing method based on multi-modal fusion of text and images according to any one of claims 1 to 4, characterized in that: in order to ensure the accuracy of the model, the multi-modal model is trained as follows: collect image and text data for each interface, label each group of image-text data with its corresponding label, and divide the data into a training set, a validation set and a test set in the ratio 8:1:1, wherein the training set data is used for model training, the validation set data is used to verify the performance of the model during training so as to observe the training effect, and the test set data is used for the result evaluation of the final model.
6. The automated testing method based on multi-modal fusion of text and images according to claim 5, characterized in that: a cross-entropy loss function is used as the loss function in the multi-modal model training; in deep learning, the cross entropy can be regarded as the difficulty of representing the probability distribution p(x) by the probability distribution q(x), and its expression is:
H(p, q) = -∑x p(x) · log q(x)
7. The automated testing method based on multi-modal fusion of text and images according to claim 6, characterized in that: the prepared training set is fed cyclically into the multi-modal model by batch size for training; training ends after n iterations, and the trained model structure and weights are saved.
CN202210412537.3A 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image Pending CN114757287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210412537.3A CN114757287A (en) 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210412537.3A CN114757287A (en) 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image

Publications (1)

Publication Number Publication Date
CN114757287A true CN114757287A (en) 2022-07-15

Family

ID=82330973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210412537.3A Pending CN114757287A (en) 2022-04-19 2022-04-19 Automatic testing method based on multi-mode fusion of text and image

Country Status (1)

Country Link
CN (1) CN114757287A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973294A (en) * 2022-07-28 2022-08-30 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN114973294B (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN112069921A (en) Small sample visual target identification method based on self-supervision knowledge migration
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN109902202B (en) Video classification method and device
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN107291775B (en) Method and device for generating repairing linguistic data of error sample
CN112184508A (en) Student model training method and device for image processing
CN111653275B (en) Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
CN110135505B (en) Image classification method and device, computer equipment and computer readable storage medium
CN110598603A (en) Face recognition model acquisition method, device, equipment and medium
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN116610803B (en) Industrial chain excellent enterprise information management method and system based on big data
CN111428750A (en) Text recognition model training and text recognition method, device and medium
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN114757287A (en) Automatic testing method based on multi-mode fusion of text and image
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN111309921A (en) Text triple extraction method and extraction system
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning
CN111242114B (en) Character recognition method and device
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN111259197A (en) Video description generation method based on pre-coding semantic features
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN114419514A (en) Data processing method and device, computer equipment and storage medium
CN113159071A (en) Cross-modal image-text association anomaly detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination