WO2022252239A1 - Computer vision-based mobile terminal application control identification method - Google Patents


Info

Publication number
WO2022252239A1
WO2022252239A1 (PCT/CN2021/098490)
Authority
WO
WIPO (PCT)
Prior art keywords
control
mobile terminal
computer vision
image
terminal application
Prior art date
Application number
PCT/CN2021/098490
Other languages
French (fr)
Chinese (zh)
Inventor
卜佳俊
张建锋
周晟
刘美含
王炜
于智
Original Assignee
浙江大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学 filed Critical 浙江大学
Publication of WO2022252239A1 publication Critical patent/WO2022252239A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • The invention relates to a computer vision-based, non-intrusive control recognition algorithm for mobile terminal applications, and belongs to the field of computer software technology.
  • GUI: Graphical User Interface.
  • Common control identification methods mostly identify controls from control properties and fall into three main categories: coordinate-based, source-code-based, and control-tree-based identification.
  • To address these limitations, the present invention proposes a computer vision-based mobile terminal application control recognition method.
  • Compared with coordinate-based identification, this method adapts to different platforms and to devices with different resolutions, and is therefore more universal.
  • Compared with source-code-based identification, this method is non-intrusive: it does not require the source code of the software, so it can be used in scenarios such as black-box testing and has a wider range of applications.
  • Compared with control-tree-based identification, this method is not affected by the page hierarchy or control positions, and can flexibly handle various complex scenes.
  • In addition, this method achieves semantic understanding of each control: beyond determining a control's position and attributes, it can also identify the control's specific purpose.
  • S101: Open the screen reader, whose main function is to describe on-screen elements by voice and outline them with a focus frame.
  • S102: Open the target software, perform one operation on the screen with the robotic arm, take a screenshot, upload it to the server, and preprocess the image.
  • S103: For the screenshot obtained in S102, determine the RGB values corresponding to the color of the focus frame, and derive the RGB range by superimposing that color on different backgrounds.
  • S104: Extract from the screenshot of S102 the pixels that fall inside the RGB range obtained in S103, yielding a single-channel image.
  • S105: Pad the single-channel image from S104 and perform edge detection on it to find the edges of the focus frame.
  • S106: Perform straight-line detection on the edges obtained in S105 and derive each line's Cartesian (rectangular-coordinate) equation.
  • S107: Filter the equations from S106 to keep the lines belonging to the focus frame, compute the control's center coordinates, width, and height, and convert them into screen ratios.
  • S108: Using the center coordinates and the width/height screen ratios from S107, crop the rectangle outlined by the focus frame and determine its function by computer vision.
  • S109: Pair the control coordinates from S107 with the text recognized in S108, and build a page tree whose nodes are these key-value pairs.
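The steps above can be sketched as one pipeline. This is an illustrative outline only: the stage names are hypothetical, each stage is an injected callable, and the concrete implementations (color masking, edge and line detection, OCR) are the ones the individual steps describe.

```python
def recognize_control(screenshot, stages):
    """High-level data flow of S102-S108; `stages` maps hypothetical
    stage names to callables supplied by the caller."""
    img = stages["preprocess"](screenshot)        # S102: crop to the app GUI
    mask = stages["color_mask"](img)              # S103-S104: focus-frame color mask
    edges = stages["edges"](stages["pad"](mask))  # S105: pad, then edge detection
    lines = stages["lines"](edges)                # S106-S107: focus-frame lines
    return stages["classify"](img, lines)         # S108: control function
```

Because every stage is injected, each can be swapped or tested in isolation.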
  • In step S101, the screen readers used are as follows. S201: Talkback on Android and VoiceOver on iOS.
  • In step S102, the screenshot requirement is as follows. S301: the image must be in PNG format.
  • In step S102, the robotic-arm operations are as follows. S401: swipe left, swipe right, and double tap.
  • In step S102, the image-preprocessing scheme is as follows. S501: crop the screen region occupied by the graphical user interface of the mobile software.
  • In step S103, the RGB range is obtained as follows. S601: determine the several RGB values corresponding to the focus frame, yielding an initial range; S602: superimpose different gray levels onto the background within the initial range obtained in S601 to get the final range.
  • In step S104, the single-channel image is obtained as follows. S701: traverse the image matrix from S102 against the RGB range obtained in S103; S702: set the value of each pixel inside the RGB range to 1; S703: set the value of each pixel outside the RGB range to 0.
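A minimal sketch of S701-S703, assuming NumPy and an H×W×3 RGB array; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def extract_focus_mask(img, lo, hi):
    """S701-S703: mark pixels whose RGB values lie inside the focus-frame
    color range [lo, hi] with 1 and all others with 0 (single channel)."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    inside = np.all((img >= lo) & (img <= hi), axis=-1)
    return inside.astype(np.uint8)
```

Vectorizing the per-pixel test with `np.all` over the channel axis replaces the explicit traversal of S701 without changing the result.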
  • In step S105, the image is padded as follows. S801: append a zero-valued matrix 50 pixels wide to each of the left and right sides of the image; S802: to the image obtained in S801, append a zero-valued matrix 50 pixels high to each of its top and bottom sides.
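S801-S802 amount to zero-padding the mask by a 50-pixel border, which keeps focus-frame lines at the screen edge detectable. A sketch, assuming NumPy:

```python
import numpy as np

def pad_mask(mask, border=50):
    """S801-S802: splice zero-valued borders (width/height 50) around the
    single-channel image; np.pad handles all four sides at once."""
    return np.pad(mask, border, mode="constant", constant_values=0)
```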
  • In step S105, the edge-detection scheme is as follows. S901: apply Gaussian denoising to the image; S902: compute the gradient of the denoised image obtained in S901 and, from the gradient, the edge magnitude and angle; S903: apply non-maximum suppression along the gradient direction using the magnitude and angle obtained in S902; S904: apply double-threshold edge linking to obtain the edges.
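S901-S904 is the classic Canny-style scheme. A deliberately simplified NumPy stand-in for the gradient stage (S902) might look like the following; it omits denoising, non-maximum suppression, and hysteresis, so it is a sketch of one stage rather than the full pipeline:

```python
import numpy as np

def gradient_edges(img, thresh=0.25):
    """S902 only: central-difference gradients, edge magnitude, and a
    single threshold. The full method adds Gaussian denoising (S901),
    non-maximum suppression along `angle` (S903), and double-threshold
    edge linking (S904)."""
    gy, gx = np.gradient(img.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)  # consumed by S903 in the full pipeline
    return (magnitude > thresh).astype(np.uint8), angle
```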
  • In step S106, straight-line detection proceeds as follows. S1001: convert the coordinates of each point of the image obtained in S904 into polar coordinates; S1002: compute the line equation corresponding to each point, points sharing a common line equation lying on the same line; S1003: count the pixels on each line; S1004: if the pixel count of a line obtained in S1003 exceeds a given threshold, keep the line; S1005: if it does not exceed the threshold, discard the line.
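Since the focus frame is axis-aligned, the Hough-style vote of S1001-S1005 reduces to counting set pixels along each row and column (the θ = 0° and θ = 90° bins only). A sketch under that simplifying assumption:

```python
import numpy as np

def detect_frame_lines(mask, thresh):
    """S1003-S1005 restricted to horizontal/vertical lines: keep the rows
    and columns whose set-pixel count reaches `thresh`, discard the rest."""
    rows = np.where(mask.sum(axis=1) >= thresh)[0].tolist()
    cols = np.where(mask.sum(axis=0) >= thresh)[0].tolist()
    return rows, cols
```

A general Hough transform would vote over all (ρ, θ) bins; the restriction here is an assumption justified only because screen-reader focus frames are rectangular.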
  • In step S106, the Cartesian equation of each line is obtained as follows. S1101: convert the polar-coordinate equation into a rectangular-coordinate equation.
  • In step S107, the lines belonging to the focus frame are selected as follows. S1201: if the difference between the pixel counts of two adjacent lines equals a certain fixed value, they are taken as focus-frame lines; S1202: if not, they are treated as interfering lines.
  • In step S107, the control's center coordinates are computed as follows. S1301: for the vertical lines, take the mean to obtain the control's abscissa; S1302: for the horizontal lines, take the mean to obtain the control's ordinate.
  • In step S107, the control's width and height are computed as follows. S1401: for the vertical lines, the difference between the maximum and minimum gives the control's width; S1402: for the horizontal lines, the difference between the maximum and minimum gives the control's height.
  • In step S107, the control center is converted to screen percentages as follows. S1501: divide the abscissa by the image width to obtain the x-axis percentage; S1502: divide the ordinate by the image height to obtain the y-axis percentage.
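S1301-S1502 combine into a few lines of arithmetic. A sketch with hypothetical names, assuming the horizontal and vertical frame-line positions found in S107:

```python
def control_geometry(rows, cols, img_h, img_w):
    """rows: y positions of the horizontal frame lines; cols: x positions
    of the vertical ones. Center = mean (S1301-S1302), size = max - min
    (S1401-S1402), then normalize to screen ratios (S1501-S1502)."""
    cx, cy = sum(cols) / len(cols), sum(rows) / len(rows)
    width, height = max(cols) - min(cols), max(rows) - min(rows)
    return {"cx": cx / img_w, "cy": cy / img_h,
            "w": width / img_w, "h": height / img_h}
```

Expressing the result as ratios is what makes the method resolution-independent: the same ratios map back to physical screen coordinates on any device.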
  • In step S108, the control's function is determined as follows. S1601: perform text recognition with OCR to obtain the control's text; S1602: if S1601 detects no text, perform image matching and determine the function from a pre-built database.
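The fallback logic of S1601-S1602 in outline; `ocr` and `match_database` are injected stand-ins for an OCR engine and the pre-built icon-matching database, neither of which is specified in the text:

```python
def control_function(crop, ocr, match_database):
    """S1601: try text recognition first; S1602: if no text is found,
    fall back to image matching against the pre-built database."""
    text = ocr(crop)
    return text if text else match_database(crop)
```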
  • In step S109, the page tree is built as follows. S1701: combine each control's center coordinates and function into a key-value pair serving as a tree node; S1702: set an empty node as the root, with all controls on the application's home page taking the root as their parent; S1703: make all the controls on the page reached by clicking a control the children of the clicked control, and build the page tree by analogy.
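A minimal sketch of the page tree of S1701-S1703; the class and field names are illustrative. The path lookup at the end corresponds to the traversal that S110 uses to drive the robotic arm:

```python
class PageNode:
    """Pairs a control's center (screen-ratio key) with its recognized
    function (value); the root is an empty node (S1702)."""
    def __init__(self, center=None, label=None):
        self.center, self.label, self.children = center, label, []

    def add(self, center, label):
        child = PageNode(center, label)  # S1703: controls revealed by a
        self.children.append(child)      # click become the child nodes
        return child

def find_path(node, label):
    """Return the label sequence to click from the home page to reach the
    control named `label`, or None if it is not in the tree."""
    if node.label == label:
        return []
    for child in node.children:
        sub = find_path(child, label)
        if sub is not None:
            return [child.label] + sub
    return None
```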
  • In summary, the present invention provides a computer vision-based non-intrusive control recognition algorithm for mobile terminal applications, with the following beneficial effects: (1) it understands each control's functional meaning, so that beyond locating a control one also knows its page level and function; (2) it can be applied to complex scenarios, such as pages with interaction logic like pop-up windows and sub-pages; (3) it is universal, applicable to different platforms and device models.
  • Fig. 1 is the hardware-software interaction diagram of the mobile terminal application non-intrusive control recognition algorithm based on computer vision provided by the present invention
  • Fig. 2 is the overall flow chart of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention
  • Fig. 3 is an example of an image expansion method in the overall flow chart of the computer vision-based mobile terminal application non-intrusive control recognition algorithm provided by the present invention
  • Fig. 4 is a flow chart of edge detection in the overall flow chart of the computer vision-based mobile terminal application non-intrusive control recognition algorithm provided by the present invention
  • Fig. 5 is a flow chart of straight line detection in the overall flow chart of the computer vision-based mobile terminal application non-intrusive control recognition algorithm provided by the present invention
  • This example takes a certain APP as an example.
  • The method includes the following specific steps:
  • S102: Open the APP, perform a "swipe right" operation on the screen with the robotic arm, then take a screenshot, upload it, and crop the image.
  • S104: Extract from the screenshot of S102 the pixels inside the RGB range obtained in S103, yielding a single-channel image.
  • S105: Pad the single-channel image from S104 and perform edge detection on it.
  • S106: Perform straight-line detection on the edges obtained in S105 and derive each line's Cartesian equation.
  • S107: Filter the equations from S106 to keep the lines belonging to the focus frame, compute the control's center coordinates, width, and height, and convert them into screen ratios.
  • S108: Using the center coordinates and the width/height screen ratios from S107, crop the rectangle outlined by the focus frame and determine its function by computer vision.
  • S109: Pair the control coordinates from S107 with the text recognized in S108, and build a page tree whose nodes are these key-value pairs.
  • Fig. 1 is the hardware-software interaction diagram of the mobile terminal application non-intrusive control recognition algorithm based on computer vision provided by the present invention
  • Fig. 2 is the overall flow chart of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention
  • Fig. 3 is an example of the image-padding method in the overall flow chart of the computer vision-based mobile terminal application non-intrusive control recognition algorithm provided by the present invention: S801: append a zero-valued matrix 50 pixels wide to each of the left and right sides of the image; S802: to the image obtained in S801, append a zero-valued matrix 50 pixels high to each of its top and bottom sides.
  • Fig. 4 is the flow chart of edge detection in the overall flow chart of the algorithm: S901: apply Gaussian denoising to the image; S902: compute the gradient of the denoised image and, from the gradient, the edge magnitude and angle; S903: apply non-maximum suppression along the gradient direction using the magnitude and angle obtained in S902; S904: apply double-threshold edge linking to obtain the edges.
  • Fig. 5 is the flow chart of straight-line detection in the overall flow chart of the algorithm: S1001: convert the coordinates of each point of the image obtained in S904 into polar coordinates; S1002: compute the line equation corresponding to each point, points sharing a common line equation lying on the same line; S1003: count the pixels on each line; S1004: if the pixel count of a line obtained in S1003 exceeds a given threshold, keep the line; S1005: if it does not, discard the line.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a computer vision-based mobile terminal application control identification method. In the method, hardware and software methods are combined, and an accessibility function of a system is used, thus achieving a non-invasive, universal and low-error-rate mobile terminal application control identification method. First, a screen reader and corresponding software are opened, and a screenshot is uploaded to a server once a robotic arm is operated. Next, once the screenshot is preprocessed, color extraction and expansion are performed thereon to obtain a single-channel image. Then, edge detection and line detection are performed on the single-channel image, and center coordinates of a control are obtained after noise is filtered. Finally, the function of the control is distinguished by using a computer vision method. The steps above are repeated to build a page control tree of an app. The method may be applied to complex scenarios, achieves the understanding of the functional meaning of each control and has strong universality, and may be applied to scenarios such as the automated testing of mobile applications, page structure decomposition, and human-computer interaction analysis.

Description

Recognition method of mobile terminal application controls based on computer vision

Technical field:
The invention relates to a computer vision-based, non-intrusive control recognition algorithm for mobile terminal applications, and belongs to the field of computer software technology.
Background technique:
The number of mobile applications has grown explosively with the development of the mobile Internet, and software design has become increasingly complex. Consequently, demands for automated testing, page-structure decomposition, and human-computer interaction analysis of mobile applications are also rising, and all of them depend on control identification methods based on the graphical user interface (GUI), i.e. automatically identifying the interactive visual components in a GUI. For example, to guarantee the product quality of mobile applications, automated GUI testing is often required, while the current mainstream "record and playback" method needs the number, positions, and possible interactions of the controls in the GUI to be determined in advance.
At present, common control identification methods mostly identify controls from control properties and fall into three main categories: coordinate-based, source-code-based, and control-tree-based identification. They have the following main defects. (1) They cannot understand the functional meaning of each control: existing methods only classify controls by attributes and cannot truly identify each control's specific purpose; moreover, they fail when attribute values are empty or duplicated. (2) They cannot be applied to complex scenarios, such as pages with interaction logic like pop-up windows and sub-pages. (3) They are not universal: because control identification and control invocation logic differ between the Android and iOS platforms, a control identification scheme cannot be reused across them.
Invention content:
To address the above problems and difficulties, the present invention proposes a computer vision-based mobile terminal application control recognition method. Compared with coordinate-based control identification, this method adapts to different platforms and to devices with different resolutions, and is therefore more universal. Compared with source-code-based identification, this method is non-intrusive: it does not require the source code of the software, so it can be used in scenarios such as black-box testing and has a wider range of applications. Compared with control-tree-based identification, this method is not affected by the page hierarchy or control positions, and can flexibly handle various complex scenes. In addition, the method achieves semantic understanding of each control: beyond determining a control's position and attributes, it also identifies the control's specific purpose.
The specific steps of the computer vision-based mobile terminal application control recognition method are as follows. S101: Open the screen reader, whose main function is, by inspecting the mobile application's GUI and the additional information the application exposes for its accessibility features, to describe on-screen elements by voice and outline them with a focus frame. S102: Open the target software, perform one operation on the screen with the robotic arm, take a screenshot, upload it to the server, and preprocess the image. S103: For the screenshot obtained in S102, determine the RGB values corresponding to the color of the focus frame, and derive the RGB range by superimposing that color on different backgrounds. S104: Extract from the screenshot of S102 the pixels inside the RGB range obtained in S103, yielding a single-channel image. S105: Pad the image from S104 and perform edge detection on the single-channel image to find the edges of the focus frame. S106: Perform straight-line detection on the edges obtained in S105 and derive each line's Cartesian equation. S107: Filter the equations from S106 to keep the lines belonging to the focus frame, compute the control's center coordinates, width, and height, and convert them into screen ratios. S108: Using the center coordinates and the width/height screen ratios from S107, crop the rectangle outlined by the focus frame and determine its function by computer vision. S109: Pair the control coordinates from S107 with the text recognized in S108, and build a page tree whose nodes are these key-value pairs. S110: If the control to be clicked is known, traverse the page tree obtained in S109 to find the node corresponding to the control; the path from the parent node to the target node is the operation path after opening the APP. From the control's coordinates as a percentage of the image, the physical coordinates on the screen are obtained, and the robotic arm can directly double-tap controls until the target control is reached.
In step S101, the screen readers used are as follows. S201: Talkback on Android and VoiceOver on iOS.

In step S102, the screenshot requirement is as follows. S301: the image must be in PNG format.

In step S102, the robotic-arm operations are as follows. S401: swipe left, swipe right, and double tap.

In step S102, the image-preprocessing scheme is as follows. S501: crop the screen region occupied by the graphical user interface of the mobile software.

In step S103, the RGB range is obtained as follows. S601: determine the several RGB values corresponding to the focus frame, yielding an initial range; S602: superimpose different gray levels onto the background within the initial range obtained in S601 to get the final range.

In step S104, the single-channel image is obtained as follows. S701: traverse the image matrix from S102 against the RGB range obtained in S103; S702: set the value of each pixel inside the RGB range to 1; S703: set the value of each pixel outside the RGB range to 0.

In step S105, the image is padded as follows. S801: append a zero-valued matrix 50 pixels wide to each of the left and right sides of the image; S802: to the image obtained in S801, append a zero-valued matrix 50 pixels high to each of its top and bottom sides.

In step S105, the edge-detection scheme is as follows. S901: apply Gaussian denoising to the image; S902: compute the gradient of the denoised image obtained in S901 and, from the gradient, the edge magnitude and angle; S903: apply non-maximum suppression along the gradient direction using the magnitude and angle obtained in S902; S904: apply double-threshold edge linking to obtain the edges.

In step S106, straight-line detection proceeds as follows. S1001: convert the coordinates of each point of the image obtained in S904 into polar coordinates; S1002: compute the line equation corresponding to each point, points sharing a common line equation lying on the same line; S1003: count the pixels on each line; S1004: if the pixel count of a line obtained in S1003 exceeds a given threshold, keep the line; S1005: if it does not exceed the threshold, discard the line.

In step S106, the Cartesian equation of each line is obtained as follows. S1101: convert the polar-coordinate equation into a rectangular-coordinate equation.

In step S107, the lines belonging to the focus frame are selected as follows. S1201: if the difference between the pixel counts of two adjacent lines equals a certain fixed value, they are taken as focus-frame lines; S1202: if not, they are treated as interfering lines.

In step S107, the control's center coordinates are computed as follows. S1301: for the vertical lines, take the mean to obtain the control's abscissa; S1302: for the horizontal lines, take the mean to obtain the control's ordinate.

In step S107, the control's width and height are computed as follows. S1401: for the vertical lines, the difference between the maximum and minimum gives the control's width; S1402: for the horizontal lines, the difference between the maximum and minimum gives the control's height.

In step S107, the control center is converted to screen percentages as follows. S1501: divide the abscissa by the image width to obtain the x-axis percentage; S1502: divide the ordinate by the image height to obtain the y-axis percentage.

In step S108, the control's function is determined as follows. S1601: perform text recognition with OCR to obtain the control's text; S1602: if S1601 detects no text, perform image matching and determine the function from a pre-built database.

In step S109, the page tree is built as follows. S1701: combine each control's center coordinates and function into a key-value pair serving as a tree node; S1702: set an empty node as the root, with all controls on the application's home page taking the root as their parent; S1703: make all the controls on the page reached by clicking a control the children of the clicked control, and build the page tree by analogy.
In summary, the present invention provides a computer vision-based non-intrusive control recognition algorithm for mobile terminal applications, with the following beneficial effects: (1) it understands each control's functional meaning, so that beyond locating a control one also knows its page level and function; (2) it can be applied to complex scenarios, such as pages with interaction logic like pop-up windows and sub-pages; (3) it is universal, applicable to different platforms and device models.
Description of drawings:
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative work.
图1是本发明提供的基于计算机视觉的移动端应用非侵入式控件识别算法的硬软件交互图;Fig. 1 is the hardware-software interaction diagram of the mobile terminal application non-intrusive control recognition algorithm based on computer vision provided by the present invention;
图2是本发明提供的基于计算机视觉的移动端应用非侵入式控件识别算法的总体流程图;Fig. 2 is the overall flow chart of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention;
图3是是本发明提供的基于计算机视觉的移动端应用非侵入式控件识别算法的总体流程图中图像扩充方法示例;Fig. 3 is an example of an image expansion method in the overall flow chart of the computer vision-based mobile terminal application non-intrusive control recognition algorithm provided by the present invention;
图4是本发明提供的基于计算机视觉的移动端应用非侵入式控件识别算法的总体流程图中边缘检测的流程图;Fig. 4 is a flow chart of edge detection in the overall flow chart of the computer vision-based mobile terminal application non-intrusive control recognition algorithm provided by the present invention;
图5是本发明提供的基于计算机视觉的移动端应用非侵入式控件识别算法的总体流程图中直线检测的流程图;Fig. 5 is a flow chart of straight line detection in the overall flow chart of the computer vision-based mobile terminal application non-intrusive control recognition algorithm provided by the present invention;
Detailed description of the embodiments:
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
This example takes a particular app as an illustration; the method comprises the following specific steps:
S101: Turn on the screen reader.
S102: Open the app; the robotic arm performs one "swipe right" operation on the screen, after which a screenshot is taken, uploaded, and cropped.
S103: For the screenshot obtained in S102, determine the RGB matrix range corresponding to the color of the focus frame.
S104: Extract the pixels within the RGB range from S103 out of the screenshot from S102 to obtain a single-channel image.
S105: Expand the single-channel image from S104 and perform edge detection on it.
S106: Perform line detection on the edges obtained in S105, and derive the lines' equations in Cartesian coordinates.
S107: Filter the equations obtained in S106 to find the lines corresponding to the focus frame, compute the control's center coordinates and its width and height, and convert them into screen ratios.
S108: Using the control's center coordinates and width/height screen ratios from S107, crop the rectangle outlined by the focus frame and determine its function by computer vision.
S109: Pair the control coordinates from S107 with the text recognized in S108, and build a page tree whose nodes are these key-value pairs.
S110: Given the control to be tapped, traverse the page tree from S109 to find its node; convert the control's coordinates, expressed as percentages of the image, into physical screen coordinates, so that the robotic arm can double-tap controls directly until the target control is reached.
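Steps S103 and S104 reduce to a per-pixel range test. A NumPy sketch, where the green lower/upper bounds are illustrative placeholders for the calibrated focus-frame color range produced by S103:

```python
import numpy as np

def focus_mask(screenshot, lo, hi):
    """S104: build a binary single-channel image whose pixels are 1 where
    the screenshot's RGB value lies inside [lo, hi] and 0 elsewhere
    (the per-pixel rule of S702/S703)."""
    lo = np.asarray(lo, dtype=np.uint8)
    hi = np.asarray(hi, dtype=np.uint8)
    inside = np.all((screenshot >= lo) & (screenshot <= hi), axis=-1)
    return inside.astype(np.uint8)
```

The resulting mask contains (ideally) only the focus frame, which is what the later expansion, edge-detection, and line-detection stages operate on.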
Fig. 1 is the hardware-software interaction diagram of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention;
Fig. 2 is the overall flowchart of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention;
Fig. 3 is an example of the image expansion method within the overall flowchart of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention. S801: splice a matrix of pixel value 0 and width 50 onto each of the left and right sides of the image; S802: onto the image spliced in S801, splice a matrix of pixel value 0 and height 50 onto each of its top and bottom sides.
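The splicing in S801/S802 is plain zero-padding; in NumPy it might look like the following (cv2.copyMakeBorder with a constant border would be an equivalent), the point being that a focus frame touching the screen edge still yields a closed rectangle after padding:

```python
import numpy as np

def expand_image(mask, pad=50):
    """S801: splice zero-valued columns of width 50 onto the left and right;
    S802: splice zero-valued rows of height 50 onto the top and bottom."""
    h, w = mask.shape
    cols = np.zeros((h, pad), dtype=mask.dtype)
    widened = np.hstack([cols, mask, cols])               # S801
    rows = np.zeros((pad, w + 2 * pad), dtype=mask.dtype)
    return np.vstack([rows, widened, rows])               # S802
```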
Fig. 4 is the flowchart of edge detection within the overall flowchart of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention. S901: apply Gaussian denoising to the image; S902: compute the gradient of the denoised image from S901, and from the gradient compute the edge magnitude and angle; S903: using the edge magnitude and angle from S902, perform non-maximum suppression along the gradient direction; S904: perform double-threshold edge linking to obtain the edges.
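S901–S904 together describe a Canny-style detector (in practice one would likely call cv2.Canny for the whole pipeline). As an isolated sketch of the S902 stage only, computing the gradient and the edge magnitude and angle with simplified central-difference kernels:

```python
import numpy as np

def gradient_magnitude_angle(img):
    """S902: image gradient, then edge magnitude and angle. The magnitude
    and angle feed non-maximum suppression (S903) and double-threshold
    edge linking (S904) in the full detector."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal central difference
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical central difference
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx))   # gradient direction for S903
    return magnitude, angle
```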
Fig. 5 is the flowchart of line detection within the overall flowchart of the computer vision-based non-intrusive control recognition algorithm for mobile terminal applications provided by the present invention. S1001: convert the coordinates of each point of the image obtained in S904 into polar coordinates; S1002: compute the line equation corresponding to each coordinate, where coordinates sharing a common line equation lie on the same line; S1003: count the pixels on each line; S1004: if a line's pixel count from S1003 exceeds a threshold, keep that line; S1005: if it does not, discard the line.
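The voting scheme of S1001–S1005 is a standard Hough transform; a compact accumulator sketch follows (real code would typically use cv2.HoughLines), returning (rho, theta) pairs that S1101 then converts to Cartesian form:

```python
import numpy as np

def hough_lines(edges, threshold):
    """S1001/S1002: map each edge pixel to its family of polar lines
    rho = x*cos(theta) + y*sin(theta); S1003: vote into an accumulator;
    S1004/S1005: keep only lines whose vote count reaches the threshold."""
    thetas = np.deg2rad(np.arange(180))                 # 1-degree resolution
    diag = int(np.ceil(np.hypot(*edges.shape)))
    accumulator = np.zeros((2 * diag + 1, len(thetas)), dtype=int)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        accumulator[rhos + diag, np.arange(len(thetas))] += 1
    keep = np.argwhere(accumulator >= threshold)
    return [(int(r) - diag, float(thetas[t])) for r, t in keep]
```

For the near-axis-aligned lines of a focus frame, theta near 0 gives the vertical lines and theta near 90 degrees the horizontal ones, which is what the filtering in S107 exploits.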

Claims (17)

  1. A computer vision-based mobile terminal application control recognition method, characterized by comprising the following steps:
    S101: turn on the screen reader, whose main function is to inspect the GUI of the mobile application, together with the additional information the application provides for its accessibility features, describe the on-screen elements by speech, and outline them with a focus frame;
    S102: open the corresponding software; the robotic arm performs one operation on the screen, after which a screenshot is taken, uploaded to the server, and preprocessed;
    S103: for the screenshot obtained in S102, determine the RGB matrix corresponding to the color of the focus frame, and obtain the RGB range by superimposing different backgrounds;
    S104: extract the pixels within the RGB range from S103 out of the screenshot from S102 to obtain a single-channel image;
    S105: expand the image from S104 and perform edge detection on the resulting single-channel image to detect the edges of the focus frame;
    S106: perform line detection on the edges obtained in S105, and derive the lines' equations in Cartesian coordinates;
    S107: filter the equations obtained in S106 to find the lines corresponding to the focus frame, compute the control's center coordinates and width and height, and convert them into screen ratios;
    S108: using the control's center coordinates and width/height screen ratios from S107, crop the rectangle outlined by the focus frame and determine its function by computer vision;
    S109: pair the control coordinates from S107 with the text recognized in S108, and build a page tree whose nodes are these key-value pairs;
    S110: if the control to be tapped is known, traverse the page tree from S109 to find its node, the path from the parent node to the target node being the operation path after the app is opened; convert the control's coordinates, expressed as percentages of the image, into physical screen coordinates, so that the robotic arm can double-tap controls directly until the target control is reached.
  2. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S101 the screen reader takes the following specific form:
    S201: the screen reader is TalkBack on Android and VoiceOver on iOS.
  3. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S102 the specific requirement for the screenshot is:
    S301: the image must be in PNG format.
  4. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S102 the specific operations of the robotic arm are:
    S401: swipe left, swipe right, and double-tap.
  5. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S102 the specific image preprocessing scheme is:
    S501: crop the portion of the screen occupied by the mobile software's graphical user interface.
  6. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S103 the specific method for obtaining the RGB matrix range is:
    S601: determine the several RGB matrices corresponding to the focus frame to obtain an initial range; S602: superimpose different gray levels on the background according to the initial range obtained in S601 to obtain the final range.
  7. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S104 the single-channel image is obtained as follows:
    S701: traverse the image matrix from S102 using the RGB matrix range from S103; S702: for pixels within the RGB matrix range, set the corresponding value to 1; S703: for pixels outside the RGB matrix range, set the corresponding value to 0.
  8. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S105 the image is expanded as follows:
    S801: splice a matrix of pixel value 0 and width 50 onto each of the left and right sides of the image; S802: onto the image spliced in S801, splice a matrix of pixel value 0 and height 50 onto each of its top and bottom sides.
  9. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S105 the edge detection scheme is:
    S901: apply Gaussian denoising to the image; S902: compute the gradient of the denoised image from S901, and from the gradient compute the edge magnitude and angle; S903: using the edge magnitude and angle from S902, perform non-maximum suppression along the gradient direction; S904: perform double-threshold edge linking to obtain the edges.
  10. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S106 the specific line detection scheme is:
    S1001: convert the coordinates of each point of the image obtained in S904 into polar coordinates; S1002: compute the line equation corresponding to each coordinate, where coordinates sharing a common line equation lie on the same line; S1003: count the pixels on each line; S1004: if a line's pixel count from S1003 exceeds a threshold, keep that line; S1005: if it does not, discard the line.
  11. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S106 the specific method for obtaining the lines' Cartesian equations is:
    S1101: convert the polar-coordinate equations into Cartesian equations.
  12. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S107 the specific method for selecting the lines corresponding to the focus frame is:
    S1201: if the difference between the pixel values of two adjacent lines satisfies a fixed value, they are considered lines corresponding to the focus frame; S1202: if it does not, the line is considered an interference line.
  13. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S107 the specific method for computing the control's center coordinates is:
    S1301: for the vertical lines, take the mean to obtain the control's x-coordinate; S1302: for the horizontal lines, take the mean to obtain the control's y-coordinate.
  14. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S107 the specific method for computing the control's width and height is:
    S1401: for the vertical lines, compute the difference between the maximum and the minimum to obtain the control's width; S1402: for the horizontal lines, compute the difference between the maximum and the minimum to obtain the control's height.
  15. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S107 the method for computing the percentage of the screen occupied by the control's center is:
    S1501: divide the x-coordinate by the image width to obtain the x-axis percentage; S1502: divide the y-coordinate by the image height to obtain the y-axis percentage.
  16. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S108 the specific method for determining the control's function is:
    S1601: perform text recognition with OCR to obtain the text corresponding to the control; S1602: if S1601 detects no text, perform image matching and determine the function against the pre-built database.
  17. The computer vision-based mobile terminal application control recognition method according to claim 1, characterized in that in step S109 the specific method for constructing the page tree is:
    S1701: combine the control's center coordinates and its function into a key-value pair, which serves as a tree node; S1702: set an empty node as the root; all controls on the mobile application's home page take the root as their parent; S1703: all controls on the page reached by tapping a given control become child nodes of that control, and so on, building up the page tree.
PCT/CN2021/098490 2021-05-31 2021-06-05 Computer vision-based mobile terminal application control identification method WO2022252239A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110597673.X 2021-05-31
CN202110597673.XA CN113434072B (en) 2021-05-31 2021-05-31 Mobile terminal application control identification method based on computer vision

Publications (1)

Publication Number Publication Date
WO2022252239A1 true WO2022252239A1 (en) 2022-12-08

Family

ID=77803292


Country Status (2)

Country Link
CN (1) CN113434072B (en)
WO (1) WO2022252239A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195958A1 (en) * 2007-02-09 2008-08-14 Detiege Patrick J Visual recognition of user interface objects on computer
US20110047488A1 (en) * 2009-08-24 2011-02-24 Emma Butin Display-independent recognition of graphical user interface control
CN108509342A (en) * 2018-04-04 2018-09-07 成都中云天下科技有限公司 A kind of precisely quick App automated testing methods
CN110990238A (en) * 2019-11-13 2020-04-10 南京航空航天大学 Non-invasive visual test script automatic recording method based on video shooting
CN112181255A (en) * 2020-10-12 2021-01-05 深圳市欢太科技有限公司 Control identification method and device, terminal equipment and storage medium
CN112597065A (en) * 2021-03-03 2021-04-02 浙江口碑网络技术有限公司 Page testing method and device
CN112657176A (en) * 2020-12-31 2021-04-16 华南理工大学 Binocular projection man-machine interaction method combined with portrait behavior information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0612128D0 (en) * 2006-06-19 2006-07-26 British Telecomm Apparatus & Method for Selecting Menu Items
CN105045489B (en) * 2015-08-27 2018-05-29 广东欧珀移动通信有限公司 A kind of button control method and device
CN109922363A (en) * 2019-03-15 2019-06-21 青岛海信电器股份有限公司 A kind of graphical user interface method and display equipment of display screen shot


Also Published As

Publication number Publication date
CN113434072A (en) 2021-09-24
CN113434072B (en) 2022-06-07


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21943600

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE