CN117196546A - RPA flow executing system and method based on page state understanding and large model driving

Info

Publication number: CN117196546A
Application number: CN202311478926.7A
Authority: CN
Inventors: 宋志龙
Original assignee: Hangzhou Real Intelligence Technology Co ltd
Current assignee: Hangzhou Real Intelligence Technology Co ltd
Priority date: 2023-11-08
Filing date: 2023-11-08
Publication date: 2023-12-08

Abstract

The invention belongs to the technical field of RPA flow configuration, and particularly relates to an RPA flow execution system and method based on page state understanding and large model driving. The system comprises: the business process disassembly module is used for disassembling the business demand instruction described by the language into a specific operation step instruction; the page state understanding and target positioning module is used for describing page contents and positioning target elements; the action execution module is used for receiving the operation step instruction from the business process disassembly module and the target element position from the page understanding and target positioning module, and executing corresponding operation actions through component call. The invention has the characteristics that the process disassembly, page identification and action execution can be completed only by describing the service requirement of the user through natural language in detail.

Description

RPA flow executing system and method based on page state understanding and large model driving

Technical Field

The invention belongs to the technical field of RPA flow configuration, and particularly relates to an RPA flow execution system and method based on page state understanding and large model driving.

Background

Robotic process automation (RPA) is an automation technology that simulates human actions to perform operations such as clicking and typing on a computer in place of a person. It can greatly improve working efficiency and free people from complicated, highly regular and highly repetitive tasks.

Existing RPA systems generally require an automated process to be built by dragging and dropping components. Business personnel must study the RPA client in depth before they can map each operation step to the proper component and build the process; this demands a degree of programming thinking and imposes a learning and usage threshold.

At present, RPA process construction mainly involves the following technologies:

1. RPA (robotic process automation) technology:

RPA technology simulates human actions to perform operations such as clicking and typing on a computer in place of a person, which can greatly improve working efficiency and free people from complicated, highly regular and highly repetitive tasks. The common form of RPA process construction is drag-and-drop action entry: the user maps each click, input and other action in the operation process to a component in the RPA system, and the components together form the complete RPA process.

2. Page element identification techniques:

Page element identification is an important component of RPA technology. The core of RPA is manipulating page elements, so identifying them is a fundamental requirement. The techniques include, but are not limited to, element detection, OCR text recognition and icon classification; their objective is to obtain the location, text content and icon meaning of each element on the page for use during RPA flow execution.

3. Large model technology:

Large models are very large neural network models in deep learning; they are called large models because they typically have hundreds of millions or even billions of parameters. Large language models, represented by ChatGPT, integrate the capabilities of many natural language processing (NLP) tasks, such as question answering, summarization and reasoning. Their strong reasoning capability also supports image understanding: a large language model can be combined with an image encoder into a vision-language multi-modal large model and trained on image features, which gives it image description and target positioning capabilities. Examples include GLIP, Kosmos-2 and Qwen-VL.

However, the above-described related art has the following limitations:

1. Existing RPA systems still have a usage threshold and cumbersome interaction, and are poorly robust to interference:

Although the component recommendation mode of IPA has greatly lowered the usage threshold of RPA, so that users can build automated processes without learning hundreds of process components, the user still needs to be familiar with the operation process. For example, to submit a simple leave request on an OA system, the operation steps differ between OA systems, so the user must be familiar with the business process before building the flow; a series of click and input actions must also be performed manually while the process is built, so the whole procedure remains cumbersome. In addition, conventional RPA systems are poorly robust to interference: a fixed flow, once built, can only run according to its fixed steps, and execution fails directly as soon as an abnormal condition occurs. For example, a normal login flow completes after the account and password are entered and the login button is clicked, but some web pages occasionally pop up a verification code page after the login button is clicked. A preset flow with fixed steps cannot cope with this situation, so the flow fails.

2. Existing page recognition technology can only acquire fragmented element information:

To execute a process from a language description, the RPA system needs to understand the language content and give feedback in combination with the page state. Specifically, when the action "click the login button" needs to be performed, the page recognition model should be able to output the coordinates of the "login button". In existing page recognition technology, different models recognize different content: for example, a detection model recognizes icons, input boxes and the like, and an OCR model recognizes text content. The acquired element information cannot be associated, however, so it cannot be fully utilized. For example, if an account and a password need to be entered on a login interface, the detection model recognizes two input boxes and the OCR model recognizes the fields "account" and "password", but it is difficult to tell directly which input box is for the account and which is for the password, so a target element cannot be located directly from a language description.

Therefore, it is important to design an RPA flow execution system and method based on page state understanding and large model driving, with which process disassembly, page identification and action execution can be completed when the user merely describes his or her own business requirements in detail in natural language.

Disclosure of Invention

The invention aims to solve the problems of a high usage threshold and poor convenience in conventional RPA process construction in the prior art, and provides an RPA flow execution system and method based on page state understanding and large model driving, with which process disassembly, page identification and action execution can be completed when the user merely describes his or her own business requirements in detail in natural language.

In order to achieve the aim of the invention, the invention adopts the following technical scheme:

an RPA process execution system based on page state understanding and large model driving, comprising:

the business process disassembly module is used for disassembling the business demand instruction described by the language into a specific operation step instruction;

the page state understanding and target positioning module is used for describing page contents and positioning target elements;

the action execution module is used for receiving the operation step instruction from the business process disassembly module and the target element position from the page understanding and target positioning module, and executing corresponding operation actions through component call.

Preferably, the business process disassembly module comprises a large language model; the large language model is obtained by training a general large language model on a corpus containing data samples of business instructions and flow steps.

Preferably, the generic large language model comprises ChatGPT.

Preferably, the page state understanding and target positioning module comprises a vision-language multi-modal large model for understanding and identifying computer pages; the vision-language multi-modal large model is obtained by training a basic multi-modal large model on a plurality of image data samples of web pages and application software interfaces;

the image data sample comprises a page screenshot, descriptions of elements of the page and descriptions of relationships among the elements.

Preferably, the multi-modal large model includes the Kosmos-2 model and the Qwen-VL model.

Preferably, in the page state understanding and target positioning module, the process of describing the page content and positioning the target element specifically includes:

locating, according to the operation step instruction produced by the business process disassembly module, the position of the target element on which the corresponding action is to be performed; if the target element cannot be located, describing the page state and feeding it back to the large language model responsible for disassembling the business instruction, as a reference for adjusting the execution actions.

Preferably, in the action execution module, the execution of the corresponding operation action by the component call includes mouse click and keyboard input.

The invention also provides an RPA flow execution method based on page state understanding and large model driving, comprising the following steps:

S1, inputting a task description into the business process disassembly module, which disassembles it into an action sequence A1, A2, ..., An;

S2, inputting a single-step action instruction An and a screenshot of the current state of the corresponding operation page into the page state understanding and target positioning module;

S3, performing, by the vision-language multi-modal large model in the page state understanding and target positioning module, page understanding and positioning of the target element corresponding to the action instruction An;

if the positioning succeeds, inputting the position of the located target element into the action execution module, which executes the action instruction An through a component call; if the positioning fails, the page state understanding and target positioning module feeds back the target element positioning failure and a description of the page state to the large language model in the business process disassembly module, and the large language model adjusts the subsequent action sequence according to the feedback so that the process can execute smoothly;

S4, if the action sequence has been fully executed, ending the execution; otherwise, repeating steps S1 to S3.

Compared with the prior art, the invention has the following beneficial effects:

(1) The user only needs to describe his or her own business requirements in detail in natural language, and the RPA agent completes process disassembly, page identification and action execution; moreover, relying on the strong understanding and reasoning capability of the large language model, the disassembled execution plan can be dynamically adjusted according to the page state and the execution target.

(2) To support the capability of each module of the RPA agent system, the invention first constructs a "business instruction-flow step" corpus and uses it to fine-tune a general large language model, so that the large language model acquires the capability of disassembling business instructions; at the same time, having learned a large number of action sequences, the large language model with strong reasoning capability can adjust the subsequent execution plan in real time according to the current page state. The invention then constructs a large "page state description and page target description" image-text dataset and uses it to fine-tune a vision-language multi-modal large model, so that an existing vision-language multi-modal large model acquires the capability of understanding computer pages and locating page elements; during process execution, it only needs to receive the disassembled action description to output the position of the corresponding operation element.

(3) The invention not only greatly lowers the threshold for RPA users and greatly improves the usability and convenience of the RPA system, but also copes with some abnormal conditions during flow execution and greatly improves the stability of flow execution.

Drawings

FIG. 1 is a schematic diagram of an overall functional architecture of an RPA flow execution system based on page state understanding and large model driving in accordance with the present invention;

FIG. 2 is a schematic diagram of a construction scheme of the RPA flow execution system based on page status understanding and large model driving in the present invention;

FIG. 3 is a flowchart of an RPA flow execution system based on page status understanding and large model driving in practical application according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a data sample of "business instruction-flow step" according to the present invention;

FIG. 5 is a schematic diagram of a data sample of "Page State description and Page target description" in the present invention;

FIG. 6 is a schematic diagram of the instruction disassembly process according to the present invention;

FIG. 7 is a schematic diagram of a target element positioning process according to the present invention;

FIG. 8 is a schematic diagram of component invocation and execution in accordance with the present invention;

FIG. 9 is a schematic diagram of the page state feedback process when positioning of a target element fails according to the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention, specific embodiments of the present invention will be described below with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.

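As shown in FIG. 1, the present invention provides an RPA flow execution system based on page state understanding and large model driving, comprising a business process disassembly module, a page state understanding and target positioning module, and an action execution module.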

Specifically, the business process disassembly module is mainly responsible for disassembling business instructions described in natural language into single-step execution steps that can each be mapped to a corresponding operation component. It contains a large language model that is fine-tuned from a basic general large language model (including but not limited to general large language models such as ChatGPT) on a corpus containing a large number of "business instruction-flow step" data samples. In addition to the capabilities of a general large language model, it therefore has the capability of disassembling business instructions described in language into specific executable operation steps. A data sample takes a form such as "Notify Zhang San on Enterprise WeChat to come to room 301 for a meeting -> 1. Open Enterprise WeChat. 2. Enter "Zhang San" in the search box. 3. Enter "come to room 301 for a meeting" in the chat input box. 4. Click the send button." In addition, the large language model can modify the execution plan according to page state feedback. For example, when it is about to "search for Zhang San" after opening Enterprise WeChat, the large language model, with its dialogue and reasoning capability, first asks for the login account information, adds the action "log in to Enterprise WeChat, account xxx, password xxx" to the disassembled steps, and completes the account login on the login interface before executing the subsequent actions.

The page state understanding and target positioning module has the capability of describing page content and locating target elements. It comprises a vision-language multi-modal large model capable of understanding and identifying computer pages; this model is fine-tuned from a basic vision-language multi-modal large model (including but not limited to multi-modal large models such as Kosmos-2 and Qwen-VL) on a large number of image data samples of web pages and application software interfaces. Each data sample contains a page screenshot, a description of each element of the page and a description of the relationships between the elements, so that the model acquires the capability of locating targets from natural language descriptions. On the one hand, the module is responsible for locating the position of the target element on which an operation is to be performed, according to the steps produced by the business process disassembly module; for example, on receiving the action "input account xxx", it outputs the coordinates of the "account input box". On the other hand, when the target element cannot be located, the module is responsible for describing the page state and feeding it back to the large language model responsible for disassembling the business instruction, as a reference for adjusting the execution actions. For example, when the actions "login -> search news" are executed, a verification code page sometimes pops up after login, so the model cannot locate the input box element for the search content. The module then describes the page state and feeds back the information "no input box is detected; the current page is a verification code interface" to the large language model of the business process disassembly module.

The action execution module completes actions such as mouse clicks and keyboard input through component calls. It receives an action instruction from the business process disassembly module and a target element position from the page state understanding and target positioning module, and executes the corresponding action. For example, if the action "click the send button" and the element position of the send button are received, it calls the click component of the RPA system to move the mouse and complete the click.
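The component call described above can be illustrated with a minimal sketch. The names Action, ActionExecutor and box_center, the action kinds "click" and "input", and the pixel box format are hypothetical and chosen only for illustration; the sketch records the calls in a log instead of driving a real mouse or keyboard.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x1, y1, x2, y2) in screen pixels


@dataclass
class Action:
    kind: str                   # hypothetical labels: "click" or "input"
    target: str                 # natural-language name of the target element
    text: Optional[str] = None  # text to type, for "input" actions only


def box_center(box: Box) -> Tuple[int, int]:
    """Return the centre of a bounding box, used here as the click point."""
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2


class ActionExecutor:
    """Maps an action kind to a component; components only log what they would do."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.components: Dict[str, Callable[[Action, Box], None]] = {
            "click": self._click,
            "input": self._input,
        }

    def _click(self, action: Action, box: Box) -> None:
        x, y = box_center(box)
        self.log.append(f"move to ({x}, {y}) and click")

    def _input(self, action: Action, box: Box) -> None:
        # Focus the element first, then type the text.
        self._click(action, box)
        self.log.append(f"type {action.text!r}")

    def execute(self, action: Action, box: Box) -> None:
        if action.kind not in self.components:
            raise ValueError(f"no component for action kind {action.kind!r}")
        self.components[action.kind](action, box)


executor = ActionExecutor()
executor.execute(Action("click", "send button"), (100, 200, 140, 220))
print(executor.log)
```

In a real deployment each component would call the mouse and keyboard primitives of the RPA system; the dictionary lookup only shows how a single-step instruction can be routed to a component.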

Together, the functional modules for "instruction receiving and disassembly", "target element positioning and feedback" and "action execution" form an RPA agent system that can control a computer to carry out specific business from a language description.

In addition, as shown in FIG. 1, the invention also provides an RPA flow execution method based on page state understanding and large model driving, comprising the following steps:

1. inputting a task description into the business process disassembly module, which disassembles it into an action sequence A1, A2, ..., An;

2. inputting a single-step action instruction An and a screenshot of the current state of the corresponding operation page into the page state understanding and target positioning module;

3. performing, by the vision-language multi-modal large model in the page state understanding and target positioning module, page understanding and positioning of the target element corresponding to the action instruction An;

4. if the action sequence has been fully executed, ending the execution; otherwise, repeating steps 1 to 3.

FIG. 2 is a schematic diagram of the construction scheme of the RPA agent system of the invention. After a task instruction described in natural language is input into the RPA agent system, it is disassembled into A1, A2, ..., An by the Llama-V2-based instruction disassembly large language model; the Kosmos-2-based vision-language multi-modal large model for page state understanding and target positioning then locates the target element on the current execution interface according to the current step instruction. If the target element is located successfully, the RPA action execution module is called to execute the operation; otherwise, the positioning failure and the multi-modal large model's description of the page state are passed to the instruction disassembly large model, which adjusts the execution steps, and the operation then continues until the task is completed.
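The control flow of FIG. 2 can be sketched as a simple loop. This is only an illustration under stated assumptions: the function name run_flow, the callables disassemble, locate, execute, adjust and capture, and the max_steps guard are all hypothetical stand-ins for the two fine-tuned models and the RPA components, and none of them is specified by the invention.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x1, y1, x2, y2)


def run_flow(
    task: str,
    disassemble: Callable[[str], List[str]],                       # stands in for the instruction disassembly LLM
    locate: Callable[[str, str], Tuple[Optional[Box], str]],       # stands in for the vision-language model
    execute: Callable[[str, Box], None],                           # stands in for the action execution module
    adjust: Callable[[List[str], str], List[str]],                 # LLM plan adjustment on feedback
    capture: Callable[[], str],                                    # returns the current page screenshot
    max_steps: int = 50,                                           # safety guard, an assumption of this sketch
) -> List[str]:
    """Run the disassemble -> locate -> execute loop and return the executed actions."""
    plan = list(disassemble(task))
    done: List[str] = []
    steps = 0
    while plan:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("flow did not finish within max_steps")
        action = plan[0]
        box, page_description = locate(action, capture())
        if box is not None:
            # Positioning succeeded: execute the step and move on.
            execute(action, box)
            done.append(plan.pop(0))
        else:
            # Positioning failed: feed the page description back and replan.
            plan = list(adjust(plan, page_description))
    return done
```

The loop keeps the failed step at the head of the plan until the adjustment callable returns a new plan, which mirrors the feedback path from the positioning model to the disassembly model.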

Based on the scheme of the invention, as shown in FIG. 3, an embodiment of the invention is illustrated with a practical use case, which also demonstrates some of its beneficial effects:

The RPA agent system provided by the invention automatically executes the flow "search for men's coats on Taobao, Jingdong (JD.com) and Pinduoduo", with the prompt "if login is required, the login account is 123 and the password is abc". From experience, Taobao generally does not require login, whereas Jingdong and Pinduoduo generally require login before commodities can be searched. This experience is recorded in the constructed "business instruction-flow step" corpus:

1. First, Llama-V2 is fine-tuned on the constructed "business instruction-flow step" corpus (the business process disassembly large language model can also be trained from other large language model bases) so that it acquires the process disassembly and action sequence adjustment capabilities; then Kosmos-2 (other multi-modal large model bases can also be used) is fine-tuned on the constructed "page state description and page target description" dataset to obtain a vision-language multi-modal large model with the capability of understanding computer pages and locating page elements.

A "business instruction-flow step" sample is shown in FIG. 4. For example, the "business instruction" is "send Zhang San a message on DingTalk asking him to come to the conference room", and the corresponding "flow steps" are: 1. start the DingTalk application; 2. enter "Zhang San" in the contact search box and press Enter; 3. enter "come to the conference room" in the message box; 4. click the send button.
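One possible way to store such a sample for fine-tuning is sketched below. The class name InstructionSample and the "prompt" and "completion" field names are assumptions made for illustration; the invention does not specify a storage format for the corpus.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionSample:
    """One "business instruction-flow step" sample (hypothetical structure)."""

    instruction: str
    steps: List[str]

    def to_jsonl(self) -> str:
        # Number the steps and serialise the pair as one JSON line.
        target = "\n".join(f"{i}. {step}" for i, step in enumerate(self.steps, start=1))
        record = {"prompt": self.instruction, "completion": target}
        return json.dumps(record, ensure_ascii=False)


sample = InstructionSample(
    instruction="Send Zhang San a message on DingTalk asking him to come to the conference room",
    steps=[
        "Start the DingTalk application",
        'Enter "Zhang San" in the contact search box and press Enter',
        'Enter "come to the conference room" in the message box',
        "Click the send button",
    ],
)
print(sample.to_jsonl())
```

Keeping the steps as a list until serialisation makes it easy to renumber them consistently across the corpus.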

A "page state description and page target description" data sample is shown in FIG. 5. For example, the action description is "click the account input box"; the corresponding page state description is "the page is an account login interface, where an account and password can be entered to log in"; and the corresponding page target description is "account input box <box>(x1, y1), (x2, y2)</box>".
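A page target description in this form has to be converted into coordinates before an action can be executed. The following is a minimal sketch, assuming that the model emits integer pixel coordinates in exactly the textual form shown in the sample; the function names parse_box and click_point are hypothetical.

```python
import re
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

# Matches "<box>(x1, y1), (x2, y2)</box>" with optional whitespace.
BOX_PATTERN = re.compile(
    r"<box>\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*</box>"
)


def parse_box(model_output: str) -> Optional[Box]:
    """Extract the first bounding box from the model output, or None if absent."""
    match = BOX_PATTERN.search(model_output)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(value) for value in match.groups())
    return x1, y1, x2, y2


def click_point(box: Box) -> Tuple[int, int]:
    """Use the centre of the box as the point to click."""
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2


box = parse_box("account input box <box>(120, 80), (360, 110)</box>")
print(box, click_point(box) if box else None)
```

Returning None when no box is present gives the caller a direct signal that positioning failed and that the page state description should be fed back instead.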

Finally, these models are combined with the action execution capability of the RPA system to complete the construction of the RPA flow execution agent system based on page state understanding and large model driving.

2. As shown in FIG. 6, the business description "search for men's coats on Taobao; if login is required, the login account is 123 and the password is abc" is input into the RPA agent system, which operates according to the following steps:

2.1. First, the business process disassembly module produces the action sequence "open the Taobao website -> click the search box -> enter "men's coat" -> click "search"".

2.2. As shown in FIG. 7, for the action "open the Taobao website": the page state understanding module receives the action description "open the Taobao website" and the page screenshot, and outputs <box>(x1, y1), (x2, y2)</box> (that is, the coordinates of the Chrome icon) through the Kosmos-2-based vision-language multi-modal large model for page state understanding and target positioning. As shown in FIG. 8, the action execution module invokes the "open web page" component, which clicks the Chrome icon (that is, moves the mouse to <box>(x1, y1), (x2, y2)</box> and double-clicks) and then enters the web address (that is, types "www.taobao.com" and presses Enter), thereby opening the Taobao website in the Chrome browser.

2.3. Step 2.2 is repeated to execute the remaining actions.

3. The business description "search for men's coats on Jingdong; if login is required, the login account is 123 and the password is abc" is input into the RPA agent system, which operates according to the following steps:

3.1. First, the business process disassembly module produces the action sequence "open the Jingdong website -> click the login button -> enter account 123 and click login -> click the search box -> enter "men's coat" -> click "search"". It can be seen that the large language model in the business process disassembly module automatically includes the actions related to account login when it recognizes that the search is to be carried out on the Jingdong website.

3.2. The whole automated process is then executed in the same manner as in 2.2.

4. The business description "search for men's coats on Pinduoduo; if login is required, the login account is 123 and the password is abc" is input into the RPA agent system, which operates according to the following steps:

4.1. First, the business process disassembly module produces the action sequence "open the Pinduoduo website -> click the login button -> enter account 123 and click login -> click the search box -> enter "men's coat" -> click "search"". It can be seen that the large language model in the business process disassembly module automatically includes the actions related to account login when it recognizes that the search is to be carried out on the Pinduoduo website.

4.2. After the actions up to and including clicking login are executed in the manner described in 2.2, the website pops up a slider verification code page, as shown in FIG. 9, so the next action, "click the search box", fails. The page state understanding and target positioning module feeds back to the large language model the information "target element positioning failed; the page is a verification code page and the verification code type is a slider verification code". The large language model then adjusts the subsequent actions to "slider verification code verification -> click the search box -> enter "men's coat" -> click "search"".
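The feedback and adjustment in this step can be sketched as follows. The class PositioningFeedback, its to_prompt method and the prompt wording are hypothetical, and adjust_plan_stub is only a rule-based stand-in for the fine-tuned large language model, which in the invention performs the adjustment by reasoning over the feedback.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PositioningFeedback:
    """Feedback sent to the disassembly model when positioning fails (hypothetical structure)."""

    failed_action: str
    page_description: str

    def to_prompt(self, remaining: List[str]) -> str:
        # Assemble a text prompt from the failure, the page state and the remaining plan.
        plan = " -> ".join(remaining)
        return (
            f"Positioning failed for action: {self.failed_action}\n"
            f"Page state: {self.page_description}\n"
            f"Remaining plan: {plan}\n"
            "Return the adjusted plan."
        )


def adjust_plan_stub(remaining: List[str], feedback: PositioningFeedback) -> List[str]:
    """Rule-based stand-in for the LLM: handle only the slider verification code case."""
    if "slider verification code" in feedback.page_description:
        return ["slider verification code verification"] + list(remaining)
    return list(remaining)


remaining = ["click the search box", 'enter "men\'s coat"', 'click "search"']
feedback = PositioningFeedback(
    failed_action="click the search box",
    page_description="the page is a verification code page; "
    "the verification code type is a slider verification code",
)
print(" -> ".join(adjust_plan_stub(remaining, feedback)))
```

In practice the text produced by to_prompt would be sent to the disassembly model and its reply parsed back into a list of steps; the stub only shows the shape of the input and output.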

4.3. The action "slider verification code verification" is executed: the action execution module calls the slider verification code component to complete the verification and enter the search page. The remaining automated operations are then completed in the same manner as in 2.2.

5. The above steps show a construction example of the RPA agent system. Taking the task "search for men's coats on Taobao, Jingdong and Pinduoduo" as an example, they demonstrate the system's adaptive instruction disassembly capability based on scene understanding, its element positioning capability based on page understanding, and its action sequence adjustment capability based on page state understanding.

The invention creatively provides an RPA flow execution system based on page state understanding and large model driving. The large language model, the multi-modal large model and the RPA system are deeply integrated, so that the RPA agent can complete process disassembly, page identification and action execution when the user merely describes his or her own business requirements in detail in natural language. Moreover, relying on the strong understanding and reasoning capability of the large language model, the established feedback mechanism for execution state and page state enables the system to dynamically adjust the disassembled execution plan according to the page state and the execution target. The invention can therefore greatly lower the threshold for RPA users, greatly improve the usability and convenience of the RPA system, cope with abnormal conditions during flow execution, and greatly improve the stability of flow execution.

The innovation points of the invention are as follows:

1. By deeply integrating the capabilities of large model technology (including a large language model and a multi-modal large model) with the RPA system, the invention greatly lowers the threshold for RPA users and greatly improves the usability and stability of the RPA system.

2. The invention creatively provides a large language model for "business process instruction disassembly" and a vision-language multi-modal large model for "computer page understanding and target positioning", which effectively support the task planning and page recognition capabilities of the RPA agent system.

3. The invention creatively provides an RPA flow execution feedback mechanism based on page state understanding, so that the RPA system can monitor the flow execution state in real time on the basis of the planned execution steps and feed back the current page state, thereby adaptively adjusting subsequent execution actions and greatly enhancing the robustness of the RPA system to interference.

The foregoing is only illustrative of the preferred embodiments and principles of the present invention, and changes in specific embodiments will occur to those skilled in the art upon consideration of the teachings provided herein, and such changes are intended to be included within the scope of the invention as defined by the claims.

Claims

1. The RPA flow execution system based on page state understanding and large model driving is characterized by comprising:

2. The RPA flow execution system based on page state understanding and large model driving according to claim 1, wherein the business process disassembly module comprises a large language model; the large language model is obtained by training a general large language model on a corpus containing data samples of business instructions and flow steps.

3. The page state understanding and large model driven RPA flow execution system of claim 2, wherein the generic large language model comprises ChatGPT.

4. The RPA flow execution system based on page state understanding and large model driving according to claim 1, wherein the page state understanding and target positioning module comprises a vision-language multi-modal large model for understanding and identifying computer pages; the vision-language multi-modal large model is obtained by training a basic multi-modal large model on a plurality of image data samples of web pages and application software interfaces;

5. The RPA flow execution system based on page state understanding and large model driving according to claim 4, wherein the multi-modal large model comprises the Kosmos-2 model and the Qwen-VL model.

6. The RPA process execution system based on page status understanding and large model driving according to claim 2, wherein in the page status understanding and target positioning module, the process of describing page contents and positioning target elements specifically comprises:

7. The RPA flow execution system based on page state understanding and large model driven according to claim 1, wherein in the action execution module, the execution of the corresponding operation actions by component call includes mouse click and keyboard input.

8. The RPA flow execution method based on page state understanding and large model driving is applied to the RPA flow execution system based on page state understanding and large model driving described in any one of claims 1-7, and is characterized by comprising the following steps of;