CN112182140A

CN112182140A - Information input method and device combining RPA and AI, computer equipment and medium

Info

Publication number: CN112182140A
Application number: CN202010825399.2A
Authority: CN
Inventors: 胡一川; 汪冠春; 褚瑞; 李玮; 唐梓毅
Original assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Current assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Priority date: 2020-08-17
Filing date: 2020-08-17
Publication date: 2021-01-05
Anticipated expiration: 2040-08-17
Also published as: CN112182140B

Abstract

The embodiments of the application disclose an information input method, device, equipment and medium combining RPA and AI. The method comprises the following steps: the RPA system acquires a character recognition result of a contract picture to be processed; performs word segmentation on the company name information in the character recognition result to obtain a word segmentation sequence of the company name information; divides the word segmentation sequence into a plurality of segments; performs retrieval in combination with a preset inverted index to obtain a plurality of candidate company name entries, each matching at least two participles of a segment; acquires, from the candidate company name entries, the target company name information most similar to the company name information; and automatically enters contract information according to the target company name information and the company address information corresponding to the target company name information. Automatic information entry is thus realized based on RPA and AI technologies, which can greatly reduce labor cost, effectively improve information entry efficiency, and complete information entry accurately and quickly.

Description

Information input method and device combining RPA and AI, computer equipment and medium

Technical Field

The present application relates to the technical field of Artificial Intelligence, and in particular, to an information input method, an apparatus, a computer device, and a medium combining RPA (Robotic Process Automation) and AI (Artificial Intelligence).

Background

Robotic Process Automation (RPA) simulates human operation on a computer through specific robot software and automatically executes process tasks according to rules. Artificial Intelligence (AI) is a technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence.

At present, in contract text management scenarios, the related art generally scans a contract image using character recognition technology, manually checks the contract information extracted from the image (such as company name information) against the contract image, and then enters the correct company name information into a contract information entry table, so as to realize electronic management of the contract information. However, on the one hand, manual entry and confirmation involve a heavy workload, and confirmation errors may occur, making the entered information inaccurate; on the other hand, the labor cost is high and the information management efficiency is low.

Disclosure of Invention

The embodiments of the application disclose an information input method, an information input device, computer equipment and a medium combining RPA and AI, so as to reduce the trouble of manual entry and improve information entry efficiency.

In a first aspect, an embodiment of the present application discloses an information entry method combining RPA and AI, including: an RPA system obtains a character recognition result of a contract picture to be processed, wherein the character recognition result comprises company name information; the RPA system performs word segmentation processing on the company name information to obtain a word segmentation sequence of the company name information; the RPA system divides every N adjacent participles in the participle sequence into one segment to obtain a plurality of segments of participles of the company name information, wherein N is an integer greater than 1; the RPA system retrieves each segment of participles based on a preset inverted index library to obtain a first candidate company name entry set for each segment of participles, wherein the first candidate company name entry set comprises a plurality of first candidate company name entries, each of which matches at least two participles of the corresponding segment of participles; the RPA system acquires, according to the plurality of first candidate company name entries, the target company name information with the highest text similarity to the company name information; and the RPA system automatically enters information into the contract information entry table corresponding to the contract picture to be processed according to the target company name information and the company address information corresponding to the target company name information.

In an embodiment of the present application, the obtaining, by the RPA system, target company name information with a highest text similarity to the company name information according to a plurality of first candidate company name entries includes:

the RPA system determines a second candidate company name entry set of the company name information according to the plurality of first candidate company name entries of each segment of participles, wherein the second candidate company name entry set comprises at least one second candidate company name entry, and the second candidate company name entry has the largest number of participles matching the word segmentation sequence; and the RPA system acquires, according to the text similarity between the company name information and each second candidate company name entry, the second candidate company name entry with the highest text similarity as the target company name information.

In an embodiment of the present application, the performing, by the RPA system, word segmentation processing on the company name information to obtain a word segmentation sequence of the company name information includes: the RPA system performs single-character segmentation processing on the company name information to obtain a single-character sequence of the company name information; the RPA system performs word segmentation processing on the company name information to obtain a word dimension sequence of the company name information; the RPA system acquires each phrase in the word dimension sequence; for each phrase, the RPA system deletes the single characters corresponding to the phrase from the single-character sequence to obtain a processed single-character sequence; and the RPA system generates the word segmentation sequence of the company name information according to the phrases and the processed single-character sequence.

In one embodiment of the present application, the determining, by the RPA system, a second candidate company name entry set of the company name information according to the plurality of first candidate company name entries of each segment of participles includes: the RPA system acquires, according to the plurality of first candidate company name entries of each segment of participles, the number of participles of each first candidate company name entry that match participles in the participle sequence; and the RPA system selects, from the plurality of first candidate company name entries, the first candidate company name entries whose number of matched participles meets a preset condition, to generate the second candidate company name entry set of the company name information.

In an embodiment of the application, the acquiring, by the RPA system according to the text similarity between the company name information and each of the second candidate company name entries, a second candidate company name entry with a highest text similarity as the target company name information includes: the RPA system determines, for each of the second candidate company name terms, a minimum edit distance between the company name information and the second candidate company name term; and the RPA system selects a second candidate company name entry with the minimum editing distance from a plurality of second candidate company name entries as the target company name information.

In an embodiment of the application, the selecting, by the RPA system, from the plurality of first candidate company name entries, the first candidate company name entries whose number of matched participles meets a preset condition to generate the second candidate company name entry set of the company name information includes: the RPA system sorts the plurality of first candidate company name entries in descending order of the number of matched participles to obtain a sorting result; and the RPA system selects the top M first candidate company name entries from the sorting result to generate the second candidate company name entry set of the company name information, wherein M is an integer greater than or equal to 1.

In a second aspect, an embodiment of the present application provides an information entry apparatus combining an RPA and an AI, where the apparatus is applied to an RPA system, and the apparatus includes: the first acquisition module is used for acquiring a character recognition result of the contract picture to be processed, wherein the character recognition result comprises company name information; the word segmentation module is used for carrying out word segmentation processing on the company name information to obtain a word segmentation sequence of the company name information; the segmentation module is used for dividing every adjacent N participles in the participle sequence into one segment so as to obtain a plurality of segments of the company name information, wherein N is an integer greater than 1; the retrieval module is used for retrieving each segment of participles based on a preset inverted index library to obtain a first candidate company name entry set of each segment of participles, wherein the first candidate company name entry set comprises a plurality of first candidate company name entries, and at least two participles of the first candidate company name entries are matched with the corresponding segments of the participles; the second acquisition module is used for acquiring target company name information with highest text similarity with the company name information according to the first candidate company name entries; and the information input module is used for automatically inputting information into the contract information table corresponding to the contract picture to be processed according to the target company name information and the company address information corresponding to the target company name information.

In an embodiment of the application, the second obtaining module includes: a determining sub-module, configured to determine a second candidate company name entry set of the company name information according to the plurality of first candidate company name entries of each segment of participles, where the second candidate company name entry set includes at least one second candidate company name entry, and the second candidate company name entry has the largest number of participles matching the participle sequence; and an obtaining sub-module, configured to obtain, according to the text similarity between the company name information and each second candidate company name entry, the second candidate company name entry with the highest text similarity as the target company name information.

In a third aspect, an embodiment of the present application further discloses a computing device, including: a memory storing executable program code; a processor coupled with the memory; the processor calls the executable program code stored in the memory to execute the information entry method combining the RPA and the AI provided by any embodiment of the application.

In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program, where the computer program includes a program for executing the information entry method combining the RPA and the AI provided in any embodiment of the present application.

According to the technical scheme provided by the embodiments, an RPA system obtains a character recognition result of a contract picture to be processed, performs word segmentation on the company name information in the character recognition result to obtain a word segmentation sequence of the company name information, divides the word segmentation sequence into a plurality of segments, and performs retrieval in combination with a preset inverted index to obtain a plurality of candidate company name entries, each matching at least two participles of a segment. The RPA system then obtains, from the candidate company name entries, the target company name information most similar to the company name information in the character recognition result, takes it as the company name of the original contract text in the contract picture, and automatically enters the target company name information and the company address information corresponding to it to complete the contract information. Automatic information entry is thus realized based on RPA and AI technologies, which can greatly reduce labor cost, effectively improve information entry efficiency, and complete information entry accurately and quickly.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

Fig. 1 is a schematic flowchart of an information entry method combining an RPA and an AI according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of another information entry method combining an RPA and an AI according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of another information entry method combining an RPA and an AI according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of information entry combining an RPA and an AI according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an information entry device combining an RPA and an AI according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of another information entry device combining an RPA and an AI according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a computing device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

In the description of the present application, it is to be understood that the term "plurality" means two or more; the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Information entry methods, apparatuses, computer devices, and storage media in conjunction with RPA and AI are described below with reference to specific embodiments.

Fig. 1 is a schematic flowchart of an information entry method combining an RPA and an AI according to an embodiment of the present disclosure, where the information entry method combining an RPA and an AI according to this embodiment may be executed by an RPA system, the RPA system may be implemented by software and/or hardware, the RPA system may be configured in an electronic device or a server, and this embodiment is not particularly limited in this respect. The electronic device may include a hardware device such as a smart phone, a personal computer, and a portable device.

As shown in fig. 1, the method includes:

Step 1, the RPA system acquires a character recognition result of a contract picture to be processed, wherein the character recognition result comprises company name information.

Specifically, when information on the contract picture needs to be entered, the RPA system in this embodiment may automatically obtain the contract picture to be processed and recognize it by using Optical Character Recognition (OCR) technology, so as to obtain the character recognition result of the contract picture to be processed.

Step 2, the RPA system performs word segmentation processing on the company name information to obtain a word segmentation sequence of the company name information.

In an embodiment of the present application, in order to obtain the corresponding company name information accurately while improving the efficiency of subsequent retrieval, the RPA system of this embodiment may perform word segmentation processing on the company name by using mixed-granularity word segmentation, so as to obtain a word segmentation sequence of the company name information. As a possible implementation manner, the RPA system may perform word segmentation processing on the company name information at multiple word segmentation granularities, and determine the word segmentation sequence of the company name information by combining the word segmentation results of each granularity.

The word segmentation granularities may include single-character granularity and word granularity.

As a possible implementation manner, the RPA system may perform the word segmentation processing on the company name information to obtain the word segmentation sequence as follows: the RPA system performs single-character segmentation processing on the company name information to obtain a single-character sequence of the company name information; the RPA system performs word segmentation processing on the company name information to obtain a word dimension sequence of the company name information; the RPA system acquires each phrase in the word dimension sequence; for each phrase, the RPA system deletes the single characters corresponding to the phrase from the single-character sequence to obtain a processed single-character sequence; and the RPA system generates the word segmentation sequence of the company name information according to the phrases and the processed single-character sequence.

For example, suppose character recognition is performed on a picture and the obtained company name information is "Beijing certain network material casting limited company". The single-character sequence corresponding to the company name information lists every character of the name as a separate item. Assume the word dimension sequence corresponding to the company name information is {'Beijing', 'certain', 'network', 'material', 'casting', 'limited company'}, in which 'Beijing', 'network' and 'limited company' are phrases and 'certain', 'material' and 'casting' are single characters. The phrases in the word dimension sequence are obtained, and the single characters that make up those phrases are deleted from the single-character sequence, so that the processed single-character sequence is {'certain', 'material', 'casting'}. The phrases and the processed single-character sequence are then combined to obtain the word segmentation sequence corresponding to the company name information: {'Beijing', 'certain', 'network', 'material', 'casting', 'limited company'}.
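The combination of the two granularities described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `hybrid_segment`, the Chinese company name, and the word-level segmentation passed in as `word_seq` are all assumptions chosen for illustration, and the word-level result is assumed to come from some external word segmenter.

```python
def hybrid_segment(name, word_seq):
    """Combine single-character and word-level segmentation results.

    name:     recognized company name (hypothetical example string).
    word_seq: word-level segmentation of `name`, assumed to be produced
              by an external segmenter; its items must concatenate to `name`.
    """
    if "".join(word_seq) != name:
        raise ValueError("word_seq must be a segmentation of name")

    chars = list(name)                 # single-character sequence
    covered = [False] * len(chars)     # characters that belong to a phrase
    phrases = []                       # (start position, phrase)

    pos = 0
    for word in word_seq:
        if len(word) > 1:              # a multi-character item is a phrase
            phrases.append((pos, word))
            for i in range(pos, pos + len(word)):
                covered[i] = True
        pos += len(word)

    # Processed single-character sequence: characters not covered by a phrase.
    remaining = [(i, c) for i, c in enumerate(chars) if not covered[i]]

    # Combine phrases and remaining characters in their original order.
    return [token for _, token in sorted(phrases + remaining)]
```

With the hypothetical name "北京某网络料投有限公司" and the word-level result ["北京", "某", "网络", "料", "投", "有限公司"], the phrases are "北京", "网络" and "有限公司", and the remaining single characters are "某", "料" and "投".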

As another possible implementation manner, the RPA system may perform the word segmentation processing on the company name information to obtain the word segmentation sequence as follows: the RPA system performs word segmentation processing on the company name information to obtain a word dimension sequence of the company name information, wherein the word dimension sequence comprises single characters and/or phrases; correspondingly, when detecting that the word dimension sequence contains a phrase, the RPA system may acquire the phrase and its occurrence frequency; if the occurrence frequency of the phrase exceeds a preset frequency threshold, the phrase is kept unchanged in the word dimension sequence, and if the occurrence frequency of the phrase is less than the preset frequency threshold, the phrase is split into single characters, so as to obtain the word segmentation sequence of the company name information.

The preset frequency threshold is an occurrence frequency threshold set in advance, and it can be determined by analyzing the occurrence of a large number of phrases in a company name entry library.

In this embodiment, the low-frequency phrases in the word dimension sequence of the company name information are further split, and the high-frequency phrases are kept unchanged, so that the retrieval efficiency can be improved while ensuring that the expected company name entry is accurately recalled.

For example, suppose character recognition is performed on a picture and the obtained company name information is "Beijing certain network material casting limited company", and assume the word dimension sequence corresponding to the company name information is {'Beijing', 'certain', 'network', 'material casting', 'limited company'}. Frequency analysis is performed on the phrases in the word dimension sequence. If the analysis shows that the occurrence frequencies of 'Beijing', 'network' and 'limited company' are all greater than the preset frequency threshold and only the occurrence frequency of the phrase 'material casting' is less than the preset frequency threshold, the phrase 'material casting' in the word dimension sequence is split into single characters, which gives the final word segmentation sequence {'Beijing', 'certain', 'network', 'material', 'casting', 'limited company'} corresponding to the company name information.
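A minimal sketch of this frequency-based splitting is given below. The function name `split_low_frequency`, the frequency table, the threshold value and the Chinese tokens are hypothetical and serve only as illustration. The text does not say what happens when the frequency equals the threshold, so the sketch assumes such phrases are kept.

```python
def split_low_frequency(word_seq, phrase_freq, threshold):
    """Split low-frequency phrases into single characters.

    word_seq:    word dimension sequence (single characters and/or phrases).
    phrase_freq: hypothetical mapping from phrase to its occurrence
                 frequency in a company name entry library.
    threshold:   preset frequency threshold (assumed value).
    """
    tokens = []
    for word in word_seq:
        is_phrase = len(word) > 1
        # Assumption: a phrase whose frequency equals the threshold is kept.
        if is_phrase and phrase_freq.get(word, 0) < threshold:
            tokens.extend(word)        # split the phrase into characters
        else:
            tokens.append(word)        # keep high-frequency phrases and single characters
    return tokens
```

For instance, with the hypothetical sequence ["北京", "某", "网络", "料投", "有限公司"] and a frequency table in which only "料投" falls below the threshold, only that phrase is split.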

Step 3, the RPA system divides every N adjacent participles in the participle sequence into one segment to obtain multiple segments of participles of the company name information, wherein N is an integer greater than 1.

N is preset, for example, N may be 2, 3, or 4, and in actual application, the size of N may be set according to actual service requirements.

For example, suppose character recognition is performed on a picture, the word segmentation sequence corresponding to the obtained company name information is {'can', 'Jing', 'running', 'shadow', 'network', 'material', 'casting', 'limited company'}, and N is 4. The word segmentation sequence can then be divided into two segments: {'can', 'Jing', 'running', 'shadow'} and {'network', 'material', 'casting', 'limited company'}.
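The segmentation step can be sketched as below. The function name `split_into_segments` is hypothetical. The sketch assumes non-overlapping segments, which is consistent with the example above, and assumes that a shorter trailing segment is kept when the sequence length is not a multiple of N; the text does not specify that case.

```python
def split_into_segments(tokens, n):
    """Divide every n adjacent participles into one segment."""
    if n <= 1:
        raise ValueError("n must be an integer greater than 1")
    # Assumption: segments do not overlap; a shorter last segment is kept.
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


tokens = ["can", "Jing", "running", "shadow",
          "network", "material", "casting", "limited company"]
segments = split_into_segments(tokens, 4)
```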

Step 4, the RPA system retrieves each segment of participles based on a preset inverted index library to obtain a first candidate company name entry set for each segment of participles, wherein the first candidate company name entry set comprises a plurality of first candidate company name entries, each of which matches at least two participles of the corresponding segment of participles.

The inverted index library stores the association relationship between each participle (token) and the identification of each company name entry.

The identifier of a company name entry is used to identify that entry, and different company name entries correspond to different identifiers.

It is understood that, in order to improve the accuracy of retrieval, the inverted index library may be updated. As an example, the inverted index library may be updated at preset intervals based on the updated address information base, for example every three months or every half year. The address information base stores the association between the entries of companies and institutions (i.e. the company name entries) and the company addresses.

In an embodiment of the present application, in order to balance recall efficiency and memory usage, the inverted index library in this embodiment may store its data using the RoaringBitmap data structure.

RoaringBitmap is a compressed bitmap data structure that can effectively improve the memory efficiency of bitmaps.
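The retrieval in step 4 can be sketched as follows, under stated assumptions. The function names, the entry identifiers and the example entries are hypothetical. Plain Python sets stand in for the RoaringBitmap posting lists, so the sketch shows the lookup logic only and none of the compression.

```python
from collections import Counter, defaultdict


def build_inverted_index(entries):
    """Map each participle (token) to the identifiers of the entries containing it.

    entries: hypothetical mapping from entry identifier to its token list.
    """
    index = defaultdict(set)           # a set stands in for a compressed bitmap
    for entry_id, tokens in entries.items():
        for token in tokens:
            index[token].add(entry_id)
    return index


def retrieve(segment, index, min_hits=2):
    """Return {entry_id: hit_count} for entries matching at least
    `min_hits` distinct participles of the segment."""
    hits = Counter()
    for token in set(segment):
        for entry_id in index.get(token, ()):
            hits[entry_id] += 1
    return {entry_id: n for entry_id, n in hits.items() if n >= min_hits}


entries = {
    1: ["Beijing", "certain", "network", "material", "casting", "limited company"],
    2: ["Shanghai", "network", "technology", "limited company"],
    3: ["Beijing", "trade", "center"],
}
index = build_inverted_index(entries)
candidates = retrieve(["network", "material", "casting", "limited company"], index)
```

Here entry 1 matches four participles of the segment and entry 2 matches two, so both are candidates, while entry 3 matches none and is not returned.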

Step 5, the RPA system acquires the target company name information with the highest text similarity to the company name information according to the plurality of first candidate company name entries.

Step 6, the RPA system automatically enters information into the contract information entry table corresponding to the contract picture to be processed according to the target company name information and the company address information corresponding to the target company name information.

Specifically, after acquiring the target company name information, the RPA system may obtain the company address information matching the target company name information from a national company and institution information base, and then automatically enter the target company name information and the corresponding company address information into the contract information entry table.

It can be understood that although OCR technology is mature, it cannot guarantee 100% recognition accuracy. In order to accurately enter the company name information of the original contract text in the contract picture into the contract information entry table, the RPA system of this embodiment, after obtaining the company name information in the character recognition result, does not directly enter that company name information. Instead, it performs word segmentation on the company name information, divides the resulting participle sequence into segments, and retrieves each segment against the preset inverted index library to obtain a plurality of candidate company name entries, each matching at least two participles of the corresponding segment. It then selects, from the candidate company name entries, the target company name information most similar to the company name information in the character recognition result, and automatically enters information into the contract information entry table corresponding to the contract picture to be processed according to the target company name information and the company address information corresponding to the target company name information. Because the target company name information is used as the company name information of the original contract text, entry errors caused by character recognition errors can be avoided. Meanwhile, no manual operation is needed during entry, so the labor cost can be greatly reduced, the information entry efficiency is effectively improved, and information entry can be completed accurately and quickly.

According to the information input method combining RPA and AI of this embodiment, the RPA system obtains a character recognition result of a contract picture to be processed, performs word segmentation on the company name information in the character recognition result to obtain a word segmentation sequence of the company name information, divides the word segmentation sequence into a plurality of segments, and performs retrieval in combination with a preset inverted index to obtain a plurality of candidate company name entries, each matching at least two participles of a segment. The RPA system then obtains, from the candidate company name entries, the target company name information most similar to the company name information in the character recognition result, takes it as the company name of the original contract text in the contract picture, and automatically enters the target company name information and the company address information corresponding to it. Automatic information entry is thus realized based on RPA and AI technologies, which can greatly reduce labor cost, effectively improve information entry efficiency, and complete information entry accurately and quickly.

In an embodiment of the present application, there are usually many candidate company name entries that match at least two participles. In order to quickly find the target company name information that meets the requirement while reducing the amount of computation, as shown in Fig. 2, the obtaining, by the RPA system, of the target company name information with the highest text similarity to the company name information according to the plurality of first candidate company name entries may include:

Step 51, the RPA system determines a second candidate company name entry set of the company name information according to the plurality of first candidate company name entries of each segment of participles, wherein the second candidate company name entry set comprises at least one second candidate company name entry, and the second candidate company name entry has the largest number of participles matching the participle sequence.

In this embodiment, one possible implementation manner of the RPA system determining the second candidate company name entry set of the company name information according to the plurality of first candidate company name entries of each segment of participles may be: the RPA system acquires, according to the plurality of first candidate company name entries of each segment of participles, the number of participles of each first candidate company name entry that match participles in the participle sequence; the RPA system then selects, from the plurality of first candidate company name entries, the first candidate company name entries whose number of matched participles meets a preset condition, to generate the second candidate company name entry set of the company name information.

The preset condition is set in advance. For example, the preset condition may be that the entry ranks in the top M when the entries are sorted in descending order of the number of matched participles, or that the number of matched participles is the largest, or that the number of matched participles exceeds a preset threshold, and so on.

As an exemplary embodiment, the RPA system may select the first candidate company name entries whose number of matched participles meets the preset condition and generate the second candidate company name entry set as follows: the RPA system sorts the first candidate company name entries in descending order of the number of matched participles to obtain a sorting result, selects the top M candidate company name entries from the sorting result, and generates the second candidate company name entry set based on the selected candidate company name entries.

Where M is preset, for example, M may be 100, 200, or the like, and this embodiment is not particularly limited thereto.

For example, assume that segmenting the participle sequence yields k segments. An inverted-index lookup may be performed for each segment to obtain the candidate company name entries that match at least two participles of that segment, and then, from the first candidate company name entry sets of all segments, the first candidate company name entries hitting 2 to 2k participles (tokens) may be obtained. Further, the first candidate company name entries hitting 3 or 4 tokens in any one segment, and those hitting 6 to 8 tokens in any two adjacent segments, can be calculated, so that first candidate company name entries hitting 2 to 2k + 4 tokens are obtained. The first candidate company name entries are then sorted by the number of matched participles from large to small, and the top 200 are selected to form the second candidate company name entry set.
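The coarse selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_top_m`, the toy index, and the tie-breaking by entry id are hypothetical, and plain Python sets stand in for whatever set structure an implementation actually uses.

```python
from collections import Counter

def select_top_m(query_tokens, token_to_ids, recalled_ids, m=200):
    """Rank recalled entry ids by how many query tokens they match; keep the top m."""
    hits = Counter()
    for token in set(query_tokens):
        # Only count ids that survived the per-segment recall.
        for entry_id in token_to_ids.get(token, set()) & recalled_ids:
            hits[entry_id] += 1
    # Sort by matched-token count (descending); break ties by id for determinism.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entry_id for entry_id, _ in ranked[:m]]

# Toy index: token -> ids of entries containing it (all values are made up).
toy_index = {"a": {1, 2, 3}, "b": {1, 2}, "c": {1}, "d": {3}}
print(select_top_m(["a", "b", "c", "d"], toy_index, recalled_ids={1, 2, 3}, m=2))  # [1, 2]
```

Entry 1 matches three tokens and entries 2 and 3 match two each, so with M = 2 the sketch keeps entries 1 and 2.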

Step 52: the RPA system acquires the second candidate company name entry with the highest text similarity as the target company name information according to the text similarity between the company name information and each second candidate company name entry.

In this embodiment, after obtaining the second candidate company name entries, the RPA system performs a more complex calculation between each second candidate company name entry and the company name information, taking into account whether the participles are consecutive in word order, and further ranks the second candidate company name entries to obtain the company name information with the highest matching degree.

Based on the above embodiment, in order to obtain the matched target company name information accurately while reducing the amount of calculation, the target company name information most similar to the company name information may be selected from the second candidate company name entries according to the minimum edit distance between each second candidate company name entry and the company name information, which is further described below with reference to fig. 3. As shown in fig. 3, the step in which the RPA system obtains, according to the text similarity between the company name information and each second candidate company name entry, the second candidate company name entry with the highest text similarity as the target company name information includes:

Step 521: the RPA system determines, for each second candidate company name entry, a minimum edit distance between the company name information and the second candidate company name entry.

In this embodiment, the minimum edit distance between the company name information and the second candidate company name entry may be calculated by a preset edit distance algorithm. It should be noted that, by using the edit distance algorithm, the calculation of the minimum edit distance between the company name information and the second candidate company name entry can be completed within several milliseconds.

Step 522: the RPA system selects the second candidate company name entry with the smallest minimum edit distance from the plurality of second candidate company name entries as the target company name information.

Specifically, the RPA system sorts the second candidate company name entries according to the sequence of the minimum edit distance from small to large to generate a sorting result; and the RPA system selects a second candidate company name entry ranked at the first position from the ranking result as the target company name information.
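As an illustration of steps 521 and 522, the following Python sketch computes the classic Levenshtein distance with unit costs and picks the candidate with the smallest distance. The application only says that a preset edit distance algorithm is used, so the specific cost model, the function names, and the sample strings here are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]

def pick_target(query: str, candidates: list[str]) -> str:
    """Return the candidate closest to the query; ties keep the earliest candidate."""
    return min(candidates, key=lambda c: edit_distance(query, c))

# Hypothetical OCR output and candidate entries.
print(edit_distance("kitten", "sitting"))  # 3
print(pick_target("acme netwerk ltd", ["acme foods ltd", "acme network ltd"]))  # acme network ltd
```

Taking the minimum directly is equivalent to sorting by distance from small to large and selecting the first entry.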

In order to make the information entry method combining the RPA and the AI of the present application clearly understood by those skilled in the art, the information entry method combining the RPA and the AI of the present embodiment is further described below with reference to fig. 4. It should be noted that fig. 4 is a schematic flow chart illustrating information entry in combination with the RPA and the AI, and fig. 4 exemplifies that the obtained company name information is "can be jingbing projection limited company" by performing character recognition on a common picture.

As can be seen from fig. 4, the present embodiment is mainly divided into two parts, an offline part and an online part.

The offline part is mainly divided into 3 steps:

1. Construct an address information base.

The address information base stores the corresponding relation between the entries of the company institutions and the company address information.

2. Perform mixed-granularity word segmentation on the entries of company organizations in the address information base. This scheme mainly uses character-granularity segmentation, and uses word granularity for some high-frequency words in the entries, which reduces the amount of computation needed to match very frequent words. For example, if the word "limited company" appears in 3 million (300W) entries, many set operations can be avoided when it is hit as a single word, and the search efficiency is improved.

In this embodiment, a high-frequency word may itself be misrecognized in the character recognition result. To ensure that the index can still be hit by the correctly recognized characters in that case, an index may also be constructed for each individual character of the high-frequency word. For example, when the index is constructed, the 4 individual characters that make up "limited company" are also added to the index, so that the index can be hit by the corresponding characters when the word as a whole is misrecognized.

3. Establish an inverted index.

For each participle, the identification information (ids) of the company organization entries that contain it is stored as a set, and the inverted index is constructed from these sets. Common data structures for this purpose include linked lists, sets, and bitmaps; this scheme uses RoaringBitmap, a data structure that balances recall efficiency and memory usage. It should be noted that, in order to improve recall efficiency, word order information is also considered: combinations of 2 adjacent participles are also added when the index is built.
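The three offline steps can be illustrated with a minimal Python sketch. It is not the scheme's actual implementation: built-in `set` objects stand in for RoaringBitmap, the token "LTD" is a made-up stand-in for a high-frequency word such as "limited company", and the function name and sample entries are hypothetical.

```python
from collections import defaultdict

# Hypothetical high-frequency vocabulary; "LTD" stands in for a frequent multi-character word.
HIGH_FREQ_WORDS = {"LTD"}

def build_inverted_index(entries):
    """entries maps an entry id to its mixed-granularity token list.

    Returns a mapping from a token, or a tuple of two adjacent tokens,
    to the set of entry ids that contain it.
    """
    index = defaultdict(set)
    for entry_id, tokens in entries.items():
        for token in tokens:
            index[token].add(entry_id)
            if token in HIGH_FREQ_WORDS:
                # Also index each character so a partly misrecognized word can still hit.
                for char in token:
                    index[char].add(entry_id)
        # Index every pair of adjacent tokens to keep some word-order information.
        for left, right in zip(tokens, tokens[1:]):
            index[(left, right)].add(entry_id)
    return index

entries = {1: ["a", "b", "LTD"], 2: ["b", "c", "LTD"]}
index = build_inverted_index(entries)
print(sorted(index["LTD"]), sorted(index["T"]), sorted(index[("a", "b")]))  # [1, 2] [1, 2] [1]
```

A production index over millions of entries would replace the Python sets with compressed bitmaps, which is the role RoaringBitmap plays in the scheme.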

The on-line part mainly comprises 6 steps:

1. Mixed-granularity word segmentation:

Online word segmentation also uses mixed granularity, but the individual characters of a matched word are not added. For example, if the word "Beijing" is hit, its individual characters are not searched separately.
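One simple way to obtain such a mixed-granularity token sequence is a greedy longest-match against a vocabulary of high-frequency words, falling back to single characters elsewhere. The Python sketch below assumes that approach; the function name, the vocabulary, and the sample string are hypothetical, and the application does not prescribe this particular matching rule.

```python
def segment_mixed(text, vocab):
    """Emit a vocabulary word where one starts at the current position,
    otherwise emit a single character. Characters of a matched word are
    not emitted separately, as in the online step."""
    # Try longer words first; ignore empty strings so the loop always advances.
    words = sorted((w for w in vocab if w), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        tokens.append(match if match else text[i])
        i += len(match) if match else 1
    return tokens

print(segment_mixed("abNETcLTD", {"NET", "LTD"}))  # ['a', 'b', 'NET', 'c', 'LTD']
```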

2. Query candidates through the inverted index:

Using the tokens obtained from word segmentation, look up the id set corresponding to each token in the index.

3. Segmented fuzzy-matching recall:

The purpose of this step is to quickly find candidates that meet the requirements, typically on the scale of tens of thousands.

This step is in fact a core part of the whole scheme. It must quickly screen out the address entries that match the query well while keeping the recall rate high enough. If the matching condition is too strict, too few entries are recalled and the correct result is easily missed; if the matching condition is too loose, too many candidates remain and the amount of computation is not reduced.

The index recall strategy adopted by the scheme is as follows:

1. The query is segmented, with every 4 adjacent tokens forming one segment; assume this yields k segments.

For example: 'Nengben network Material feeding Limited' is divided into two sections after word segmentation:

[ 'energy', 'Beijing', 'running', 'shadow' ] [ 'network', 'material', 'projection', 'Limited' ]

2. Perform set operations to obtain the entries that hit at least 2 tokens in any segment.

For example, for the first segment, the corresponding sets C_energy, C_Beijing, C_running, and C_shadow are looked up from the inverted index.

Through the set operation:

(C_energy ∩ C_Beijing) ∪ (C_energy ∩ C_running) ∪ (C_energy ∩ C_shadow) ∪ (C_Beijing ∩ C_running) ∪ (C_Beijing ∩ C_shadow) ∪ (C_running ∩ C_shadow)

Entries that hit at least 2 tokens can be obtained.

With this strategy, on the one hand, the correct entry can still be recalled even if 2 characters of a 4-character phrase are misrecognized by OCR; on the other hand, the number of entries hitting 2 adjacent tokens is far smaller than the number hitting only 1 token, which achieves a balance between recall and precision.

Thanks to the excellent performance of RoaringBitmap, a single set operation takes only a few microseconds, so the candidate set can be reduced rapidly (typically from hundreds of thousands to tens of thousands of candidates within 1 millisecond).

In addition, since the offline indexing stage stores two-token combination indexes such as C_energy ∩ C_Beijing in advance, the actual computation time is further shortened.
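The recall strategy of this step amounts to a union of pairwise intersections within each segment. The following Python sketch shows that computation under simplifying assumptions: built-in sets replace RoaringBitmap, the precomputed two-token index is not used, and the function names and toy index are hypothetical.

```python
from itertools import combinations

def split_segments(tokens, n=4):
    """Group every n adjacent tokens into one segment (the last may be shorter)."""
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]

def recall_segment(segment, index):
    """Ids of entries that contain at least two distinct tokens of the segment."""
    unique = list(dict.fromkeys(segment))  # drop repeated tokens, keep order
    recalled = set()
    for left, right in combinations(unique, 2):
        # Intersection: entries containing both tokens; union over all pairs.
        recalled |= index.get(left, set()) & index.get(right, set())
    return recalled

# Toy inverted index: token -> entry ids (values are made up).
toy_index = {"a": {1, 2}, "b": {2, 3}, "c": {3}, "d": {4}}
for seg in split_segments(["a", "b", "c", "d", "e"]):
    print(seg, sorted(recall_segment(seg, toy_index)))
# ['a', 'b', 'c', 'd'] [2, 3]
# ['e'] []
```

For a 4-token segment this evaluates 6 intersections, the same six terms as in the set expression above.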

4. Coarse sorting according to the number of matched tokens:

In this step, coarse sorting of the candidate set is completed by means of set operations, and the candidates are rapidly reduced from tens of thousands to hundreds.

The strategy is as follows:

1. The previous step found the entries for which at least one segment hits 2 or more tokens. Next, for each segment, the entries hitting 2 or more tokens in that segment are found, which yields the entries hitting 2 to 2k tokens.

2. Further, the sets of entries hitting 3 or 4 tokens in any one segment, and 6 to 8 tokens in any two adjacent segments, can be calculated, so that entries hitting 2 to 2k + 4 tokens are obtained.

3. Finally, the candidate company name entries are sorted by the number of hit participles from large to small, and the top N (TopN, generally 200) entries are selected for the next matching step.

5. Query the text corresponding to the TopN ids:

All steps before this one use only the id sets of the entries in the index. The purpose of this step is to look up the text and other information corresponding to the candidate ids from the database for further sorting.

The advantage is that the text of all entries does not need to be loaded into memory; only the few hundred specified samples are read when needed.

6. Fine ranking:

The preceding coarse recall stage only finds candidates that are likely to share more tokens with the target text, without considering whether the tokens are consecutive in word order. Therefore, a more complex calculation needs to be performed between the candidate texts and the query, and the candidates are further ranked to obtain the target company name information with the highest matching degree.

For this application scenario there are several candidate algorithms for this step. This scheme uses an edit distance algorithm, with which the calculation for 200 candidates can be completed within a few milliseconds.

7. Match the company address information corresponding to the target company name information, and then enter the contract information according to the target company name information and the matched company address information.

In order to implement the above embodiments, the present application further provides an information entry device combining an RPA and an AI.

Fig. 5 is a schematic structural diagram of an information entry device combining an RPA and an AI according to an embodiment of the present application.

As shown in fig. 5, the information entry device 10 combining RPA and AI can be applied to an RPA system, and the device 10 includes:

the first obtaining module 100 is configured to obtain a text recognition result of the contract picture to be processed, where the text recognition result includes company name information.

And the word segmentation module 200 is configured to perform word segmentation processing on the company name information to obtain a word segmentation sequence of the company name information.

The segmenting module 300 is configured to divide every adjacent N segmented words in the segmented word sequence into one segment to obtain multiple segments of segmented words of the company name information, where N is an integer greater than 1.

The retrieval module 400 is configured to retrieve each segment of participles based on a preset inverted index library to obtain a first candidate company name entry set of each segment of participles, where the first candidate company name entry set includes a plurality of first candidate company name entries, and at least two participles of the first candidate company name entry and the corresponding segment of participles are matched.

The second obtaining module 500 is configured to obtain, according to the multiple first candidate company name entries, target company name information with a highest text similarity to the company name information.

And the information entry module 600 is configured to perform automatic information entry on the contract information table corresponding to the contract picture to be processed according to the target company name information and the company address information corresponding to the target company name information.

In an embodiment of the present application, based on the embodiment of the apparatus shown in fig. 5, as shown in fig. 6, the second obtaining module 500 includes:

The determining sub-module 510 is configured to determine a second candidate company name entry set of the company name information according to a plurality of first candidate company name entries of each segment of participles, where the second candidate company name entry set includes at least one second candidate company name entry, and the second candidate company name entries are those that match the largest number of participles in the participle sequence.

The obtaining sub-module 520 is configured to obtain, according to the text similarity between the company name information and each second candidate company name entry, the second candidate company name entry with the highest text similarity as the target company name information.

In an embodiment of the present application, the word segmentation module 200 is specifically configured to: perform single-word segmentation processing on the company name information to obtain a single-word sequence of the company name information; perform word segmentation processing on the company name information to obtain a word-dimension sequence of the company name information; acquire each phrase in the word-dimension sequence; for each phrase, delete the single words corresponding to that phrase from the single-word sequence to obtain a processed single-word sequence; and generate the participle sequence of the company name information according to the phrases and the processed single-word sequence.

In an embodiment of the present application, the determining sub-module 510 is specifically configured to: obtaining the number of participles of each first candidate company name entry matched with the participles in the participle sequence according to the plurality of first candidate company name entries of each participle; and selecting the first candidate company name entry with the segmentation quantity meeting the preset condition from the plurality of first candidate company name entries to generate a second candidate company name entry of the company name information.

In an embodiment of the present application, the obtaining sub-module 520 is specifically configured to: determining a minimum edit distance between the company name information and the second candidate company name entry for each second candidate company name entry; selecting a second candidate company name entry having a smallest minimum edit distance from the plurality of second candidate company name entries as the target company name information.

In one embodiment of the present application, selecting a first candidate company name entry whose number of matched participles satisfies a preset condition from a plurality of first candidate company name entries to generate a second candidate company name entry of the company name information includes: sorting the plurality of first candidate company name entries by the number of matched participles from large to small to obtain a sorting result; and selecting the top M first candidate company name entries from the sorting result to generate the second candidate company name entries of the company name information, wherein M is an integer greater than or equal to 1.

It should be noted that the foregoing explanation on the method embodiment is also applicable to the apparatus of this embodiment, and is not repeated herein.

The information entry device of this application embodiment combines RPA and AI.

In order to implement the above embodiments, the present application also provides a computer device.

FIG. 7 is a block diagram of a computer device according to one embodiment of the present application. As shown in fig. 7, the computer device includes a memory 21, a processor 22, and a computer program stored on the memory 21 and executable on the processor 22.

The processor 22, when executing the program, implements the information entry method in conjunction with the RPA and the AI provided in the above-described embodiments.

Further, the computer device further comprises:

a communication interface 23 for communication between the memory 21 and the processor 22.

A memory 21 for storing a computer program operable on the processor 22.

The memory 21 may comprise a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.

And the processor 22 is used for implementing the information entry method combining the RPA and the AI of the embodiment when executing the program.

If the memory 21, the processor 22 and the communication interface 23 are implemented independently, the communication interface 23, the memory 21 and the processor 22 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

Optionally, in a specific implementation, if the memory 21, the processor 22 and the communication interface 23 are integrated on a chip, the memory 21, the processor 22 and the communication interface 23 may complete mutual communication through an internal interface.

The processor 22 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.

According to the information input method combining the RPA and the AI, the matching information can be searched by utilizing the first text for positioning, model training of character recognition is not needed, and the cost of image recognition is effectively reduced while the recognition effect is ensured.

In order to implement the above embodiments, the present application provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the information entry method combining the RPA and the AI of the foregoing method embodiments is implemented.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. An information entry method combining an RPA and an AI, comprising:

the method comprises the steps that an RPA system obtains a character recognition result of a contract picture to be processed, wherein the character recognition result comprises company name information;

the RPA system carries out word segmentation processing on the company name information to obtain a word segmentation sequence of the company name information;

dividing every adjacent N participles in the participle sequence into one segment by the RPA system to obtain a plurality of segments of the company name information, wherein N is an integer greater than 1;

the RPA system searches each segment of participles based on a preset inverted index library to obtain a first candidate company name entry set of each segment of participles, wherein the first candidate company name entry set comprises a plurality of first candidate company name entries, and at least two participles of the first candidate company name entries are matched with the corresponding segment of participles;

the RPA system acquires target company name information with highest text similarity with the company name information according to the first candidate company name entries;

and the RPA system automatically inputs the contract information input table corresponding to the contract picture to be processed according to the target company name information and the company address information corresponding to the target company name information.

2. The method of claim 1, wherein said RPA system obtaining target company name information having a highest text similarity to said company name information based on a plurality of said first candidate company name terms, comprises:

the RPA system determines a second candidate company name entry set of the company name information according to a plurality of first candidate company name entries of each word segmentation, wherein the second candidate company name entries comprise at least one second candidate company name entry, and the number of the words segmented by the second candidate company name entry is the most matched with the word segmentation sequence;

and the RPA system acquires the second candidate company name entry with the highest text similarity as the target company name information according to the text similarity between the company name information and each second candidate company name entry.

3. The method of claim 1, wherein the RPA system performs a word segmentation process on the company name information to obtain a word segmentation sequence of the company name information, comprising:

the RPA system carries out word segmentation processing on the company name information to obtain a single word sequence of the company name information;

the RPA system carries out word segmentation processing on the company name information to obtain a word dimension sequence of the company sequence;

the RPA system acquires each phrase in the word dimension sequence;

the RPA system deletes the single characters corresponding to the phrases in the single character sequence aiming at each phrase to obtain a processed single character sequence;

and the RPA system generates a word segmentation sequence of the company name information according to the word group and the processed single word sequence.

4. The method of claim 2, wherein said RPA system determining a second set of candidate company name terms for said company name information from a plurality of said first candidate company name terms for each segment of a segmented word comprises:

the RPA system acquires the number of participles of each first candidate company name entry matched with the participles in the participle sequence according to the plurality of first candidate company name entries of each participle;

and the RPA system selects a first candidate company name entry with the word segmentation quantity meeting a preset condition from a plurality of first candidate company name entries to generate a second candidate company name entry of the company name information.

5. The method of claim 2, wherein the RPA system obtaining a second candidate company name entry with highest text similarity as the target company name information according to the text similarity between the company name information and each of the second candidate company name entries comprises:

the RPA system determines, for each of the second candidate company name terms, a minimum edit distance between the company name information and the second candidate company name term;

and the RPA system selects a second candidate company name entry with the minimum editing distance from a plurality of second candidate company name entries as the target company name information.

6. The method of claim 4, wherein the RPA system selects a first candidate company name entry having a number of segments satisfying a predetermined condition from a plurality of first candidate company name entries to generate a second candidate company name entry of the company name information, comprising:

the RPA system sorts a plurality of first candidate company name entries according to the sequence of the word segmentation quantity from large to small so as to obtain a sorting result;

and the RPA system selects M first candidate participle name information ranked at the top from the ranking result to generate a second candidate company name entry of the company name information, wherein M is an integer greater than or equal to 1.

7. An information entry device combining RPA and AI, wherein the device is applied to an RPA system and comprises:

the first acquisition module is used for acquiring a character recognition result of the contract picture to be processed, wherein the character recognition result comprises company name information;

the word segmentation module is used for carrying out word segmentation processing on the company name information to obtain a word segmentation sequence of the company name information;

the segmentation module is used for dividing every adjacent N participles in the participle sequence into one segment so as to obtain a plurality of segments of the company name information, wherein N is an integer greater than 1;

the retrieval module is used for retrieving each segment of participles based on a preset inverted index library to obtain a first candidate company name entry set of each segment of participles, wherein the first candidate company name entry set comprises a plurality of first candidate company name entries, and at least two participles of each first candidate company name entry match the corresponding segment of participles;

the second acquisition module is used for acquiring target company name information with highest text similarity with the company name information according to the first candidate company name entries;

and the information input module is used for automatically inputting information into the contract information table corresponding to the contract picture to be processed according to the target company name information and the company address information corresponding to the target company name information.
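
By way of non-limiting illustration, the segmentation module and the retrieval module might be sketched as follows. The sketch assumes that "every adjacent N participles" means an overlapping sliding window over the participle sequence, and that the inverted index maps each participle to the set of company name entries containing it; neither detail is fixed by the claim, and the names `make_segments`, `build_inverted_index` and `retrieve` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def make_segments(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Group every N adjacent tokens into one segment (sliding window)."""
    if n < 2:
        raise ValueError("N must be an integer greater than 1")
    if len(tokens) <= n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_inverted_index(entry_tokens: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each token to the set of entry names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, tokens in entry_tokens.items():
        for token in tokens:
            index[token].add(name)
    return index


def retrieve(segment: Tuple[str, ...], index: Dict[str, Set[str]],
             min_hits: int = 2) -> Set[str]:
    """Return entries that share at least min_hits tokens with the segment."""
    hits: Counter = Counter()
    for token in set(segment):
        for name in index.get(token, ()):
            hits[name] += 1
    return {name for name, h in hits.items() if h >= min_hits}
```

Requiring at least two matching tokens per segment keeps entries that merely share one common word, such as a city name, out of the first candidate set.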

8. The device of claim 7, wherein the second acquisition module comprises:

the determining submodule is used for determining a second candidate company name entry set of the company name information according to the plurality of first candidate company name entries of each segment of participles, wherein the second candidate company name entry set comprises at least one second candidate company name entry, and the second candidate company name entry has the largest number of participles matching the participle sequence;

and the obtaining submodule is used for obtaining the second candidate company name entry with the highest text similarity as the target company name information according to the text similarity between the company name information and each second candidate company name entry.

9. A computing device, comprising:

a memory storing executable program code;

a processor coupled with the memory;

the processor calls the executable program code stored in the memory to execute the information entry method combining RPA and AI according to any one of claims 1 to 6.

10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the information entry method combining RPA and AI according to any one of claims 1 to 6.