CN112633251A

CN112633251A - Text recognition-based target document acquisition method and device and storage medium

Info

Publication number: CN112633251A
Application number: CN202110017051.5A
Authority: CN
Inventors: 李硕鑫; 傅达
Original assignee: Shenzhen Digital Power Grid Research Institute of China Southern Power Grid Co Ltd
Current assignee: Shenzhen Digital Power Grid Research Institute of China Southern Power Grid Co Ltd
Priority date: 2021-01-07
Filing date: 2021-01-07
Publication date: 2021-04-09
Anticipated expiration: 2041-01-07
Also published as: CN112633251B

Abstract

The application discloses a text recognition-based method and device for acquiring a target document, and a storage medium. The method comprises the following steps: acquiring scanned copies of a plurality of documents to be processed; recognizing the original content of each scanned copy; acquiring license plate number data from each original content, the license plate number data comprising at least two different license plate numbers; counting the number of occurrences of each license plate number; and selecting a target scanned copy from the scanned copies of the plurality of documents to be processed according to the number of occurrences of each license plate number. The method improves the efficiency of selecting target documents.

Description

Text recognition-based target document acquisition method and device and storage medium

Technical Field

The present application relates to, but is not limited to, the field of document auditing, and in particular to a text recognition-based method and device for acquiring a target document, and a storage medium.

Background

In the field of document auditing, target documents are mostly selected and processed by manual sampling inspection. The current auditing approach lacks the ability to process unstructured data, so target documents are selected inefficiently.

Disclosure of Invention

The present application aims to solve at least one of the problems in the prior art. To this end, it provides a text recognition-based method for acquiring a target document, which can improve the efficiency of selecting target documents.

The text recognition-based method for acquiring a target document according to an embodiment of the first aspect of the application comprises the following steps: acquiring scanned copies of a plurality of documents to be processed; recognizing the original content of each scanned copy; acquiring license plate number data from each original content, the license plate number data comprising at least two different license plate numbers; counting the number of occurrences of each license plate number; and selecting a target scanned copy from the scanned copies of the plurality of documents to be processed according to the number of occurrences of each license plate number.

The text recognition-based method for acquiring a target document according to the embodiments of the application has at least the following technical effect: it improves the efficiency of selecting target documents.

According to some embodiments of the application, selecting a target scanned copy from the scanned copies of the plurality of documents to be processed according to the number of occurrences of each license plate number comprises: obtaining a target license plate according to the number of occurrences of each license plate number; and selecting the scanned copy corresponding to the target license plate from the scanned copies of the plurality of documents to be processed as the target scanned copy.

According to some embodiments of the present application, obtaining a target license plate according to the number of occurrences of each license plate number comprises: sorting the license plate numbers by number of occurrences in descending order to obtain a sorting table of the license plate numbers; and selecting the target license plate from the plurality of license plate numbers according to the sorting table.

According to some embodiments of the present application, selecting the target license plate from the plurality of license plate numbers according to the sorting table comprises: acquiring a preset sorting parameter; and selecting the license plate number corresponding to the preset sorting parameter from the sorting table as the target license plate.

According to some embodiments of the present application, acquiring the scanned copies of the plurality of documents to be processed further comprises: acquiring the document type of the documents to be processed; and obtaining the scanned copies of the documents to be processed according to the document type.

According to some embodiments of the present application, the text recognition-based method for acquiring a target document further comprises: obtaining the seal content of a document to be processed from the scanned copy of that document.

According to some embodiments of the present application, the text recognition-based method for acquiring a target document further comprises: obtaining the signature content of a document to be processed from the scanned copy of that document.

According to the second aspect of the application, the target document acquiring device based on text recognition comprises: the scanning piece acquisition module is used for acquiring scanning pieces of a plurality of documents to be processed; the identification module is used for identifying the original content of each scanning piece; the license plate number data acquisition module is used for acquiring license plate number data in each original content; the license plate number data comprises at least two different license plate numbers; the license plate number times calculation module is used for calculating the number of times of occurrence of each license plate number; and the target scanning piece acquisition module is used for selecting a target scanning piece from the scanning pieces of the multiple documents to be processed according to the number of times of occurrence of each license plate number.

According to the third aspect of the application, the target document acquiring device based on text recognition comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing: the method for acquiring the target document based on the text recognition in the embodiment of the first aspect of the application.

A storage medium according to an embodiment of the fourth aspect of the present application stores computer-executable instructions for implementing the text recognition-based method for acquiring a target document described above.

Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.

Drawings

The present application is further described with reference to the following figures and examples, in which:

FIG. 1 is a flowchart of a method for acquiring a target document based on text recognition according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for acquiring a target document based on text recognition according to another embodiment of the present application;

FIG. 3 is a flowchart of a method for acquiring a target document based on text recognition according to another embodiment of the present application;

FIG. 4 is a flowchart of a method for acquiring a target document based on text recognition according to another embodiment of the present application;

FIG. 5 is a flowchart of a method for obtaining a target document based on text recognition according to another embodiment of the present application;

FIG. 6 is a flowchart of a method for obtaining a target document based on text recognition according to another embodiment of the present application;

FIG. 7 is a flowchart of a method for obtaining a target document based on text recognition according to another embodiment of the present application;

FIG. 8 is a flowchart of a method for acquiring a target document based on text recognition according to another embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

In the description of the present application, unless otherwise expressly limited, terms such as set, mounted, connected, etc., should be construed broadly, and those skilled in the art can reasonably determine the specific meaning of the terms in the present application in view of the detailed technical solution.

In the description of the present application, reference to the description of the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

In the traditional auditing field, audit approaches are flexible, and large amounts of data such as images, sound, and articles need to be analyzed and processed. Traditional auditing relies on sampling inspection, and whether useful clues are found depends on the experience and luck of the auditors.

For some complex processing scenarios, such as those with long business processes, many form fields to fill in, and error-prone manual handling, an audit process robot is used to assist the audit work. However, such robots have long development cycles, limited application scenarios, and limited ability to process unstructured data.

In summary, current working methods and tools in the auditing field suffer from limited ability to process unstructured data, time-consuming manual data processing, long development cycles for process robots, insufficient review coverage, and low reliability of audit conclusions.

In view of this, the application provides a target document obtaining method based on text recognition, which applies a new auditing work mode in the auditing field, improves the selection efficiency of the target document, and further effectively improves the efficiency and quality of auditing work.

The text recognition-based target document acquisition method introduces emerging information technology into the auditing field, improving auditors' ability to acquire and process audit data while capturing and sharing mature audit thinking and experience. Applying technologies such as semantic analysis, speech recognition, image recognition, and character recognition improves unstructured data processing and reduces the time spent processing data manually. Using big data to cluster and analyze changes in electricity consumption widens the coverage of review work and improves the reliability of audit conclusions. Introducing robotic process automation helps auditors handle highly repetitive operations, thereby improving both quality and efficiency.

The text recognition-based method for acquiring a target document according to the embodiments of the application comprises the following steps: acquiring scanned copies of a plurality of documents to be processed; recognizing the original content of each scanned copy; acquiring license plate number data from each original content, the license plate number data comprising at least two different license plate numbers; counting the number of occurrences of each license plate number; and selecting a target scanned copy from the scanned copies of the plurality of documents to be processed according to the number of occurrences of each license plate number.

As shown in fig. 1, in some embodiments, a target document acquisition method based on text recognition includes:

S110, acquiring scanned copies of a plurality of documents to be processed;

S120, recognizing the original content of each scanned copy;

S130, acquiring license plate number data from each original content;

S140, counting the number of occurrences of each license plate number;

S150, selecting a target scanned copy from the scanned copies of the plurality of documents to be processed according to the number of occurrences of each license plate number.

In step S110, scanned copies of a plurality of documents to be processed of a given type are obtained by screening the database; the scanned copies may be in formats including, but not limited to, PDF and JPG.

In step S120, each scanned copy is processed using image and text recognition technology to obtain its original content, which includes, but is not limited to, the text content and license plate number data of the document to be processed.

In step S130, license plate number data is obtained from the recognized original content; the license plate number data includes at least two different license plate numbers.

In step S140, the number of times each license plate number appears across all documents to be processed is counted. This count indicates how frequently the license plate number appears in the documents, so that documents associated with frequently appearing license plate numbers can be audited and checked.

In step S150, a target scanned copy is selected according to the number of occurrences; the target scanned copy is the scanned copy of a document to be processed that corresponds to a frequently occurring license plate number.
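By way of non-limiting illustration, steps S130 to S150 might be sketched in Python as follows, starting from text that has already been recognized in step S120. The plate pattern is a simplified assumption (one Chinese character, one uppercase letter, five uppercase letters or digits), and the function names, the scan identifiers, and the sample texts are hypothetical; the application itself does not prescribe any of them.

```python
import re
from collections import Counter

# Assumed, simplified plate pattern: one CJK character, one uppercase letter,
# then five uppercase letters or digits. Real plate formats vary more widely.
PLATE_PATTERN = re.compile(r"[\u4e00-\u9fff][A-Z][A-Z0-9]{5}")


def extract_plates(text):
    """S130: return every plate-like string found in one scan's recognized text."""
    return PLATE_PATTERN.findall(text)


def count_plates(contents):
    """S140: count occurrences of each plate across all recognized contents.

    `contents` maps a scan identifier to its recognized text (output of S120).
    """
    counts = Counter()
    for text in contents.values():
        counts.update(extract_plates(text))
    return counts


def select_target_scans(contents, top_n=1):
    """S150: return the scans that mention any of the top_n most frequent plates."""
    counts = count_plates(contents)
    targets = {plate for plate, _count in counts.most_common(top_n)}
    return sorted(
        scan_id
        for scan_id, text in contents.items()
        if targets.intersection(extract_plates(text))
    )


# Hypothetical recognized contents for three scanned copies.
sample_contents = {
    "scan_001.pdf": "Repair invoice, vehicle 粤B12345, brake pads",
    "scan_002.pdf": "Repair invoice, vehicle 粤B12345, tires",
    "scan_003.jpg": "Repair invoice, vehicle 粤A67890, oil change",
}
print(count_plates(sample_contents))
print(select_target_scans(sample_contents, top_n=1))
```

In this sample the plate 粤B12345 appears in two scanned copies, so those two copies are the ones selected when `top_n` is 1.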

The target document acquisition method based on text recognition improves the unstructured data processing capacity, reduces the time for manually processing data, widens the coverage of inspection work, improves the reliability of audit conclusions, and improves the selection efficiency of target documents.

According to some embodiments of the present application, selecting a target scanned item from a plurality of scanned items of documents to be processed based on the number of occurrences of each license plate number comprises: obtaining a target license plate according to the occurrence frequency of each license plate; and selecting a scanning piece corresponding to the target license plate from the scanning pieces of the multiple documents to be processed as a target scanning piece.

As shown in fig. 2, in some embodiments, the method for acquiring a target document based on text recognition includes:

S210, obtaining a target license plate according to the number of occurrences of each license plate number;

S220, selecting the scanned copy corresponding to the target license plate from the scanned copies of the documents to be processed as the target scanned copy.

In steps S210 to S220, the target license plate is determined according to the number of occurrences, and the target scanned copy corresponding to the target license plate is obtained so that it can be audited and checked.
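One possible way to realize the lookup in step S220 is an index from each license plate number to the scanned copies that mention it. The following Python sketch assumes the plates of each scanned copy have already been extracted; the function names and sample data are hypothetical and are given for illustration only.

```python
from collections import defaultdict


def index_scans_by_plate(plates_by_scan):
    """Build a mapping from plate number to the set of scans that mention it."""
    index = defaultdict(set)
    for scan_id, plates in plates_by_scan.items():
        for plate in plates:
            index[plate].add(scan_id)
    return index


def scans_for_targets(index, target_plates):
    """S220: collect the scans corresponding to the given target plates."""
    selected = set()
    for plate in target_plates:
        # .get avoids creating empty entries for plates that never occurred.
        selected |= index.get(plate, set())
    return sorted(selected)


# Hypothetical extraction results: scan identifier -> plates found in it.
plates_by_scan = {
    "scan_001.pdf": ["粤B12345"],
    "scan_002.pdf": ["粤B12345", "粤A67890"],
    "scan_003.jpg": ["粤C24680"],
}
index = index_scans_by_plate(plates_by_scan)
print(scans_for_targets(index, ["粤B12345"]))
```

Building the index once lets the same extraction results serve several target license plates without re-reading the recognized text.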

According to some embodiments of the present application, obtaining a target license plate according to the number of occurrences of each license plate number comprises: sequencing according to the number of times of each license plate number from large to small to obtain a sequencing table corresponding to the license plate numbers; and selecting a target license plate from the plurality of license plate numbers according to the sorting table.

As shown in fig. 3, in some embodiments, the method for acquiring a target document based on text recognition includes:

S310, sorting the license plate numbers by number of occurrences in descending order to obtain a sorting table of the license plate numbers;

S320, selecting a target license plate from the license plate numbers according to the sorting table.

In steps S310 to S320, the target license plate is selected by choosing license plate numbers that occur frequently. The license plate numbers are sorted by number of occurrences in descending order to generate a sorting table, and the desired target license plates (for example, the first 5 records of the sorting table) are then selected from the table, thereby obtaining the target scanned copies corresponding to the target license plates.
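As a minimal sketch of step S310, the sorting table might be built as shown below. The tie-break by plate string is an assumption added here to make the order deterministic; the application does not state how license plate numbers with equal counts are ordered. The function name and sample counts are hypothetical.

```python
from collections import Counter


def build_sorting_table(counts):
    """S310: sort (plate, count) rows by count, largest first.

    Ties are broken by plate string so the table is deterministic; that
    tie-break rule is an assumption, not something the text specifies.
    """
    return sorted(counts.items(), key=lambda row: (-row[1], row[0]))


# Hypothetical occurrence counts produced by the counting step.
counts = Counter({"粤B12345": 7, "粤A67890": 3, "粤C24680": 7, "粤D13579": 1})
sorting_table = build_sorting_table(counts)
for rank, (plate, occurrences) in enumerate(sorting_table, start=1):
    print(rank, plate, occurrences)
```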

According to some embodiments of the present application, selecting a target license plate from a plurality of license plate numbers according to a ranking table includes: acquiring a preset sequencing parameter; and selecting the license plate number corresponding to the preset sorting parameter from the sorting table as a target license plate.

As shown in fig. 4, in some embodiments, the method for acquiring a target document based on text recognition includes:

S410, acquiring a preset sorting parameter;

S420, selecting the license plate number corresponding to the preset sorting parameter from the sorting table as the target license plate.

In steps S410 to S420, the preset sorting parameter is the number of target license plates to be acquired. For example, if the sorting parameter is 5, the first 5 records in the sorting table are acquired, that is, the target scanned copies corresponding to the 5 most frequently occurring license plate numbers are obtained for the auditor to review.
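Steps S410 to S420 could, for instance, be expressed as a small selection function over an already sorted table. In the sketch below the function name, the validation rule, and the sample table are assumptions made for illustration.

```python
def select_target_plates(sorting_table, sorting_parameter):
    """S410-S420: take the first `sorting_parameter` plates from the sorting table.

    `sorting_table` is a list of (plate, count) rows already sorted by count,
    largest first. If the table has fewer rows than requested, all are returned.
    """
    if sorting_parameter < 1:
        raise ValueError("sorting_parameter must be a positive integer")
    return [plate for plate, _count in sorting_table[:sorting_parameter]]


# Hypothetical sorting table with six rows.
sorting_table = [
    ("粤B12345", 9), ("粤C24680", 7), ("粤A67890", 5),
    ("粤D13579", 4), ("粤E11111", 2), ("粤F22222", 1),
]
target_plates = select_target_plates(sorting_table, sorting_parameter=5)
print(target_plates)
```

With a sorting parameter of 5, the sixth plate in the sample table is left out of the result.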

According to some embodiments of the present application, acquiring the scanned copies of the plurality of documents to be processed further comprises: acquiring the document type of the documents to be processed; and obtaining the scanned copies of the documents to be processed according to the document type.

As shown in fig. 5, in some embodiments, the method for acquiring a target document based on text recognition includes:

S510, acquiring the document type of the documents to be processed;

S520, obtaining the scanned copies of the documents to be processed according to the document type.

In steps S510 to S520, the document type includes, but is not limited to, the expense type of the document, such as vehicle maintenance fees. Once the document type is determined, the database is screened by that type to obtain the scanned copies of the documents to be processed that match it.
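The screening in step S520 might look like the following Python sketch, which uses an in-memory SQLite database purely for illustration. The table name `documents`, its columns, and the sample rows are hypothetical; the application does not describe the schema of the actual database.

```python
import sqlite3


def scans_by_document_type(conn, document_type):
    """S520: return the scan paths of all documents of the given type."""
    rows = conn.execute(
        "SELECT scan_path FROM documents "
        "WHERE document_type = ? ORDER BY scan_path",
        (document_type,),
    ).fetchall()
    return [path for (path,) in rows]


# Hypothetical schema and sample rows, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id INTEGER PRIMARY KEY, document_type TEXT, scan_path TEXT)"
)
conn.executemany(
    "INSERT INTO documents (document_type, scan_path) VALUES (?, ?)",
    [
        ("vehicle maintenance", "scans/0001.pdf"),
        ("travel", "scans/0002.pdf"),
        ("vehicle maintenance", "scans/0003.jpg"),
    ],
)
print(scans_by_document_type(conn, "vehicle maintenance"))
```

The query is parameterized so that the document type obtained in step S510 is passed as data rather than spliced into the SQL text.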

According to some embodiments of the application, the target document acquiring method based on text recognition further comprises: and obtaining the seal content in the document to be processed according to the scanned piece of the document to be processed.

As shown in fig. 6, in some embodiments, the method for acquiring a target document based on text recognition includes:

S610, acquiring a scanned copy of a document to be processed;

S620, obtaining the seal content of the document to be processed from the scanned copy of the document.

In steps S610 to S620, in a specific embodiment, the text recognition-based target document acquisition method can also recognize the text content of the seals and the number of seals in the scanned copy of the document to be processed, so that an auditor can audit the document.

According to some embodiments of the application, the target document acquiring method based on text recognition further comprises: and obtaining the signature content in the document to be processed according to the scanned piece of the document to be processed.

As shown in fig. 7, in some embodiments, the method for acquiring a target document based on text recognition includes:

S710, acquiring a scanned copy of a document to be processed;

S720, obtaining the signature content of the document to be processed from the scanned copy of the document.

In steps S710 to S720, in a specific embodiment, the text recognition-based target document acquisition method can also recognize the text content of the signatures and the number of signatures in the scanned copy of the document to be processed, so that an auditor can audit the document.

Referring to fig. 8, a method for acquiring a target document based on text recognition is described in detail in a specific embodiment. It is to be understood that the following description is illustrative only and is not intended to be in any way limiting.

S810, screening out the scanned copies of vehicle maintenance fee documents;

S820, recognizing and organizing the content of the scanned copies;

S830, acquiring the license plate numbers in the content of the scanned copies, and counting the number of occurrences of each license plate number;

S840, selecting the scanned copies of the documents corresponding to the 5 license plate numbers with the most occurrences.

The embodiment explains the target document obtaining method based on text recognition in a specific application scene, and the specific application scene is an audit scene of vehicle repair fee documents.

If falsely reported vehicle repair fees are to be found, the specific content of the repair documents attached to the financial reimbursement vouchers needs to be checked. Traditional auditing relies on sampling inspection, and whether useful clues are found depends on the experience and luck of the auditors. By combining technologies such as character recognition and license plate recognition, the text recognition-based target document acquisition method can directly count which vehicles have the most repair documents in the audit period and collect the relevant repair documents, sorted, for auditors to analyze.

In step S810, the workflow is as follows: log in to the 4A system, enter the financial system through the 4A portal, open the all-documents query function module in the financial system, and filter and export data by vehicle maintenance fee on the financial list page. The process robot then reads the exported data and uses it to obtain the scanned copies of the vehicle maintenance fee documents.

In step S820, the process robot obtains the financial data, opens the intelligent auditing system, enters the page of the text recognition tool, uploads the scanned copies obtained from the financial system to the text recognition tool, and then converts the scanned copies into text documents one by one and exports them.

In step S830, after the scanned copies have been converted to text, the license plate number recognition tool is opened and the converted texts are uploaded one by one to obtain the number of occurrences of the license plate numbers in each text.

In step S840, the process robot identifies the five license plate numbers that occur most often and places all scanned copies containing those license plate numbers in a folder for the auditor to check.
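The folder arrangement in step S840 could be sketched as below. This is only an illustration under stated assumptions: the function name and the review folder are hypothetical, the scanned copies are assumed to be ordinary files on disk, and files that share a name would overwrite one another in this simple form.

```python
import shutil
from pathlib import Path


def collect_target_scans(scan_paths, output_dir):
    """Copy the selected scanned copies into one review folder.

    Returns the destination paths in the same order as `scan_paths`.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in scan_paths:
        source = Path(path)
        destination = output_dir / source.name
        # copy2 keeps file timestamps, which can matter for audit evidence.
        shutil.copy2(source, destination)
        copied.append(destination)
    return copied
```

A caller would pass the paths of the scanned copies that contain the five selected license plate numbers together with the folder the auditor is expected to open.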

Auditors can then audit and check based on the results the process robot returns from processing the unstructured data. This effectively improves audit efficiency, reduces labor cost, and improves the coverage and reliability of audit work.

The text recognition-based target document acquisition method further includes checking project construction content for duplication, that is, finding text segments of project construction content with high similarity, with support for batch comparison across multiple subjects.
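The application does not say how similarity is measured. Purely as an illustrative assumption, a batch comparison could use the character-level ratio from Python's standard `difflib` module, as in the sketch below; the function name, the 0.8 threshold, and the sample segments are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_similar_segments(segments, threshold=0.8):
    """Return (name_a, name_b, ratio) for every pair at or above the threshold.

    `segments` maps a subject name to its project construction text. The
    similarity measure (difflib ratio) and the threshold are assumptions.
    """
    matches = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(segments.items()), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            matches.append((name_a, name_b, round(ratio, 3)))
    return matches


# Hypothetical project construction descriptions for three subjects.
segments = {
    "project_a": "Lay 10 kV cable along the north road and install two ring main units",
    "project_b": "Lay 10 kV cable along the north road and install two ring main units",
    "project_c": "Upgrade metering",
}
print(find_similar_segments(segments))
```

Comparing every pair grows quadratically with the number of subjects, so a larger batch would likely need a cheaper pre-filter before this step.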

It should be noted that the text recognition-based target document acquisition method further includes intelligent customer grouping: using big data technology, customers are grouped by computer according to their hourly changes in electricity consumption, and an attempt is made to analyze the customer categories based on those changes.

According to the embodiment of the application, the target document acquiring device based on text recognition comprises: the scanning piece acquisition module is used for acquiring scanning pieces of a plurality of documents to be processed; the identification module is used for identifying the original content of each scanning piece; the license plate number data acquisition module is used for acquiring license plate number data in each original content; the license plate number data comprises at least two different license plate numbers; the license plate number times calculation module is used for calculating the number of times of occurrence of each license plate number; and the target scanning piece acquisition module is used for selecting a target scanning piece from the scanning pieces of the multiple documents to be processed according to the occurrence frequency of each license plate number.

The target document acquisition device based on the text recognition realizes the target document acquisition method based on the text recognition, improves the unstructured data processing capability, reduces the time for manually processing data, widens the coverage of inspection work, improves the reliability of audit conclusions and improves the selection efficiency of target documents.

According to the embodiment of the application, the target document acquiring device based on text recognition comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing: the method for acquiring the target document based on the text recognition in any embodiment of the application.

A storage medium according to an embodiment of the present application stores computer-executable instructions for executing the text recognition-based target document acquisition method of any of the embodiments.

The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

The embodiments of the present application have been described in detail with reference to the drawings, but the present application is not limited to the embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present application. Furthermore, the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

Claims

1. The method for acquiring the target document based on text recognition is characterized by comprising the following steps:

acquiring scanned copies of a plurality of documents to be processed;

recognizing the original content of each scanned copy;

acquiring license plate number data from each original content, the license plate number data comprising at least two different license plate numbers;

counting the number of occurrences of each license plate number;

and selecting a target scanned copy from the scanned copies of the plurality of documents to be processed according to the number of occurrences of each license plate number.

2. The method for acquiring a target document based on text recognition according to claim 1, wherein selecting the target scanned copy from the scanned copies of the plurality of documents to be processed according to the number of occurrences of each license plate number comprises:

obtaining a target license plate according to the number of occurrences of each license plate number;

and selecting the scanned copy corresponding to the target license plate from the scanned copies of the plurality of documents to be processed as the target scanned copy.

3. The method for acquiring the target document based on the text recognition as claimed in claim 2, wherein the obtaining of the target license plate according to the number of occurrences of each license plate number comprises:

sequencing according to the number of times of each license plate number from large to small to obtain a sequencing table corresponding to the license plate numbers;

and selecting the target license plate from the plurality of license plate numbers according to the sorting table.

4. The method for acquiring the target document based on the text recognition as claimed in claim 3, wherein the selecting the target license plate from the plurality of license plate numbers according to the sorting table comprises:

acquiring a preset sequencing parameter;

and selecting the license plate number corresponding to the preset sorting parameter from the sorting table as the target license plate.

5. The method for acquiring a target document based on text recognition according to claim 1, wherein the acquiring a scanned piece of a plurality of documents to be processed further comprises:

acquiring the document type of the documents to be processed;

and obtaining the scanned copies of the documents to be processed according to the document type.

6. The method for acquiring the target document based on the text recognition as claimed in claim 1, wherein the method for acquiring the target document based on the text recognition further comprises:

and obtaining the seal content of the document to be processed from the scanned copy of the document to be processed.

7. The method for acquiring the target document based on the text recognition as claimed in claim 1, wherein the method for acquiring the target document based on the text recognition further comprises:

and obtaining the signature content in the document to be processed according to the scanned piece of the document to be processed.

8. Target document acquisition device based on text recognition, characterized by including:

the scanning piece acquisition module is used for acquiring scanning pieces of a plurality of documents to be processed;

the identification module is used for identifying the original content of each scanning piece;

the license plate number data acquisition module is used for acquiring license plate number data in each original content; the license plate number data comprises at least two different license plate numbers;

the license plate number times calculation module is used for calculating the number of times of occurrence of each license plate number;

and the target scanning piece acquisition module is used for selecting a target scanning piece from the scanning pieces of the multiple documents to be processed according to the number of times of occurrence of each license plate number.

9. Target document acquisition device based on text recognition, characterized by including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing:

a target document acquisition method based on text recognition according to any one of claims 1 to 7.

10. A storage medium having stored thereon computer-executable instructions for:

executing the target document acquiring method based on text recognition according to any one of claims 1 to 7.