CN113626655A - Method for extracting information in file, computer equipment and storage device - Google Patents
Method for extracting information in file, computer equipment and storage device
- Publication number
- CN113626655A (application number CN202110888061.6A)
- Authority
- CN
- China
- Prior art keywords
- audited
- file
- files
- information
- audit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application discloses a method for extracting information in a file, computer equipment and a storage device. The method comprises: acquiring files to be audited of a project to be audited, the files to be audited comprising a bid business file and a bid technical file related to the project to be audited; extracting audit business information from the bid business file by using a preset extraction model; and extracting audit scheme information from the bid technical file by using a regular expression based on a preset extraction rule. In this way, information in the files to be audited can be extracted to assist auditors, thereby improving audit efficiency.
Description
Technical Field
The present application relates to the field of auditing processing technologies, and in particular, to a method, a computer device, and a storage apparatus for extracting information from a file.
Background
In the course of bidding for each project, there are a lot of bidding documents for each project and bidding enterprise, and the auditing department usually needs to audit and supervise the bidding and other work of each project.
At present, in the process of manual auditing, bidding documents of each project and bidding enterprise need to be manually collected, a large number of auditors need to look up the bidding documents one by one, various auditing information needs to be manually searched and recorded, and a large number of complex operation steps such as manual screening, summarizing, counting, comparison and the like are needed. In the whole auditing process, auditors need to manually consult a large amount of bidding materials to extract valuable auditing clues from the bidding materials, and the auditing workload is large.
Disclosure of Invention
The technical problem mainly solved by the application is to provide the method for extracting the information in the file, the computer equipment and the storage device, so that the information in the file to be audited can be extracted, and auditors can be assisted in auditing, so that the auditing work efficiency is improved.
In order to solve the above problem, a first aspect of the present application provides a method for extracting information in a file, the method including: acquiring files to be audited of a project to be audited, the files to be audited comprising a bid business file and a bid technical file related to the project to be audited; extracting audit business information from the bid business file by using a preset extraction model; and extracting audit scheme information from the bid technical file by using a regular expression based on a preset extraction rule.
In order to solve the above problem, a second aspect of the present application provides a computer device, which includes a memory and a processor coupled to each other, wherein the memory stores program data, and the processor is configured to execute the program data to implement any one of the above methods for extracting information in a file.
In order to solve the above problem, a third aspect of the present application provides a storage device storing program data capable of being executed by a processor, the program data being used for implementing any one of the steps of the above method for extracting information in a file.
According to the above scheme, files to be audited of a project to be audited are acquired, the files to be audited comprising a bid business file and a bid technical file related to the project to be audited; audit business information is extracted from the bid business file by using a preset extraction model; and audit scheme information is extracted from the bid technical file by using a regular expression based on a preset extraction rule. The information required for auditing is thus extracted automatically from the files to be audited, which assists auditors, spares them from reading the files one by one, and improves audit efficiency.
Drawings
To illustrate the technical solutions of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of an audit processing method of the present application;
FIG. 2 is a flowchart illustrating an embodiment of step S12 in FIG. 1;
FIG. 3 is a schematic flowchart of an embodiment of a method for extracting information from a document according to the present application;
FIG. 4 is a flowchart illustrating an embodiment of step S23 in FIG. 3;
FIG. 5 is a flowchart illustrating an embodiment of step S13 of FIG. 1;
FIG. 6 is a flowchart illustrating an embodiment of a method for calculating similarity of texts according to the present application;
FIG. 7 is a flowchart illustrating an embodiment of step S33 of FIG. 6;
FIG. 8 is a flowchart illustrating a method for calculating similarity of texts according to another embodiment of the present application;
FIG. 9 is a schematic block diagram of an embodiment of a computer apparatus of the present application;
FIG. 10 is a schematic structural diagram of an embodiment of a memory device according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first" and "second" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The present application provides the following examples, each of which is specifically described below.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an auditing method according to an embodiment of the present application. The method may comprise the steps of:
S11: Establish an audit file library for the project to be audited, where the audit file library stores the files to be audited related to the project to be audited.
As a supervision mechanism, auditing is an independent economic supervision activity. In accordance with national regulations, auditing standards and accounting theory, it uses specialized methods to examine and supervise the authenticity, correctness, compliance, legality and profitability of an audited unit's finances, financial revenues and expenditures, operation and management activities, and related data, in order to maintain financial and economic discipline, improve management and increase economic benefits.
In the process of auditing and supervising the auditing project and the like, a large number of auditing files in the auditing project need to be audited. The present application takes the bidding project as the project to be audited for explanation, but the present application is not limited to this.
During auditing, an audit file library of the project to be audited can be established to store the files to be audited related to that project. The library can store the tender documents and bid documents of multiple projects to be audited, and the tender documents and/or the bid documents can serve as the files to be audited.
The tender document is the outline of the construction work put out to tender. It is the working basis on which the construction unit carries out the project, and it gives the bidding units all the conditions required to take part in the bidding.
The bid document is the response document that a bidder is required to prepare; it generally comprises a business document, a technical document, a quotation document and other parts. In terms of content, the bid document generally comprises three parts: a credit standing part, a business part and a technical part. The credit standing part includes the company's qualifications, an introduction to the company, and other files required by the tender document, such as the company's track record and various certificates and reports. The technical part includes the technical scheme, such as the engineering description, the design and construction scheme, bills of quantities, staffing, drawings, tables and other technical data. The business part includes the bid quotation description, the total bid price, the price list of major materials, the contract terms (general and special), and so on.
Specifically, establishing the audit file library of the project to be audited may include storing the files to be audited in the library by category, according to the project to be audited and the file format of each file. The file format includes at least one of an electronic document format and other formats, where the other formats include any one of a picture and the Portable Document Format (PDF). For example, files of different projects to be audited are stored separately, and within one project, files of different formats are stored separately. In addition, a distributed file system can be used to store the files by category in the audit file library.
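The classified storage described above can be sketched as a path-building rule. This is a minimal illustration only: the extension sets, the `audit_library` root and the function names are assumptions, not part of the application, and a real deployment would write to a distributed file system rather than build local paths.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the two format classes named in the text.
ELECTRONIC_DOC_EXTS = {".doc", ".docx", ".ppt", ".txt"}
OTHER_FORMAT_EXTS = {".pdf", ".png", ".jpg", ".jpeg"}


def classify_format(filename: str) -> str:
    """Return the format class of a file, based only on its extension."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext in ELECTRONIC_DOC_EXTS:
        return "electronic_document"
    if ext in OTHER_FORMAT_EXTS:
        return "other"
    return "unknown"


def library_path(project_id: str, filename: str) -> str:
    """Build a storage path classified first by project, then by file format."""
    return str(
        PurePosixPath("audit_library") / project_id / classify_format(filename) / filename
    )
```

For instance, `library_path("P001", "bid_A.pdf")` places the file under the project `P001` and the format class `other`.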
S12: Acquire at least one file to be audited from the audit file library, and obtain audit key information of the file to be audited.
At least one file to be audited can be acquired from the audit file library. When a target project is audited, multiple files to be audited of that project, that is, several or all of its bid files, can be acquired from the library. The audit key information needed to audit the project is then obtained from these files; which information is extracted depends on the specific audit project, for example the bidding enterprise, the bidding enterprise's qualifications, the bid quotation and the bid technical scheme. The application is not limited in this respect.
S13: Audit the file to be audited based on the audit key information.
Based on the extracted audit key information of the multiple files to be audited of each item to be audited, for example, based on the association relationship among the audit key information, each audit key information can be analyzed and mined so as to audit the files to be audited. In addition, the similarity between the files to be audited can be obtained based on the auditing key information, so that the auditing treatment is carried out on the files to be audited based on the similarity between the files to be audited.
In the embodiment, an audit file library of the project to be audited is established, and the audit file library is used for storing the file to be audited related to the project to be audited; acquiring at least one file to be audited from an audit file library, and acquiring audit key information of the file to be audited; based on the audit key information, the files to be audited are audited, so that the manual audit of the files to be audited by auditors can be avoided, the audit workload is saved, and the audit processing efficiency is improved.
In some embodiments, referring to fig. 2, the step S12 may include the following steps:
S121: If the file format of the file to be audited is one of the other formats, convert the file to be audited into an electronic document format file or structured data, where the other formats include any one of a picture and the portable document format.
Before the audit key information of the file to be audited is obtained in step S12, the method may include detecting the file format of the file to be audited and, if it is one of the other formats (a picture or the portable document format), converting the file into an electronic document format file or structured data. The electronic document format may be an electronic book format, or any format capable of recording text and image information, such as DOC (document), PPT (PowerPoint slide format) or TXT (text document). Structured data, also called row data, is data that is logically expressed and implemented as a two-dimensional table, strictly follows data format and length specifications, and is mainly stored and managed in a relational database. An example of structured data is data in Excel document format; the application is not limited thereto.
Optionally, the file to be audited may be converted into an electronic document format file or structured data as follows. Optical character recognition (OCR) may be used to recognize the character information in the file to be audited; for example, OCR may be applied to a file in picture format to convert the characters in the picture into text and obtain a recognition result. An electronic document format file or structured data of the file to be audited can then be generated from the recognition result.
Optionally, supplementary entry information for the file to be audited may also be obtained; this is information manually entered by a worker for the file to be audited. The file to be audited may then be converted into an electronic document format file or structured data based on the supplementary entry information.
Optionally, when OCR is applied to the file to be audited, supplementary entry information may be obtained for characters that cannot be recognized, that is, information manually entered by a worker for those characters. An electronic document format file or structured data of the file to be audited can then be generated from the OCR recognition result together with the supplementary entry information.
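The conversion step can be illustrated with a small sketch that combines an OCR result with manually entered characters. Everything here is hypothetical: the OCR engine is passed in as a plain callable (no particular OCR library is implied), unrecognized characters are assumed to be marked with a placeholder, and electronic documents are assumed to be UTF-8 text already.

```python
from typing import Callable, Iterable

PLACEHOLDER = "\u25a1"  # assumed marker that the OCR step emits for an unreadable character


def fill_unrecognized(ocr_text: str, manual_entries: Iterable[str],
                      placeholder: str = PLACEHOLDER) -> str:
    """Replace each placeholder, in order, with the next manually entered character."""
    entries = iter(manual_entries)
    return "".join(
        next(entries, placeholder) if ch == placeholder else ch
        for ch in ocr_text
    )


def to_text(raw: bytes, file_format: str, ocr: Callable[[bytes], str],
            manual_entries: Iterable[str] = ()) -> str:
    """Return text for a file: decode electronic documents, OCR everything else."""
    if file_format == "electronic_document":
        return raw.decode("utf-8")  # simplification: treat the document as plain text
    return fill_unrecognized(ocr(raw), manual_entries)
```

If fewer manual entries are supplied than there are placeholders, the remaining placeholders are left in the text so that they stay visible for later correction.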
Obtaining the audit key information of the file to be audited in step S12 may include the following steps:
S122: Extract audit business information from the bid business file by using a preset extraction model.
The files to be audited comprise the bid business file and the bid technical file related to the project to be audited. That is, the files to be audited include at least the bid business file and the bid technical file of the bid document. In some embodiments, the files to be audited may also include the quotation file and other parts of the bid document, which is not limited in this application.
Different extraction modes can be respectively adopted for the bidding business document and the bidding technical document to extract the auditing key information in the bidding business document and the bidding technical document.
For the bid business file, the audit business information can be extracted by using a preset extraction model. The preset extraction model is a model built with machine learning and can be trained before use; the training process is described in the embodiments below.
The audit business information extracted from the bid business file may include at least one of: qualification information of the bidding enterprise, enterprise information of the bidding enterprise, bid quotation information, delivery date, and the like.
In some embodiments, the audit business information may further include at least one of: bid item name, bid item number, bidding enterprise name, bid agent identity number, and bid time. The application does not limit the audit business information.
S123: Extract audit scheme information from the bid technical file by using a regular expression based on a preset extraction rule.
For the bid technical file, the audit scheme information can be extracted by using a regular expression based on a preset extraction rule. The preset extraction rule may be an extraction rule configured in advance based on the bid technical file of the project to be audited.
The audit scheme information extracted from the bid technology document may include: at least one of bid item name, bid item number, bid enterprise name, bid agent identification number, bid time, and chapter structure information. In addition, the chapter structure information includes at least one of a project situation, a service scheme introduction, a service process, service arrangement after the project is finished, a progress control measure, and a quality measure chapter text, which is not limited in this application.
In some embodiments, after obtaining the audit key information of the file to be audited, the audit key information may also be stored in the audit file library. For example, a bid information table and a bid technical data table may be established in the audit document library, and the bid information table may be used to store the audit business information extracted from the bid business document, that is, at least one of a bid item name, a bid item number, a bid enterprise name, a bid agent identity number, and a bid time. The bidding technique data table is used for storing auditing scheme information extracted from the bidding technique file, namely at least one of a bidding project name, a bidding project number, a bidding enterprise name, a bidding agent identity card number, bidding time, project condition, service scheme introduction, service process, service arrangement after the project is finished, a progress control measure and a quality measure chapter text.
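The two tables described above could be laid out as follows. This is only a sketch using SQLite for convenience: the table names, column names and the choice of `TEXT` for every column are assumptions, since the application lists the fields but not a schema.

```python
import sqlite3

# Fields that the text lists for both tables (column names are illustrative).
SHARED_COLUMNS = """
    bid_item_name TEXT,
    bid_item_number TEXT,
    bid_enterprise_name TEXT,
    bid_agent_id_number TEXT,
    bid_time TEXT
"""


def create_tables(conn: sqlite3.Connection) -> None:
    """Create the bid information table and the bid technical data table."""
    conn.executescript(f"""
        CREATE TABLE IF NOT EXISTS bid_info ({SHARED_COLUMNS});
        CREATE TABLE IF NOT EXISTS bid_technical_data (
            {SHARED_COLUMNS},
            project_situation TEXT,
            service_scheme_introduction TEXT,
            service_process TEXT,
            post_project_service_arrangement TEXT,
            progress_control_measures TEXT,
            quality_measures TEXT
        );
    """)
```

Keeping the chapter texts in separate columns makes it straightforward to compare the same chapter across bidders later on.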
In some embodiments, for step S12 above, the present application also provides a method of extracting information in a document. Referring to fig. 3, fig. 3 is a flowchart illustrating an embodiment of a method for extracting information from a document according to the present application. The method comprises the following steps:
S21: Acquire the files to be audited of the project to be audited; the files to be audited comprise the bid business file and the bid technical file related to the project to be audited.
The tender documents and bid documents related to the project to be audited are stored by category in the audit file library. The bid documents may include bid business files and bid technical files.
The files to be audited of the project to be audited can be acquired from the audit file library, where they are stored according to the project to be audited and the file format of each file. The file format includes at least one of an electronic document format and other formats.
After the files to be audited are acquired, any file whose format is one of the other formats is converted into an electronic document format file or structured data, where the other formats include any one of a picture and the portable document format. Converting the file to be audited into an electronic document format file or structured data includes: recognizing the file to be audited by optical character recognition and generating an electronic document format file or structured data from the recognition result; and/or obtaining supplementary entry information input for the file to be audited and converting the file into an electronic document format file or structured data based on that information.
S22: Extract audit business information from the bid business file by using a preset extraction model.
S23: Extract audit scheme information from the bid technical file by using a regular expression based on a preset extraction rule.
In this embodiment, steps S22 and S23 may be executed simultaneously or in either order; the application does not limit their execution sequence.
For the specific implementation of step S21, reference may be made to the implementation of step S12 in the above embodiment, which is not repeated here.
In this embodiment, files to be audited of a project to be audited are acquired, the files to be audited comprising a bid business file and a bid technical file related to the project to be audited; audit business information is extracted from the bid business file by using a preset extraction model; and audit scheme information is extracted from the bid technical file by using a regular expression based on a preset extraction rule. The information required for auditing is thus extracted automatically from the files to be audited, which assists auditors, spares them from reading the files one by one, and improves audit efficiency.
In some embodiments, the preset extraction model may be trained before step S22, and the trained model is then used to extract the audit business information from the bid business file.
Specifically, multiple sample files to be audited may be collected; these may be representative bid files selected from the collected bid files. Each sample is annotated with reference audit business information, which can be obtained by manually labeling the bid file according to the audit business information that needs to be extracted from it.
The sample files to be audited are input into the preset extraction model, and the model is trained based on a sequence labeling algorithm. The sequence labeling algorithm may be, for example, the conditional random field (CRF) algorithm, which labels and segments ordered data and is suited to sequence labeling problems. During training, the reference audit business information in the sample files is labeled and segmented.
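To make the labeling step concrete, the sketch below turns manually annotated spans into per-token tags of the kind a CRF sequence labeler is usually trained on. The BIO tagging scheme, the label names and the function name are assumptions for illustration; the application only states that a sequence labeling algorithm such as CRF is used.

```python
from typing import List, Sequence, Tuple


def spans_to_bio(tokens: Sequence[str],
                 spans: Sequence[Tuple[int, int, str]]) -> List[str]:
    """Convert annotated spans (start, end exclusive, label) into BIO tags.

    Tokens outside every span get "O"; the first token of a span gets
    "B-<label>" and the remaining tokens of that span get "I-<label>".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags
```

Each (tokens, tags) pair produced this way would form one training sequence; the trained model then predicts the tags for unseen bid business files, and tagged runs are read back as extracted fields.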
In some embodiments, referring to fig. 4, the step S23 may include the following steps:
S231: Configure preset extraction rules for the bid technical file according to the chapter structure of the bid technical file.
The preset extraction rules can be configured for the bidding technical files according to the chapter structures of the bidding technical files. For example, the chapter structure of a bid technology document includes: project condition, service scheme introduction, service process, service arrangement after the project is finished, progress control measures and quality measures. A corresponding preset extraction rule may be configured for each chapter structure.
S232: Extract the chapter structure information of the bid technical file by using a regular expression based on the preset extraction rules, so as to obtain the audit scheme information.
The preset extraction rules can be implemented with regular expressions. A regular expression is a logical formula that operates on character strings: a "rule string" is composed of predefined specific characters and combinations of them, and it expresses a filtering logic to be applied to character strings. Regular expressions can be used to retrieve or replace text that conforms to a certain pattern (rule).
By adopting the regular expression, the structure information of each section in the bidding technical file can be extracted, and the structure information of each section is used as auditing scheme information. The chapter structure information comprises at least one of project conditions, service scheme introduction, service process, service arrangement after the project is finished, progress control measures and quality measure chapter texts.
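A minimal sketch of rule-based chapter extraction follows. The heading patterns, field names and the assumption that each chapter starts with an optionally numbered heading on its own line are all illustrative; real rules would be configured for the actual layout of the bid technical files.

```python
import re
from typing import Dict

# Hypothetical rules: one heading pattern per chapter named in the text.
CHAPTER_RULES: Dict[str, str] = {
    "project_situation": r"Project Situation",
    "service_scheme_introduction": r"Service Scheme Introduction",
    "service_process": r"Service Process",
    "post_project_service_arrangement": r"Service Arrangement After Project Completion",
    "progress_control_measures": r"Progress Control Measures",
    "quality_measures": r"Quality Measures",
}


def extract_chapters(text: str, rules: Dict[str, str] = CHAPTER_RULES) -> Dict[str, str]:
    """Return the body text of every configured chapter found in the document."""
    hits = []
    for field, title in rules.items():
        # A heading is a whole line, optionally prefixed by a number such as "1." or "2.3".
        pattern = re.compile(
            rf"^[ \t]*(?:\d+(?:\.\d+)*[.)]?[ \t]*)?{title}[ \t]*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        if match:
            hits.append((match.start(), match.end(), field))
    hits.sort()
    chapters = {}
    for i, (_, end, field) in enumerate(hits):
        # A chapter body runs from its heading to the next recognized heading.
        next_start = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        chapters[field] = text[end:next_start].strip()
    return chapters
```

Chapters whose heading is not found are simply absent from the result, which lets the caller detect bid files that deviate from the expected structure.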
In this application, the audit business information and/or audit scheme information obtained in steps S22 and S23 may be used to audit the files to be audited of the project to be audited. The audit business information and/or audit scheme information serve as audit key information, and auditing the files may include: auditing the files to be audited of the project by using the association relationships among the audit business information and/or audit scheme information of the files; and obtaining, by a preset similarity calculation method, the similarity between the audit scheme information of the files to be audited in the project, taking files whose similarity is greater than a preset threshold as abnormal files, and generating an audit result for the project. For details, see the following embodiments.
In some embodiments, referring to fig. 5, in the step S13, performing audit processing on the file to be audited based on the audit key information may include the following steps:
S131: Audit the files to be audited by using the association relationships among the audit key information of the files to be audited.
The audit business information and/or the audit scheme information may be used as the audit key information. That is, the audit key information may include at least one of: qualification information of the bidding enterprise, enterprise information of the bidding enterprise, bidding quotation information, delivery date, bidding project name, bidding project number, bidding enterprise name, bidding agent identity number, bidding time, project conditions, service scheme introduction, service process and service arrangement after the project is finished, progress control measures, and quality measures.
For example, by analyzing the association relationships among the projects to be audited, the bidding enterprises, and the bidding agents, a list of enterprises that frequently change bidding agents can be obtained. The association relationships among the audit key information can thus be analyzed and mined, and the analysis results provide reference value for auditing.
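As a rough illustration of this kind of association analysis, the sketch below groups bid records by enterprise and reports enterprises that appear with several different bidding agents. The record fields (`enterprise`, `agent_id`), the function name, and the `min_agents` cutoff are hypothetical; the application does not define a data model or a criterion for "frequent" changes.

```python
from collections import defaultdict


def enterprises_changing_agents(records, min_agents=2):
    """Return enterprises that bid through at least `min_agents` distinct agents.

    `records` is an iterable of dicts with the hypothetical keys
    "enterprise" and "agent_id" (one dict per audited bid file).
    """
    agents_by_enterprise = defaultdict(set)
    for rec in records:
        agents_by_enterprise[rec["enterprise"]].add(rec["agent_id"])
    return {
        enterprise: sorted(agents)
        for enterprise, agents in agents_by_enterprise.items()
        if len(agents) >= min_agents
    }
```

The same grouping could be inverted (agent to enterprises) to look for one agent acting on behalf of several bidders in the same project.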
S132: Acquire the similarity between the files to be audited in the project to be audited by using a preset similarity algorithm; take the files to be audited whose similarity is greater than the preset threshold as abnormal files, and generate the audit result of the project to be audited.
The similarity between the files to be audited in the project to be audited can be obtained with a preset similarity algorithm based on the audit key information of each file, so that the files can be audited based on the similarity. The repetition rate of each file to be audited is obtained from the similarity between the files, that is, duplicate checking is performed on each file, so as to extract a list of bidding enterprises whose files to be audited are similar.
Optionally, the similarity between the files to be audited of the project to be audited can be obtained with a preset similarity algorithm based on the audit scheme information in the bidding technical file, where the preset similarity algorithm can be a text similarity algorithm based on edit distance. The preset threshold can be set to 0.4 to 0.6. If the similarity between two files to be audited is greater than the preset threshold, the two files may be the same; the files to be audited whose similarity is greater than the preset threshold are taken as abnormal files, and the audit result of the project to be audited is generated based on the abnormal files. In addition, the chapter structure information with high similarity in the two files to be audited can be used as audit evidence.
In some embodiments, for the step S13, the present application provides a text similarity calculation method. Referring to fig. 6, fig. 6 is a flowchart illustrating an embodiment of a method for calculating similarity of texts according to the present application. The method comprises the following steps:
S31: Acquire a plurality of files to be audited of the project to be audited, and acquire the chapter structure information of the files to be audited.
The files to be audited of the project to be audited can be obtained from an audit file library, in which the files are stored in categories according to the project to be audited and the file format. The file formats include at least one of an electronic document format and other formats, where the other formats may include a picture format or a portable document format; the present application is not limited thereto.
Optionally, the files to be audited in this embodiment may include the bidding technical file in the bidding file. When the chapter structure information of a file to be audited is acquired, a preset extraction rule can be configured for the bidding technical file according to its chapter structure, so that the chapter structure information of the bidding technical file is extracted with regular expressions based on the preset extraction rule and used as the chapter structure information of the file to be audited in this embodiment.
The chapter structure information of the file to be audited can comprise the chapter text of at least one of: project conditions, service scheme introduction, service process, service arrangement after the project is finished, progress control measures, and quality measures.
For the specific implementation of this step in this embodiment, reference may be made to the implementation process of the above embodiments, which is not repeated here.
S32: Determine the similarity of each corresponding chapter structure between the files to be audited, based on the chapter structure information of the files to be audited.
The similarity of each corresponding chapter structure between the files to be audited can be determined with a text similarity calculation method based on edit distance. That is, a similarity can be obtained for each corresponding chapter structure, namely project conditions, service scheme introduction, service process, service arrangement after the project is finished, progress control measures, and quality measures. The edit distance (ED) between two text strings is the minimum number of editing operations required to convert one string into the other. The editing operations are: inserting a character, deleting a character, and modifying (substituting) a character. The minimum edit distance directly reflects the degree of difference between the two texts: the more similar the two texts are, the smaller the edit distance.
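For reference, the edit distance with these three operations (commonly called the Levenshtein distance) can be computed with standard dynamic programming. The following is a minimal sketch; the function name `edit_distance` is illustrative, and the application does not state which implementation it uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute ca with cb (or keep it)
                )
            )
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table keeps memory proportional to the length of `b`, which matters when whole chapter texts are compared.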
S33: Determine the similarity between the plurality of files to be audited based on the similarity of each corresponding chapter structure between the files and the weight of each chapter structure.
Weighting, such as weighted summation or weighted averaging, can be performed based on the similarity of each chapter structure of the files to be audited and the weight of each chapter structure, and the weighting result can be taken as the similarity between the files to be audited.
In this embodiment, a plurality of files to be audited of the project to be audited are acquired, together with their chapter structure information; the similarity of each corresponding chapter structure between the files is determined based on that chapter structure information; and the similarity between the files is determined based on the per-chapter similarities and the weight of each chapter structure. By obtaining the similarity between the files to be audited, the files of a large number of projects can be analyzed and similar files can be found, which assists auditors and improves audit efficiency.
In some embodiments, referring to fig. 7, the step S33 may include the following steps:
S331: Normalize the similarity of each chapter structure of the files to be audited, and take the result of the normalization as the similarity of each corresponding chapter structure between the files to be audited.
The similarity of each corresponding chapter structure between the files to be audited can be normalized so that it ranges from 0 to 1, and the normalized result is used as the similarity of that chapter structure. The closer the similarity of corresponding chapter structures is to 1, the more similar the two chapter structures are; the closer it is to 0, the less similar they are.
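The application does not state the normalization formula. One common choice, shown here purely as an assumption, divides the edit distance by the length of the longer text and subtracts the result from 1, so that identical texts score 1 and texts with nothing in common score 0.

```python
def normalized_similarity(distance: int, len_a: int, len_b: int) -> float:
    """Map an edit distance to a similarity in [0, 1].

    Assumed formula: 1 - distance / max(len_a, len_b). The edit distance
    never exceeds the longer length, so the result stays within [0, 1].
    """
    longest = max(len_a, len_b)
    if longest == 0:
        # Two empty chapter texts are treated as identical.
        return 1.0
    return 1.0 - distance / longest
```

For example, an edit distance of 3 between texts of length 6 and 7 gives a similarity of 1 - 3/7.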
S332: Compute the weighted average of the similarities of the chapter structures of the files to be audited, using the weight corresponding to each chapter structure, to obtain the similarity between the files to be audited.
A corresponding weight can be set for each chapter structure, and the weights can be set according to the bidding technical file of the specific project to be audited, which is not limited in the present application. For example, among the chapter structures, the weight of project conditions is 0.1, the weight of service scheme introduction is 0.4, the weight of service process and service arrangement after the project is finished is 0.2, the weight of progress control measures is 0.15, and the weight of quality measures is 0.15.
Among the plurality of files to be audited, for every two files, the similarities of the chapter structures are weighted-averaged with the corresponding weights, and the weighted average is taken as the similarity between the two files. In this way, the similarity between every two files to be audited can be obtained. Among the similarities between the current file to be audited and the other files to be audited, the highest one is taken as the similarity of the current file.
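The weighted average can be sketched as follows, using the example weights given above. The dictionary keys and the function name `document_similarity` are illustrative, and dividing by the total weight of the chapters actually present is an assumption about how a missing chapter might be handled.

```python
# Example weights from the description; keys are illustrative names.
WEIGHTS = {
    "project_conditions": 0.10,
    "service_scheme_introduction": 0.40,
    "service_process_and_post_project_arrangement": 0.20,
    "progress_control_measures": 0.15,
    "quality_measures": 0.15,
}


def document_similarity(chapter_sims, weights=WEIGHTS):
    """Weighted average of per-chapter similarities (each in [0, 1]).

    `chapter_sims` maps a chapter name to the normalized similarity of that
    chapter between two files. Only chapters present in `chapter_sims` count.
    """
    total_weight = sum(weights[name] for name in chapter_sims)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[name] * sim for name, sim in chapter_sims.items())
    return weighted / total_weight
```

With chapter similarities of 0.2, 0.9, 0.5, 0.3, and 0.3 in the order listed above, the weighted average is 0.02 + 0.36 + 0.10 + 0.045 + 0.045 = 0.57, which would exceed a threshold of 0.5.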
Optionally, the similarity between the files to be audited can be used for auditing the files to be audited of the project to be audited. Specifically, if the similarity of the files to be audited is greater than the preset threshold, the files to be audited with the similarity greater than the preset threshold are used as abnormal files, and an auditing result of the project to be audited is generated.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
Referring to fig. 8, fig. 8 is a schematic flowchart illustrating a method for calculating similarity of texts according to another embodiment of the present application. The method comprises the following steps:
S40: Acquire a plurality of files to be audited of the projects to be audited, and acquire the chapter structure information of the files to be audited.
S41: Select one project from the projects to be audited as the target audit project.
S42: Select any two files to be audited from the files to be audited in the target audit project as the target files to be audited.
S43: Determine the similarity of each corresponding chapter structure between the target files to be audited, based on the chapter structure information of the target files.
S44: Determine the similarity between the target files to be audited based on the similarity of each corresponding chapter structure and the weight of each chapter structure.
S45: Judge whether the similarity between the target files to be audited is greater than a preset threshold.
If it is greater than the preset threshold, execute step S46; otherwise, execute step S47.
S46: Take the target files to be audited whose similarity is greater than the preset threshold as abnormal files, and generate the audit result of the target files to be audited.
The similar chapter structure information in the target files to be audited, the bidding enterprises and bidding agents corresponding to the target files, the target project for which they bid, and the like can be used to generate the audit result of the target files to be audited.
S47: Detect whether all files to be audited under the target audit project have been traversed.
If yes, execute step S48; otherwise, return to step S42.
S48: Detect whether all projects to be audited have been traversed.
If yes, execute step S49; otherwise, return to step S41.
S49: Output the audit result of the files to be audited in the projects to be audited.
The audit results generated for the abnormal files in step S46 can be collected, and the bidding enterprise corresponding to each abnormal file, the project to be audited for which it bid, the bidding file itself, the similar chapter structures in the bidding file, the bidding agent, and the like can be used as the audit result of the files to be audited. In this way, audit results can be generated for the abnormal projects among the projects to be audited, the bidding enterprises corresponding to the abnormal files, and so on.
In this embodiment, reference may be made to the implementation process of the above embodiment for specific implementation of steps S40 to S49, which are not described herein again.
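The traversal in steps S40 to S49 can be summarized in a short sketch. It is a simplification under stated assumptions: `itertools.combinations` stands in for the repeated selection of two files in S42 and the traversal check in S47, the `similarity` argument is any caller-supplied function returning a score in [0, 1], and the default threshold of 0.5 is simply a value inside the 0.4 to 0.6 range mentioned earlier. The data layout and result fields are hypothetical.

```python
from itertools import combinations


def audit_projects(projects, similarity, threshold=0.5):
    """Flag pairs of files whose similarity exceeds `threshold`.

    `projects` maps a project name to a dict of {file name: file content}.
    `similarity` is a caller-supplied function (content_a, content_b) -> float.
    """
    results = []
    for project, files in projects.items():                  # S41, S48
        for (name_a, doc_a), (name_b, doc_b) in combinations(
            files.items(), 2
        ):                                                    # S42, S47
            score = similarity(doc_a, doc_b)                  # S43, S44
            if score > threshold:                             # S45
                results.append(                               # S46
                    {
                        "project": project,
                        "files": (name_a, name_b),
                        "similarity": score,
                    }
                )
    return results                                            # S49
```

Enumerating every pair once guarantees that the traversal terminates and that no pair is compared twice, at the cost of a number of comparisons that grows quadratically with the number of files in a project.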
For the above embodiments, the present application provides a computer device, please refer to fig. 9, and fig. 9 is a schematic structural diagram of an embodiment of the computer device of the present application. The computer device 500 comprises a memory 501 and a processor 502, wherein the memory 501 and the processor 502 are coupled to each other, the memory 501 stores program data, and the processor 502 is configured to execute the program data to implement the steps of any of the above-mentioned methods.
In this embodiment, the processor 502 may also be referred to as a CPU (Central Processing Unit). The processor 502 may be an integrated circuit chip having signal processing capabilities. The processor 502 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 502 may be any conventional processor or the like.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
For the method of the above embodiment, it can be implemented in the form of a computer program, so that the present application provides a storage device, please refer to fig. 10, where fig. 10 is a schematic structural diagram of an embodiment of the storage device of the present application. The storage means 600 has stored therein program data 601 executable by a processor, the program data being executable by the processor to implement the steps of any of the embodiments of the method described above.
The specific implementation of this embodiment can refer to the implementation process of the above embodiment, and is not described herein again.
The storage device 600 of this embodiment may be a medium that can store program data, such as a usb disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, or may be a server that stores the program data, and the server may transmit the stored program data to other devices for operation, or may self-operate the stored program data.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a storage device, which is a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.
Claims (10)
1. A method for extracting information from a document, comprising:
acquiring a file to be audited of a project to be audited; wherein the files to be audited comprise bidding business files and bidding technical files related to the project to be audited;
extracting audit business information in the bid business file by using a preset extraction model; and
and extracting the audit scheme information in the bidding technical file by using a regular expression based on a preset extraction rule.
2. The method of claim 1, wherein before extracting the audit business information in the bid business document using a preset extraction model, the method further comprises: training the preset extraction model;
the training the preset extraction model comprises:
collecting a file sample to be audited, wherein the file sample to be audited is marked with reference audit business information;
and inputting the file sample to be audited into the preset extraction model based on a sequence labeling algorithm so as to train the preset extraction model.
3. The method according to claim 1, wherein the extracting audit scheme information in the bidding technical document by using a regular expression based on a preset extraction rule comprises:
configuring the preset extraction rule for the bidding technical file according to the chapter structure of the bidding technical file;
and extracting the structure information of each section in the bidding technical file by adopting the regular expression based on the preset extraction rule so as to obtain the auditing scheme information.
4. The method of claim 1,
the audit business information comprises: at least one of qualification information of the bidding enterprise, enterprise information of the bidding enterprise and bidding quotation information;
the audit scheme information includes: at least one of bid item name, bid item number, bid enterprise name, bid agent identification number, bid time, and chapter structure information;
the chapter structure information comprises at least one of project conditions, service scheme introduction, service process, service arrangement after the project is finished, progress control measures and quality measure chapter texts.
5. The method of claim 1, wherein the obtaining of the document to be audited for the project to be audited comprises:
acquiring files to be audited of projects to be audited from an audit file library, wherein the files to be audited stored in the audit file library are stored in a classified mode according to the projects to be audited and the file formats of the files to be audited, and the file formats comprise: an electronic document format, and at least one of other formats.
6. The method of claim 5, wherein after said obtaining a file to be audited for an audited project, the method further comprises:
if the file format of the file to be audited is the other format, converting the file to be audited into an electronic document format file or structured data; wherein the other formats comprise any one of pictures and portable document formats.
7. The method of claim 6, wherein converting the pending file to an electronic document format file or structured data comprises:
identifying the file to be audited by using optical character recognition technology, and generating the electronic document format file or the structured data based on the recognition result; and/or
acquiring input additional recording information of the files to be audited, and converting the files to be audited into the electronic document format files or the structured data based on the input additional recording information.
8. The method of claim 1,
the audit business information and/or the audit scheme information are used for carrying out audit processing on the files to be audited of the items to be audited;
the method further comprises the following steps:
auditing the files to be audited of the items to be audited by utilizing the association relationship between the audit business information and/or the audit scheme information in the files to be audited; and/or
acquiring the similarity between the auditing scheme information of each file to be audited in the project to be audited by using a preset similarity algorithm; and taking the files to be audited with the similarity larger than a preset threshold value as abnormal files, and generating an auditing result of the items to be audited.
9. A computer device comprising a memory and a processor coupled to each other, the memory having stored therein program data for execution by the processor to perform the steps of the method of any one of claims 1 to 8.
10. A storage device, characterized by program data stored therein which can be executed by a processor for carrying out the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110888061.6A CN113626655A (en) | 2021-08-03 | 2021-08-03 | Method for extracting information in file, computer equipment and storage device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110888061.6A CN113626655A (en) | 2021-08-03 | 2021-08-03 | Method for extracting information in file, computer equipment and storage device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113626655A true CN113626655A (en) | 2021-11-09 |
Family
ID=78382554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110888061.6A Pending CN113626655A (en) | 2021-08-03 | 2021-08-03 | Method for extracting information in file, computer equipment and storage device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113626655A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800113A (en) * | 2021-02-04 | 2021-05-14 | 天津德尔塔科技有限公司 | Bidding auditing method and system based on data mining analysis technology |
Non-Patent Citations (1)
Title |
---|
刘赛,刘小海: "智能时代财务管理转型研究", 吉林人民出版社, pages: 168 - 173 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117808441A (en) * | 2024-03-01 | 2024-04-02 | 江苏省港口集团有限公司 | Bid information checking method and system |
CN117808441B (en) * | 2024-03-01 | 2024-05-10 | 江苏省港口集团有限公司 | Bid information checking method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11574204B2 (en) | Integrity evaluation of unstructured processes using artificial intelligence (AI) techniques | |
Zhaokai et al. | Contract analytics in auditing | |
US9990356B2 (en) | Device and method for analyzing reputation for objects by data mining | |
CN103154991B (en) | Credit risk is gathered | |
US7389306B2 (en) | System and method for processing semi-structured business data using selected template designs | |
CN110119413A (en) | The method and apparatus of data fusion | |
US9171072B2 (en) | System and method for real-time dynamic measurement of best-estimate quality levels while reviewing classified or enriched data | |
US20040205524A1 (en) | Spreadsheet data processing system | |
US11087409B1 (en) | Systems and methods for generating accurate transaction data and manipulation | |
US20090148048A1 (en) | Information classification device, information classification method, and information classification program | |
US11880435B2 (en) | Determination of intermediate representations of discovered document structures | |
CN111553137A (en) | Report generation method and device, storage medium and computer equipment | |
CN111427544B (en) | Software requirement document generation method and device, storage medium and electronic equipment | |
US20150170036A1 (en) | Determining document classification probabilistically through classification rule analysis | |
Sadasivam et al. | Corporate governance fraud detection from annual reports using big data analytics | |
CN112364645A (en) | Method and equipment for automatically auditing ERP financial system business documents | |
Falkner et al. | Identifying requirements in requests for proposal: A research preview | |
JP2014041442A (en) | Receipt definition data preparation device and program | |
Adnan et al. | Beyond Beall's blacklist: automatic detection of open access predatory research journals | |
CN113626655A (en) | Method for extracting information in file, computer equipment and storage device | |
CN113763143A (en) | Auditing processing method, computer equipment and storage device | |
CN113762719A (en) | Text similarity calculation method, computer equipment and storage device | |
JP6279782B1 (en) | Information processing terminal, information processing method, and program | |
TW202018616A (en) | Intelligent accounting system and identification method for accounting documents | |
KR20110093398A (en) | Device and method for managing mobile terminated service |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||