CN118312491A - Audit and doubt point library construction method based on unstructured file analysis - Google Patents

Audit and doubt point library construction method based on unstructured file analysis

Info

Publication number: CN118312491A
Authority: CN (China)
Prior art keywords: data, audit, unstructured, entity, model
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202311810937.0A
Other languages: Chinese (zh)
Inventor: 何治平
Current Assignee: Beijing Tianyi Shuzhi Technology Co ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing Tianyi Shuzhi Technology Co ltd

Abstract

The invention relates to a method for constructing an audit doubt point library based on unstructured file analysis. The method comprises: unstructured data conversion, with the results stored locally or in a database; data table construction, in which structured data tables are designed and built in the middle analysis layer from the conversion results; and unstructured audit model construction, in which key information extracted from unstructured data is associated, through set logic rules, with key business information in the business system, the converted unstructured data is analyzed together with the structured data of the business system, and the audit model is solidified. The method extracts, optimizes, processes and stores unstructured data from the different business systems of an enterprise, achieving centralized storage and real-time management and supporting fast retrieval, so that required data can be retrieved quickly, efficiently, accurately and conveniently. This helps auditors obtain internal information and improves the enterprise's decision-making.

Description

Audit and doubt point library construction method based on unstructured file analysis
Technical Field
The invention relates to database construction, and in particular to a method for constructing an audit doubt point library based on unstructured file analysis.
Background
With the development of digital technology, every industry generates large amounts of unstructured data, including text, video, audio, images and other files in different formats. Mining the valuable data they contain can give enterprise leadership a basis for studying the enterprise's development direction, identifying problems and making decisions.
At present, the business data of many enterprises is scattered and cannot be managed in a unified way, which leads to low retrieval efficiency, inconsistent storage environments, low utilization and a high technical threshold. Unstructured data is data whose structure is irregular or incomplete, that has no predefined data model, and that cannot conveniently be represented in the two-dimensional tables of a database.
In business, paper materials are steadily being replaced by electronic storage, and much of the data acquired in daily audit work is electronic unstructured data. Unstructured data conversion technology enables the storage, recognition, unified management and unified search of unstructured data; it can raise the level of enterprise data management and has considerable research and application value.
OCR (Optical Character Recognition) is a technique in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. OCR is widely used, mainly for document recognition, certificate recognition, bill recognition, license plate recognition and street view recognition. In document recognition, it can efficiently recognize many types of paper documents, including official documents, books, forms, manuals, resumes and identity documents.
OCR alone only recognizes paper documents; it does not process or store the recognized data, which for audit projects remains scattered and irregular. Keywords cannot be extracted from the converted data, associated, or searched by related meaning; the converted data cannot be automatically cross-checked, analyzed or verified; and the generated intermediate tables cannot be analyzed systematically. On the one hand, whether unstructured data is processed bears on the comprehensiveness and completeness of the audit content and directly affects the quality of internal audit. On the other hand, whether it can be processed effectively bears on audit efficiency and directly affects the results of internal audit. Effective unstructured data processing brings unstructured data fully into the audit's field of view, strengthens data mining and ensures the completeness of internal audit content.
Disclosure of Invention
To address the inconsistent storage, varied formats and large volume of unstructured data, which make it difficult to retrieve, extract and use, the invention provides a method for constructing an audit doubt point library based on unstructured file analysis. The method allows required data to be retrieved quickly, efficiently, accurately and conveniently, accumulates a knowledge base, better helps auditors obtain internal information, and improves the enterprise's decision-making.
The technical scheme adopted by the invention is as follows:
A method for constructing an audit doubt point library based on unstructured file analysis comprises the following steps:
Step S100, start;
Step S101, unstructured data conversion: an unstructured data document is converted into structured data by an unstructured data processing and analysis tool, and the structured data is stored locally or in a database;
Step S102, data table construction: structured data tables are designed and built in the middle analysis layer based on the unstructured data conversion results;
Step S103, unstructured audit model construction: key information extracted from the unstructured data is associated, through set logic rules, with key business information in the business system, so as to achieve deep fusion of the unstructured and structured data;
Step S104, data analysis: according to the audit focus, the audit rules and the results of the data source search, SQL scripts are written to analyze the converted unstructured data together with the structured data of the business system, and the audit model is solidified;
Step S105, end.
Ensuring the comprehensiveness and completeness of the audit content is an important task in the audit process.
Preferably, the audit content of step S103 comprises the following specific steps:
Step S201, define the audit objectives and scope; this helps determine what needs to be audited and ensures that the audit work covers all relevant areas.
Step S202, formulate a detailed audit plan according to the audit objectives;
The plan should include the audit schedule, resource allocation, audit methods, risk assessment and so on, and should take into account all factors that may affect the audit results.
Step S203, collect sufficient information: during the audit, enough information must be collected to support the audit conclusions;
This includes obtaining relevant data, files and records from sources inside and outside the company. The information must be accurate and complete so that it can be analyzed effectively.
Step S204, evaluate the internal control system: auditors need to evaluate the company's internal control system to determine its effectiveness and completeness;
The internal control system covers the company's organizational structure, segregation of duties, authorization and approval processes, risk management and so on.
Step S205, identify potential risks: potential risks and problems must be identified during the audit;
These include issues related to financial reporting, operations and compliance. Identifying risks helps ensure the completeness of the audit and provides the company with suggestions for improvement.
Step S206, apply data analysis tools, which help auditors process and analyze large amounts of data quickly;
Methods such as data mining and trend analysis can reveal abnormal fluctuations, errors or fraud, helping to ensure the completeness of the audit.
Professional ethics standards must be followed throughout the audit;
Auditors must remain independent and objective and avoid bias and conflicts of interest, so that the audit results are fair and credible.
Step S207, summarize and report: after the audit is completed, the audit results must be summarized and reported;
The report should describe the audit results clearly and accurately, including the problems found and the suggested improvements. It must be comprehensive and complete so that company management and other stakeholders can understand the results and take appropriate measures.
Preferably, the audit steps for non-compliant purchases of local specialty products and high-end tobacco and alcohol are as follows:
The enterprise checks whether staff have made non-compliant purchases of local specialty products and high-end tobacco and alcohol by extracting the expenses booked under cost accounts such as business reception fees, publicity fees, conference fees and other fees.
Step S301, content recognition: XML-based unstructured data text recognition is used to recognize the invoice issuer's unit and information and commodity details such as name, specification, quantity and unit price;
Step S302, information extraction: keyword extraction rules are formulated for the different invoice formats, and the information in the invoice is extracted using XML-based unstructured data text recognition;
Step S303, data processing: the extracted data table is cleaned and processed, and redundant and erroneous data are deleted, so that the data is clear and consistent with the original invoice content;
Step S304, data extraction: SQL scripts are written against the back-end tables of the financial and invoice-related systems according to the business logic rules, and the relevant system records are extracted;
Step S305, model building: the processed invoice data and the data stored in the business system are compared according to business rules, such as the time at which the business should have occurred, the business attributes, the voucher summary and the invoice purchase records; thresholds, price fluctuation ranges, purchase frequencies and so on are set; and the audit model is built by writing SQL scripts;
Step S306, result output: check whether the invoice issuer's name contains relevant words, for example "trading", "specialty", "tobacco and alcohol" or "trading firm", and whether the commodity name contains relevant words, for example "tobacco", "alcohol" or "gift box";
In particular, practice across multiple audit projects has shown that some grassroots units use disguised variants, such as issuing invoices for "purchased goods", "water", "office supplies", "a batch of stationery" or "tea", to conceal the actual purchase.
Step S307, audit plan execution: after identifying, classifying and analyzing the original voucher and invoice information, the auditors can characterize and confirm the real transactions through external inquiries, interviews, surprise physical inventory counts and observation;
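The keyword check described in step S306 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `Invoice` fields, the `screen_invoice` helper and the English keyword lists are all hypothetical stand-ins for the words the text gives as examples.

```python
from dataclasses import dataclass

# Illustrative keyword lists (assumptions); a real audit would maintain
# these per invoice format and in the original language of the invoices.
ISSUER_KEYWORDS = ("trading", "specialty", "tobacco and alcohol")
ITEM_KEYWORDS = ("tobacco", "alcohol", "gift box")


@dataclass
class Invoice:
    issuer: str      # name of the invoice issuer
    item_name: str   # commodity name on the invoice
    amount: float


def screen_invoice(inv: Invoice) -> list[str]:
    """Return the reasons an invoice is flagged as a doubt point."""
    issuer = inv.issuer.lower()
    item = inv.item_name.lower()
    reasons = [f"issuer contains '{k}'" for k in ISSUER_KEYWORDS if k in issuer]
    reasons += [f"item contains '{k}'" for k in ITEM_KEYWORDS if k in item]
    return reasons


invoices = [
    Invoice("Lakeside Specialty Store", "Gift box", 1800.0),
    Invoice("City Stationery Co", "Printer paper", 120.0),
]
# Keep only invoices with at least one reason.
flagged = {inv.issuer: screen_invoice(inv) for inv in invoices if screen_invoice(inv)}
print(flagged)
```

Plain substring matching would miss the disguised invoice descriptions mentioned above, which is why the text follows the keyword check with manual verification in step S307.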
Preferably, unstructured data conversion is the process of analyzing, recognizing and processing an image file, obtaining the text and layout information, and translating it into computer text; that is, the characters in a scanned document are recognized and output as text for further use;
after entity and relationship extraction and recognition are completed, the entities and relationships are stored in JSON format, and entity data is requested from the back end when a user queries an entity relationship; the steps for retrieving the unstructured text recognition results are as follows:
Step S400, start;
Step S401, the user selects a model and the model data is obtained, that is, all entity relationship data of the selected model;
Step S402, determine whether the currently selected model contains entities and relationships;
if it does not, jump to step S403; otherwise jump to step S404;
Step S403, remind the user to reselect a model;
Step S404, obtain the entity and relationship data;
The back end and the front end are two important components of a software system. The back end is mainly responsible for business logic, data storage and computation, and is usually written in a programming language such as Java. The front end is mainly responsible for the user interface and user interaction, and is usually implemented with HTML, CSS and JavaScript. In a Web application the two communicate over HTTP: the front end sends a request, the back end processes it and returns response data, and the front end updates the user interface accordingly.
Step S405, send an AJAX request to the back end; the data the back end returns to the front end is a JSON array in which each item is one relationship;
Step S406, query the entity relationship by entering the Chinese name of the entity;
Step S407, select the search type, either a one-degree or a two-degree query; a one-degree query shows only the nodes directly related to the target entity, while a two-degree query shows all entities directly or indirectly related to it; once all inputs are complete, clicking Query returns the corresponding results;
Step S408, build the adjacency matrix table;
Step S409, perform a breadth-first traversal;
Step S410, determine whether all relationships are to be displayed; if so, jump to step S411, otherwise jump to step S412;
Step S411, obtain the nodes and edges that have a direct or indirect relationship with the target; jump to step S413;
Step S412, obtain all traversed nodes and edges; jump to step S413;
Step S413, display the search results;
Step S414, end.
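Steps S408 and S409 can be illustrated with a short sketch. It rests on several assumptions not stated in the text: each relationship item has `source`, `relation` and `target` keys, relationships are treated as undirected, a two-degree query means at most two hops, and a dictionary of adjacency lists stands in for the adjacency matrix table. The function name `query_relations` is hypothetical.

```python
from collections import deque


def query_relations(relations, target, max_degree):
    """Breadth-first search returning entities within max_degree hops of target."""
    # Build the adjacency structure from the relationship items.
    adjacency = {}
    for rel in relations:
        adjacency.setdefault(rel["source"], []).append(rel["target"])
        adjacency.setdefault(rel["target"], []).append(rel["source"])
    if target not in adjacency:
        return []

    depth = {target: 0}       # hop count of every visited entity
    queue = deque([target])   # FIFO queue drives the breadth-first order
    while queue:
        node = queue.popleft()
        if depth[node] == max_degree:
            continue          # do not expand beyond the requested degree
        for neighbour in adjacency[node]:
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return sorted(n for n in depth if n != target)


relations = [
    {"source": "Company A", "relation": "supplier", "target": "Company B"},
    {"source": "Company B", "relation": "legal representative", "target": "Person C"},
    {"source": "Person C", "relation": "shareholder", "target": "Company D"},
]
one_degree = query_relations(relations, "Company A", 1)
two_degree = query_relations(relations, "Company A", 2)
print(one_degree, two_degree)
```

If a two-degree query is instead meant to reach every indirectly related entity, the hop limit could simply be dropped.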
Preferably, the model of step S401 is built according to business requirements and includes entity relationships, data table structures, field definitions and so on;
the model data comprises entity and relationship data obtained through unstructured data conversion;
unstructured data conversion converts an unstructured data document into structured data by means of an unstructured data processing and analysis tool, after which structured data tables are designed and built in the middle analysis layer according to the conversion results;
The model is built through the following specific steps:
Step S501, determine the audit scope and content: according to the enterprise's strategic objectives, risk assessment and audit plan, determine the business processes, risk areas and organizational units that require digital internal audit, and define the audit objectives and the content to be audited, such as financial statements and the internal control system.
Step S502, collect and integrate audit data: obtain the relevant data, such as financial reports, business data and internal management rules, from each business system and platform, and integrate it onto a unified data analysis platform through data cleaning, conversion and standardization.
Step S503, data collation and analysis: sort and analyze the processed data and information to find the anomalies, trends and relationships they contain;
Step S504, identify entities and relationships by analyzing the data and information;
An entity generally refers to an object that can be clearly identified, such as a person, organization or product, while a relationship refers to a connection or interaction between entities, such as a parent-child, colleague or trading relationship.
Step S505, build the audit model according to the identified entities and relationships;
The audit model may include various metrics, thresholds and algorithms for discovering anomalies, trends and relationships in the data and for providing risk assessments and recommendations.
Step S506, verification and testing: the audit model is verified and tested to ensure its effectiveness, accuracy and reliability.
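As a rough illustration of an audit model built from a metric and a threshold (step S505) and then checked against a known case (step S506), consider the sketch below. The purchase-frequency rule, the record fields and the threshold value are all hypothetical; the text does not specify any particular rule.

```python
from collections import Counter


def frequent_suppliers(purchases, max_per_month):
    """Flag (supplier, month) pairs whose purchase count exceeds the threshold."""
    # Metric: number of purchases per supplier per month.
    counts = Counter((p["supplier"], p["month"]) for p in purchases)
    # Threshold rule: anything above max_per_month is a doubt point.
    return sorted(key for key, n in counts.items() if n > max_per_month)


# A small hand-made data set whose expected result is known in advance,
# used to verify the rule before it is applied to real data.
purchases = [
    {"supplier": "S1", "month": "2023-05"},
    {"supplier": "S1", "month": "2023-05"},
    {"supplier": "S1", "month": "2023-05"},
    {"supplier": "S2", "month": "2023-05"},
]
hits = frequent_suppliers(purchases, max_per_month=2)
print(hits)
```

Keeping each rule as a small function with an explicit threshold parameter makes it easy to test on cases like this one before the model is solidified.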
Preferably, after entity and relationship extraction and recognition are completed, the entities and relationships are stored in JSON format, and entity data is requested from the back end when a user needs to query an entity relationship;
the user selects a model and the model data is obtained, that is, all entity relationship data of the selected model; if the selected model contains no entities and relationships, the user is reminded to reselect; if it contains entity relationship data, an AJAX request is sent to the back end, which returns a JSON array to the front end in which each item is one relationship, and the entity relationship is then queried;
the Chinese name of the entity to be queried is entered first, and a one-degree or two-degree query is selected; a one-degree query shows only the nodes directly related to the target entity, while a two-degree query shows all entities directly or indirectly related to it; once all inputs are complete, clicking Query returns the corresponding results;
The entity relationship query first builds an adjacency matrix table from the JSON data returned by the back end, then performs a breadth-first traversal using a queue, and stores all traversed nodes and edges in separate arrays. The stored format must meet the format requirements of ECharts: all node information must be stored, while the relationships (edges) are stored or not depending on whether the user chooses to display all relationships;
During modeling, the patterns and relationships present in the data are discovered through exploratory analysis and visualization;
In short, digital audit modeling must combine the audit objectives with the actual situation and use sound methods and technical means to determine whether the model contains entities and relationships.
Preferably, the entity relationship query is implemented by first building an adjacency matrix table from the JSON data returned by the back end, then performing a breadth-first traversal using a queue, and then storing all traversed nodes and edges in separate arrays; the stored format must meet the format requirements of ECharts, all node information must be stored, and whether the relationships are stored depends on whether the user chooses to display all relationships.
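The storage step can be sketched as follows, under stated assumptions. The output is shaped like the node and link arrays that an ECharts graph series takes as `data` and `links`, which is a guess at the chart type used. The `to_graph_arrays` helper, the relationship keys, and the reading that only edges touching the target are kept when the user does not choose to display all relationships are likewise hypothetical.

```python
import json


def to_graph_arrays(json_text, traversed, target, show_all):
    """Convert traversed entities and returned relationships into node/link arrays."""
    relations = json.loads(json_text)   # JSON array, one item per relationship
    visited = set(traversed)
    # All node information is always stored.
    nodes = [{"name": name} for name in traversed]
    links = []
    for rel in relations:
        s, t = rel["source"], rel["target"]
        if s not in visited or t not in visited:
            continue                    # edge leaves the traversed subgraph
        # Edges are stored depending on the user's display choice.
        if show_all or target in (s, t):
            links.append({"source": s, "target": t, "value": rel["relation"]})
    return {"nodes": nodes, "links": links}


json_text = json.dumps([
    {"source": "A", "relation": "supplier", "target": "B"},
    {"source": "B", "relation": "shareholder", "target": "C"},
])
full = to_graph_arrays(json_text, ["A", "B", "C"], "A", show_all=True)
direct = to_graph_arrays(json_text, ["A", "B", "C"], "A", show_all=False)
print(len(full["links"]), len(direct["links"]))
```

A front end could then pass `nodes` and `links` to the chart as the series data and links.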
A method for constructing an audit doubt point library based on unstructured file analysis comprises unstructured data acquisition, unstructured data structuring, data analysis and output of doubt points;
Unstructured data acquisition: the construction contracts of each unit and the corresponding bid-winning notices are downloaded in batches from the transit system;
Unstructured data structuring: the downloaded construction contracts and bid-winning notices are unstructured data files;
data extraction rules are configured in the unstructured data processing and analysis tool, and regular expressions are written for extracting data from construction contracts and bid-winning notices;
the unstructured data files are then uploaded in batches, and the tool automatically recognizes and extracts the data according to the configured rules;
finally, the structured data extracted from the construction contracts and bid-winning notices is output as tables;
after the conversion succeeds, a contract detail table and a bid-winning notice information table are created in the database, and the unstructured data conversion results are inserted;
Data analysis: an audit model is developed in SQL, and according to the audit rules the structured data converted from unstructured data is analyzed jointly with the structured data in the ERP system.
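The joint analysis can be illustrated with a small SQL sketch run through Python's built-in `sqlite3` module. The table names, columns, sample rows and the audit rule itself (total ERP payments exceeding the contract amount) are hypothetical; the text does not specify the actual rule or the ERP schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Contract detail table filled from the unstructured data conversion results.
CREATE TABLE contract_detail (contract_no TEXT PRIMARY KEY, amount REAL);
-- Payment records as they might be extracted from the ERP system.
CREATE TABLE erp_payment (contract_no TEXT, paid REAL);
INSERT INTO contract_detail VALUES ('C-001', 100000), ('C-002', 50000);
INSERT INTO erp_payment VALUES ('C-001', 60000), ('C-001', 55000), ('C-002', 20000);
""")

# Hypothetical audit rule: flag contracts whose total payments exceed the contract amount.
DOUBT_SQL = """
SELECT c.contract_no, c.amount, SUM(p.paid) AS total_paid
FROM contract_detail AS c
JOIN erp_payment AS p ON p.contract_no = c.contract_no
GROUP BY c.contract_no, c.amount
HAVING SUM(p.paid) > c.amount
ORDER BY c.contract_no
"""
doubt_points = conn.execute(DOUBT_SQL).fetchall()
print(doubt_points)
```

Storing the query text as a named constant is one simple way to keep a rule reusable once the audit model is solidified.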
Rule-based extraction is an unstructured data recognition technique in which unstructured data is analyzed and processed according to certain rules and patterns, the key information is extracted, and it is converted into structured data. The rules and patterns may be template-based, trigger-word-based, based on dependency syntax analysis, and so on. In the unstructured data processing and analysis tool, regular expressions can therefore be written according to preset data extraction rules to extract the key information from unstructured data files. This helps auditors extract key information from unstructured data quickly and accurately and convert it into structured data for further analysis and processing.
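A minimal sketch of regular-expression extraction is shown below. The field labels ("Contract No.", "Contract amount", "Signed on"), the `RULES` table and the `extract_fields` helper are invented for illustration; real contracts would need rules written against their actual wording and layout.

```python
import re

# Hypothetical extraction rules: one regular expression per target field.
RULES = {
    "contract_no": re.compile(r"Contract No\.?\s*[:：]\s*([A-Z0-9-]+)"),
    "amount": re.compile(r"Contract amount\s*[:：]\s*([\d,]+(?:\.\d+)?)"),
    "signed_on": re.compile(r"Signed on\s*[:：]\s*(\d{4}-\d{2}-\d{2})"),
}


def extract_fields(text):
    """Apply every rule to the text and return one structured row."""
    row = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        row[field] = match.group(1) if match else None
    if row["amount"] is not None:
        # Normalize "1,250,000.00" to a number so it can be compared in SQL.
        row["amount"] = float(row["amount"].replace(",", ""))
    return row


sample = (
    "Contract No.: GC-2023-015\n"
    "Contract amount: 1,250,000.00\n"
    "Signed on: 2023-06-01"
)
row = extract_fields(sample)
print(row)
```

Fields that no rule matches are returned as `None`, so missing values stay visible and can be reviewed manually.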
Compared with the prior art, the invention has the beneficial effects that:
The method for constructing an audit doubt point library based on unstructured file analysis extracts, optimizes, processes and stores unstructured data from the different business systems of an enterprise, so that the data is stored centrally and managed in real time and can be retrieved quickly. Audit personnel can retrieve the data they need quickly, efficiently, accurately and conveniently; a knowledge base is accumulated; auditors are better able to obtain internal information; and the enterprise's decision-making is improved.
The method converts unstructured data into structured data and supports its retrieval, screening and association. In line with digital audit requirements, it builds business audit models for the various audit businesses. Using unstructured data text recognition across business fields such as engineering, marketing, finance, materials and human resources, it extracts unstructured documents from every stage of each specialty's systems and accurately parses the information they contain, which enriches the means of digital audit, broadens the range of business data available to it, and improves the efficiency of audit processing and the utilization of data.
Drawings
FIG. 1 is a control flow chart of the audit doubt point library construction method based on unstructured file analysis;
FIG. 2 is a flow chart of unstructured text recognition result retrieval in the audit doubt point library construction method based on unstructured file analysis;
FIG. 3 is a flow chart of the audit steps for non-compliant purchases of local specialty products and high-end tobacco and alcohol in the audit doubt point library construction method based on unstructured file analysis;
FIG. 4 is a flow chart of the unstructured text recognition result retrieval steps in the audit doubt point library construction method based on unstructured file analysis;
FIG. 5 is a model construction flow chart of the audit doubt point library construction method based on unstructured file analysis.
Detailed Description
The invention is described in detail below with reference to the drawings and embodiments:
As shown in FIG. 1, a method for constructing an audit doubt point library based on unstructured file analysis comprises the following steps:
Step S100, start;
Step S101, unstructured data conversion: an unstructured data document is converted into structured data by an unstructured data processing and analysis tool, and the structured data is stored locally or in a database;
Step S102, data table construction: structured data tables are designed and built in the middle analysis layer based on the unstructured data conversion results;
The data middle platform is an enterprise business data resource pool deployed on the enterprise's private cloud and serves as the unified outlet for business data integration and analysis. It uses data technologies to collect, compute, store and process massive data while unifying standards and definitions. Once the data is unified, the data middle platform produces standard data and stores it to form a big data resource layer; this data is closely tied to the enterprise's business, is unique to the enterprise and can be reused.
Designing structured data tables in the middle analysis layer generally involves the following key steps:
1. Clarify the business requirements: understand the business scenario, the data sources and the analysis objectives, and how the data is to be processed and analyzed.
2. Determine the data model: according to the business requirements, determine a data model comprising entity relationships, data table structures and field definitions;
The data model should accurately reflect the business scenario and be easy to understand and maintain.
3. Build the data table structure: once the data model is determined, define the table names, field names, field types, field constraints and so on. How the data will be stored and accessed, and how its security and consistency will be ensured, must also be considered.
4. Implement the data tables according to the designed structure, including data screening, data modeling and data storage. The integrity and security of the data must be ensured to avoid loss or damage.
Data screening: after unstructured data is extracted, it is processed into the table format required by the audit specialty, the required fields are screened out, and the data is cleaned and organized according to data type, data quality and so on.
Data modeling: a suitable data model is established according to the business requirements and data characteristics. A storage option such as a relational database or a data warehouse can be chosen, and elements such as table structures, constraints and indexes are defined.
Data storage: the cleaned and organized data is stored in the designated storage; technologies such as cloud storage and distributed file systems can be used to scale storage capacity.
5. Ensure data security: the data must be encrypted, backed up and access-controlled so that it is not leaked, damaged or illegally accessed.
6. Optimize performance: optimize the data tables to improve query and report performance, including by creating indexes, optimizing query statements and using caching.
7. Test and verify: after the data tables are implemented, test the accuracy, integrity and consistency of the data and verify the correctness of queries and reports.
8. Deploy and maintain: deploy the data tables to the production environment and maintain them, including monitoring their performance, backing up data regularly and repairing faults.
When building structured data tables, attention should also be paid to the following:
Select a suitable data storage technology, such as a relational or non-relational database.
Consider the scalability and maintainability of the data, so that new fields can easily be added or existing structures adjusted in the future.
Ensure the accuracy and consistency of the data and prevent loss or damage.
Optimize query performance so that data can be obtained and analyzed quickly.
In short, designing and building structured data tables in the middle analysis layer requires comprehensive consideration of business requirements, data models, table structures, queries and reports, so as to ensure the accuracy of the data and the performance of the tables.
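The table-design steps above (field names, types, constraints and indexes) can be sketched with a small schema. The `invoice_item` table, its columns and its constraints are hypothetical, and SQLite is used only because it ships with Python; the text leaves the choice of database open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoice_item (
    id         INTEGER PRIMARY KEY,
    invoice_no TEXT NOT NULL,
    issuer     TEXT NOT NULL,
    item_name  TEXT NOT NULL,
    quantity   REAL NOT NULL CHECK (quantity > 0),     -- field constraint
    unit_price REAL NOT NULL CHECK (unit_price >= 0),
    UNIQUE (invoice_no, item_name)                     -- guards against duplicate rows
);
-- Index to speed up queries that filter by issuer.
CREATE INDEX idx_invoice_item_issuer ON invoice_item (issuer);
""")

INSERT_SQL = (
    "INSERT INTO invoice_item (invoice_no, issuer, item_name, quantity, unit_price) "
    "VALUES (?, ?, ?, ?, ?)"
)
conn.execute(INSERT_SQL, ("INV-1", "Shop A", "Tea", 2, 35.0))

# Verification: a row that violates a constraint should be rejected.
try:
    conn.execute(INSERT_SQL, ("INV-2", "Shop B", "Paper", -1, 10.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM invoice_item").fetchone()[0]
print(rejected, row_count)
```

Declaring constraints in the schema lets the database reject inconsistent rows at insert time, which supports the accuracy and consistency goals listed above.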
Step S103, unstructured audit model construction: key information extracted from the unstructured data is associated, through set logic rules, with key business information in the business system, so as to achieve deep fusion of the unstructured and structured data;
Step S104, data analysis: according to the audit focus, the audit rules and the results of the data source search, SQL scripts are written to analyze the converted unstructured data together with the structured data of the business system, and the audit model is solidified;
Step S105, end.
Ensuring the comprehensiveness and completeness of the audit content is an important task in the audit process.
Preferably, the audit content of step S103 comprises the following specific steps:
As shown in FIG. 2, Step S201, define the audit objectives and scope; this helps determine what needs to be audited and ensures that the audit work covers all relevant areas.
Step S202, formulate a detailed audit plan according to the audit objectives;
The plan should include the audit schedule, resource allocation, audit methods, risk assessment and so on, and should take into account all factors that may affect the audit results.
Step S203, collect sufficient information: during the audit, enough information must be collected to support the audit conclusions;
This includes obtaining relevant data, files and records from sources inside and outside the company. The information must be accurate and complete so that it can be analyzed effectively.
Step S204, evaluate the internal control system: auditors need to evaluate the company's internal control system to determine its effectiveness and completeness;
The internal control system covers the company's organizational structure, segregation of duties, authorization and approval processes, risk management and so on.
Step S205, identify potential risks: potential risks and problems must be identified during the audit;
These include issues related to financial reporting, operations and compliance. Identifying risks helps ensure the completeness of the audit and provides the company with suggestions for improvement.
Step S206, apply data analysis tools, which help auditors process and analyze large amounts of data quickly;
Methods such as data mining and trend analysis can reveal abnormal fluctuations, errors or fraud, helping to ensure the completeness of the audit.
Professional ethics standards must be followed throughout the audit;
Auditors must remain independent and objective and avoid bias and conflicts of interest, so that the audit results are fair and credible.
Step S207, summarize and report: after the audit is completed, the audit results must be summarized and reported;
The report should describe the audit results clearly and accurately, including the problems found and the suggested improvements. It must be comprehensive and complete so that company management and other stakeholders can understand the results and take appropriate measures.
Preferably, the audit steps for non-compliant purchases of local specialty products and high-end tobacco and alcohol are as follows:
The enterprise extracts the expenses recorded in the cost and expense accounts under subjects such as business reception fees, publicity fees, conference fees and other fees, and checks whether staff have purchased local specialty products or high-end tobacco and alcohol in violation of regulations.
As shown in FIG. 3, step S301, identifying content: using XML-based unstructured data text recognition technology to identify the invoice issuer's unit and information, and commodity content such as name, specification, quantity and unit price;
Step S302, extracting information: formulating keyword extraction rules for the different invoice formats, and extracting the information in the invoice using the XML-based unstructured data text recognition technology;
Step S303, data processing: cleaning and processing the extracted data table and deleting redundant and erroneous data, so that the data is clear and consistent with the original invoice content;
Step S304, extracting data: writing SQL scripts against the back-end data tables of the systems related to financial invoices according to business logic rules, and extracting the relevant system record data;
Step S305, constructing a model: according to business rules linking the processed invoice data and the data stored in the business system (for example, the time at which the business should occur, business attributes, voucher summaries and invoice purchase records), setting thresholds, price fluctuation ranges, purchase frequencies, etc., and constructing the audit model by writing SQL scripts or similar means;
Step S306, outputting results: checking whether the invoice issuer contains relevant information, for example words such as "trade business", "specialty", "tobacco and alcohol" or "trading company", and paying attention to whether the commodity name contains relevant information, for example words such as "cigarettes", "alcohol" or "gift box";
In particular, the practice of multiple audit projects has revealed concealed variants at some grassroots units, such as issuing invoices for "purchased goods", "purchased water", "office supplies", "a batch of stationery" or "purchased tea" in order to mask the actual purchase transaction.
Step S307, executing the audit plan: after identifying, classifying and analyzing the original voucher and invoice information, auditors can determine the nature of the real transaction and make an audit determination through peripheral investigation, inquiries and interviews, surprise physical inventory counts and observation;
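The keyword and threshold screening described in steps S305 and S306 can be sketched as follows. This is a minimal illustration only, not the patented implementation: the `InvoiceLine` fields, the English keyword lists, the function name `flag_invoice` and the 500.0 price threshold are all assumptions introduced here, and a real deployment would use the original-language keywords and thresholds set by the auditors.

```python
from dataclasses import dataclass

# Hypothetical English renderings of the keywords mentioned in the description.
ISSUER_KEYWORDS = ("trading", "specialty", "tobacco")
ITEM_KEYWORDS = ("cigarette", "wine", "liquor", "gift box")
# Generic item names that the description says may disguise real purchases.
DISGUISED_ITEM_KEYWORDS = ("office supplies", "stationery", "tea", "water")


@dataclass
class InvoiceLine:
    issuer: str        # name of the invoice issuer
    item: str          # commodity name on the invoice
    unit_price: float  # unit price recognized from the invoice


def flag_invoice(line, price_threshold=500.0):
    """Return the list of reasons why an invoice line looks suspicious."""
    reasons = []
    issuer = line.issuer.lower()
    item = line.item.lower()
    if any(keyword in issuer for keyword in ISSUER_KEYWORDS):
        reasons.append("issuer keyword")
    if any(keyword in item for keyword in ITEM_KEYWORDS):
        reasons.append("item keyword")
    # A generic item is only flagged when its price exceeds the assumed threshold.
    if (any(keyword in item for keyword in DISGUISED_ITEM_KEYWORDS)
            and line.unit_price > price_threshold):
        reasons.append("generic item above price threshold")
    return reasons
```

An empty list means the line passed this screen; a non-empty list is only a lead for the follow-up work in step S307, not a finding in itself.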
Preferably, the unstructured data conversion is a process of analyzing, recognizing and processing an image file, obtaining the text and layout information, and translating them into computer text; the characters in the scanned document are recognized and then output and used in text form;
After entity and relationship extraction and recognition are completed, the entities and relationships are stored in JSON format, and entity data is requested from the back end when a user queries entity relationships; the steps for retrieving the unstructured text recognition results are as follows:
As shown in FIG. 4, step S400, start;
Step S401, the user selects a model and the model data is obtained, that is, all entity relationship data of the model selected by the user;
Step S402, judging whether entities and relationships exist in the current model;
If no entities or relationships exist in the currently selected model, go to step S403; otherwise go to step S404;
Step S403, reminding the user to reselect a model;
Step S404, obtaining the entity and relationship data;
The back end and the front end are two important components of a computer software system. The back end is mainly responsible for business logic, data storage, computation and similar tasks, and is usually written in a programming language such as Java. The front end is mainly responsible for the user interface and user interaction, and is usually implemented with technologies such as HTML, CSS and JavaScript. In a Web application, the back end and the front end communicate over the HTTP protocol: the front end sends a request to the back end, the back end processes the request and returns response data, and the front end updates the user interface according to the response data.
Step S405, sending an Ajax request to the back end; the data returned by the back end to the front end is a JSON array in which each item is a relationship; the entity relationships are then queried;
Step S406, querying the entity relationship: entering the Chinese name of the entity;
Step S407, selecting the search type: a one-degree or two-degree query is selected, where a one-degree query shows only the nodes directly related to the target entity and a two-degree query shows all entities directly or indirectly related to the target entity; after all inputs are completed, clicking Query returns the corresponding result;
Step S408, establishing an adjacency matrix table;
Step S409, performing breadth-first traversal;
Step S410, determining whether all relationships are to be displayed; if so, go to step S411, otherwise go to step S412;
Step S411, obtaining the nodes and edges that have a direct or indirect relationship with the target entity, then go to step S413;
Step S412, obtaining all traversed nodes and edges, then go to step S413;
Step S413, displaying the search result;
Step S414, end.
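The retrieval flow of steps S402 to S409 can be sketched as follows. This is a hedged illustration rather than the patented code: the JSON keys `source`, `target` and `relation`, the function names, and the treatment of relationships as undirected are assumptions. The `max_depth` parameter is also an assumption: `1` corresponds to the one-degree query, and `None` returns everything directly or indirectly reachable from the target entity.

```python
import json
from collections import deque


def build_adjacency(relations):
    """Return (names, matrix); matrix[i][j] is 1 when names[i] and names[j] are related."""
    names = sorted({r["source"] for r in relations} | {r["target"] for r in relations})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for r in relations:
        i, j = index[r["source"]], index[r["target"]]
        matrix[i][j] = matrix[j][i] = 1  # relationships are treated as undirected here
    return names, matrix


def query_entity(relations_json, target, max_depth=1):
    """Breadth-first search from target; returns (nodes, edges) found within max_depth."""
    relations = json.loads(relations_json)
    if not relations:
        # Corresponds to reminding the user to reselect a model.
        raise ValueError("model has no entities or relationships; select another model")
    names, matrix = build_adjacency(relations)
    if target not in names:
        return [], []
    start = names.index(target)
    depth = {start: 0}       # node index -> distance from the target entity
    queue = deque([start])
    edges = set()
    while queue:
        i = queue.popleft()
        if max_depth is not None and depth[i] >= max_depth:
            continue  # do not expand nodes that already sit at the depth limit
        for j, linked in enumerate(matrix[i]):
            if not linked:
                continue
            edges.add(tuple(sorted((names[i], names[j]))))
            if j not in depth:
                depth[j] = depth[i] + 1
                queue.append(j)
    nodes = [names[i] for i in sorted(depth, key=depth.get)]
    return nodes, sorted(edges)
```

An adjacency matrix is used because the description names it; for large, sparse relationship sets an adjacency list would use less memory with the same traversal logic.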
Preferably, the model of step s201 is built according to business requirements and includes entity relationships, data table structures, field definitions, etc.;
The model data comprises entity and relationship data and is obtained through unstructured data conversion;
The unstructured data conversion converts an unstructured data document into structured data through an unstructured data processing and analysis tool, after which a structured data table is designed and constructed in the middle analysis layer according to the conversion result;
The model construction comprises the following specific steps:
As shown in FIG. 5, step S501, determining the audit scope and content: determining the business processes, risk areas and organizational units that require digital internal audit according to the enterprise's strategic objectives, risk assessment and audit plan; the audit objectives and the content to be audited, such as financial statements and the internal control system, are defined.
Step S502, collecting and integrating audit data: obtaining relevant data, such as financial reports, business data and internal management rules, from each business system and platform, and integrating the data onto a unified data analysis platform through data cleaning, conversion and standardization;
Step S503, data collation and analysis: sorting and analyzing the processed data and information to find the anomalies, trends and relationships in them;
Step S504, identifying entities and relationships by analyzing the data and information;
An entity generally refers to an object that can be explicitly identified, such as a person, organization or product, while a relationship refers to a connection or interaction between entities, such as a parent-child relationship, a colleague relationship or a trade relationship.
Step S505, establishing an audit model: establishing a corresponding audit model according to the identified entities and relationships;
The audit model may include various metrics, thresholds and algorithms for discovering anomalies, trends and relationships in the data and for providing risk assessments and recommendations.
Step S506, verification and testing: verifying and testing the established audit model to ensure its effectiveness, accuracy and reliability.
Preferably, after entity and relationship extraction and recognition are completed, the entities and relationships are stored in JSON format, and entity data is requested from the back end when a user needs to query entity relationships;
The user selects a model and the model data is obtained, that is, all entity relationship data of the selected model; if the currently selected model has no entities or relationships, the user is reminded to reselect; if the currently selected model has entity relationship data, an Ajax request is sent to the back end, the data returned by the back end to the front end is a JSON array in which each item is a relationship, and the entity relationships are then queried;
The Chinese name of the entity to be queried is entered first, and a one-degree or two-degree query is selected, where a one-degree query shows only the nodes directly related to the target entity and a two-degree query shows all entities directly or indirectly related to the target entity; after all inputs are completed, clicking Query returns the corresponding result;
The entity relationship query first establishes an adjacency matrix table from the JSON data returned by the back end, then performs breadth-first traversal using a queue, and then stores all traversed nodes and edges in separate arrays; the stored format must meet the format requirements of ECharts, where all node information must be stored and whether the relationships are stored is determined by whether the user has chosen to display all relationships;
During the modeling process, the patterns and relationships in the data are discovered through exploratory analysis and visualization of the data;
In short, digital audit modeling needs to combine the audit objectives with the actual situation and adopt scientific methods and technical means to judge whether the model has entities and relationships.
Preferably, the entity relationship query is implemented by first establishing an adjacency matrix table from the JSON data returned by the back end, then performing breadth-first traversal using a queue, and then storing all traversed nodes and edges in separate arrays, where the stored format must meet the format requirements of ECharts, all node information must be stored, and whether the relationships are stored is determined by whether the user has chosen to display all relationships.
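The storage step described above, in which traversed nodes and edges are kept in separate arrays in a chart-ready format, might look like the sketch below. It assumes the graph series convention of a `data` array of named nodes and a `links` array with `source` and `target` keys; the extra `relation` key, the function name `to_chart_arrays`, and the choice to keep only edges touching the target entity when the user has not asked for all relationships are assumptions made for illustration, since the description does not specify them.

```python
import json


def to_chart_arrays(nodes, edges, target, show_all_relations):
    """Package traversal results as separate node and edge arrays for a graph chart."""
    # Every traversed node is always stored.
    data = [{"name": name} for name in nodes]
    # Edges are filtered by the user's display choice (assumed rule, see lead-in).
    links = [
        {"source": src, "target": dst, "relation": rel}
        for src, dst, rel in edges
        if show_all_relations or target in (src, dst)
    ]
    return {"data": data, "links": links}


if __name__ == "__main__":
    sample_edges = [("A", "B", "supplier"), ("B", "C", "shareholder")]
    print(json.dumps(to_chart_arrays(["A", "B", "C"], sample_edges, "A", True), indent=2))
```

Keeping the packaging separate from the traversal lets the same search result be displayed again with a different display choice without repeating the query.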
An audit and suspects library construction method based on unstructured file analysis comprises: unstructured data acquisition, structuring of the unstructured data, data analysis, and output of suspicious points;
Unstructured data acquisition: the construction contracts of each unit and the corresponding bid-winning notices are downloaded in batches from the transit system;
Structuring of the unstructured data: the downloaded construction contracts and bid-winning notices are unstructured data files;
Data extraction rules are configured in the unstructured data processing and analysis tool, and regular expressions are written for extracting data from the construction contracts and bid-winning notices;
The unstructured data files are then uploaded in batches, and the tool automatically recognizes and extracts the data according to the configured rules;
Finally, the structured data extracted from the construction contracts and bid-winning notices is output in tabular form;
After the structured data has been converted successfully, a contract detail table and a bid-winning notice information table are created in the database, and the unstructured data conversion results are inserted;
Data analysis: an audit model is developed in the SQL programming language, and the structured data converted from the unstructured data is analyzed jointly with the structured data in the ERP system according to the audit rules.
The rule-based extraction mode is an unstructured recognition technique that extracts data according to rules: unstructured data is analyzed and processed according to certain rules and patterns, the key information in it is extracted, and that information is converted into structured data. These rules and patterns may be template-based, trigger-word-based, based on dependency syntax analysis, and so on. Accordingly, in the unstructured data processing and analysis tool, regular expressions can be written according to the preset data extraction rules to extract key information from unstructured data files. This mode helps auditors quickly and accurately extract key information from unstructured data and convert it into structured data for further analysis and processing.
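A rule-based extraction of this kind can be illustrated with a few regular expressions. The sketch below is an assumption-laden example, not the tool's actual rule set: the field labels ("Contract No.", "Contract Amount", "Signing Date"), the output keys, and the date format are hypothetical and would differ for real contract templates.

```python
import re

# Hypothetical extraction rules: one regular expression per target field.
RULES = {
    "contract_no": re.compile(r"Contract No\.?\s*[:：]\s*([A-Z0-9-]+)"),
    "amount": re.compile(r"Contract Amount\s*[:：]\s*([\d,]+(?:\.\d+)?)"),
    "signing_date": re.compile(r"Signing Date\s*[:：]\s*(\d{4}-\d{2}-\d{2})"),
}


def extract_fields(text):
    """Apply every rule to the recognized text and return one structured row."""
    row = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        # Fields that the rules cannot find are left empty rather than guessed.
        row[field] = match.group(1) if match else None
    return row


if __name__ == "__main__":
    sample = "Contract No.: HT-2023-001\nContract Amount: 1,250,000.00\nSigning Date: 2023-05-18"
    print(extract_fields(sample))
```

Each returned row maps naturally onto one record of a table such as the contract detail table; leaving unmatched fields as `None` makes extraction failures visible during data cleaning.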
Audit rules:
A. Comparing the key information of the construction contract with that of the bid-winning notice to check whether the contract was signed in accordance with the bid-winning result and whether it was signed in time;
B. Comparing the key information of the construction contract with the ERP project purchasing and payment information to check whether purchasing and payment were carried out strictly in accordance with the contract.
The unstructured data conversion results include: a contract detail table and a bid-winning notice information table;
The ERP system business data tables include: a bid-winning case table, a contract table, a purchase order table, a project definition table, a WBS element table and a payment information table.
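Audit rule A can be sketched as a SQL join run from Python against an in-memory SQLite database. Everything concrete here is assumed for illustration: the table names `contract_detail` and `bid_notice`, their columns, the sample rows, and the 30-day signing limit passed as `max_days`. The description only states that the audit model is written in SQL and compares contract data with bid-winning notice data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bid_notice (project_code TEXT, bid_amount REAL, notice_date TEXT);
        CREATE TABLE contract_detail (contract_no TEXT, project_code TEXT,
                                      amount REAL, signing_date TEXT);
    """)
    conn.executemany("INSERT INTO bid_notice VALUES (?, ?, ?)", [
        ("P001", 1000000.0, "2023-04-01"),
        ("P002", 500000.0, "2023-04-10"),
        ("P003", 800000.0, "2023-05-01"),
    ])
    conn.executemany("INSERT INTO contract_detail VALUES (?, ?, ?, ?)", [
        ("C001", "P001", 1000000.0, "2023-04-20"),  # consistent and on time
        ("C002", "P002", 520000.0, "2023-04-25"),   # amount differs from the bid
        ("C003", "P003", 800000.0, "2023-06-15"),   # signed 45 days after the notice
    ])
    return conn


def find_suspects(conn, max_days=30):
    """Return (contract_no, amount_mismatch, late_signing) for suspicious contracts."""
    sql = """
        SELECT c.contract_no,
               c.amount <> b.bid_amount AS amount_mismatch,
               julianday(c.signing_date) - julianday(b.notice_date) > ? AS late_signing
        FROM contract_detail AS c
        JOIN bid_notice AS b ON b.project_code = c.project_code
        WHERE c.amount <> b.bid_amount
           OR julianday(c.signing_date) - julianday(b.notice_date) > ?
        ORDER BY c.contract_no
    """
    return conn.execute(sql, (max_days, max_days)).fetchall()
```

Rule B would follow the same pattern, joining the contract detail table to the ERP purchase order and payment tables instead of the bid-winning notice table.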
According to the method for constructing the audit and suspects library based on unstructured file analysis, unstructured data from the different business systems in an enterprise is extracted, optimized, processed and stored, so that the data is stored centrally, managed in real time and can be retrieved rapidly. Audit-related personnel can retrieve the required data quickly, efficiently, accurately and conveniently; the method also accumulates a knowledge base, helps auditors obtain internal information, and improves the decision-making level of the enterprise.
The method realizes the conversion of unstructured data into structured data as well as the retrieval, screening and association of unstructured data. In combination with digital audit requirements, corresponding business audit models are constructed for the various audit businesses. Based on unstructured data text recognition technology, and covering business fields such as engineering, marketing, finance, materials and human resources, unstructured documents from each link of each specialty's systems are extracted and the data information in them is parsed accurately, which further enriches digital audit means, expands the range of business data available to digital audit, and improves the processing efficiency and utilization rate of audit business.
JSON (JavaScript Object Notation) is a commonly used data exchange format that represents structured data in text form. A JSON object consists of key-value pairs, where keys are strings and values may be strings, numbers, Boolean values, arrays, objects or null.
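As a small illustration of these value types, the sketch below serializes one relationship record to JSON text and reads it back. The record and its keys are hypothetical and are not taken from the patent.

```python
import json

# Hypothetical relationship record covering the JSON value types listed above.
record = {
    "source": "Company A",             # string
    "target": "Company B",             # string
    "relation": "supplier",            # string
    "amount": 1250000.5,               # number
    "verified": True,                  # Boolean, written as true in JSON
    "tags": ["construction", "2023"],  # array
    "note": None,                      # written as null in JSON
}

text = json.dumps(record)    # Python object -> JSON text
restored = json.loads(text)  # JSON text -> Python object

if __name__ == "__main__":
    print(text)
```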
The value of dormant data is activated, and the content of audit analysis is further expanded. In the traditional audit mode, internal auditors can only analyze the structured data in the core business systems and associated information systems; however, in the huge data repository only a small part is structured data, and most is unstructured data such as images, voice and video.
With the rapid growth of business scale and the deepening of electronic operations, a large amount of paper material generated in business links such as engineering audit and financial audit is converted into electronic documents by scanning equipment and stored in an image system. The image system has accumulated massive unstructured information that has not been converted into effective data usable for internal audit work.
Through the XML-based unstructured data text recognition technology, unstructured data in the image system can be recognized and output as text data, forming an object of audit analysis. Taking the images in the expense reimbursement system as an example, fully mining the content and data value of the face information of value-added tax invoices further expands the content and scope of audit analysis and effectively increases audit value.
The dimensions of risk data are widened, and the early-warning rules are further enriched. At the current stage, internal audit perceives and identifies risks mainly by extracting risk data with an audit assistance system. Because the extraction rules for risk data are based mainly on structured data, the information contained in unstructured data is missing, and it is difficult to reconstruct the overall risk picture. Taking the authenticity and compliance of expense items as an example, sampling and screening currently rely mainly on the "account subject plus voucher" method; the risks found by the audit can only reflect transactions at the "point" of a single business or a single institution, and the dimension of the risk data is relatively narrow. By introducing the XML-based unstructured data text recognition technology, the full set of invoice images is recognized and converted into Excel electronic data, and the converted data is associated and matched with the existing structured data of the reimbursement system. This breaks through the boundaries between business and management institutions at the technical level, enriches the early-warning rules, and helps internal auditors find problems at the "surface" level and search further for audit clues.
The efficiency and accuracy of off-site audit are improved, and audit risk is further reduced. Audit sampling is the primary task of auditors in the off-site phase. In the traditional audit mode, there are two main methods for off-site analysis and audit sample extraction: first, extracting risk data based on the risk points found in previous audit inspections; second, extracting the full business list for the audit period and then screening and sampling manually according to experience. After the audit samples are selected, the auditor must log in to the relevant system to check them one by one and, once the basic facts are understood, verify and confirm them in combination with the on-site audit. With low business volumes the traditional audit sampling method is effective, but with the rapid growth of business scale, sampling techniques that depend on a small number of data samples show their limitations and expose auditors to a certain risk; to reduce the sampling risk, auditors can only increase the number of samples. Considering time and labor costs, however, reducing audit risk by increasing the sample size is clearly impractical. Applying artificial intelligence technology, including XML-based unstructured data text recognition, to solve the audit sampling dilemma under massive data is an effective way to promote the deep transformation of off-site audit. Artificial intelligence enables auditors to review all of the data and obtain relevant information from it, so that the audit covers the whole population rather than relying on a small number of data samples.
Data resources are accumulated, and the level of audit informatization is further improved. In the era of artificial intelligence, internal audit is no longer limited to detecting errors and fraud; it can focus more closely on the overall development of the enterprise, base itself on creating value, and promote high-quality development and transformation. Unstructured intelligent conversion can use its own advantages to collect, mine, summarize and deeply analyze massive data and to provide forward-looking audit suggestions from a higher-level, broader and more comprehensive perspective; this is a process of turning data into resources and resources into intelligence. Building an intelligent audit system based on artificial intelligence technology is becoming the trend of future internal audit informatization. On the one hand, taking the data produced by the XML-based unstructured data text recognition technology together with the structured data of the systems as resources, and combining the application of machine learning, the construction of intelligent analysis models for multi-dimensional risk data is promoted; on the other hand, by combining big data technology and performing association analysis on internal and external data, more audit evidence can be collected and audit value is further increased.
The above description is only of the preferred embodiment of the present invention, and is not intended to limit the structure of the present invention in any way. Any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention fall within the technical scope of the present invention.

Claims (8)

1. An audit and suspects library construction method based on unstructured file analysis is characterized by comprising the following steps:
Step S100, start;
Step S101, unstructured data conversion: converting an unstructured data document into structured data through an unstructured data processing and analysis tool, and storing the structured data locally or in a database;
Step S102, constructing a data table: designing and constructing a structured data table in a middle analysis layer based on the unstructured data conversion result;
Step S103, constructing an unstructured audit model: performing association analysis between key information extracted from the unstructured data and key business information in the business system through set logic rules, so as to realize deep fusion of the unstructured data and the structured data;
Step S104, data analysis: according to audit points of interest, audit rules and data source search results, writing SQL scripts to perform fusion analysis of the structured data converted from the unstructured data together with the structured data of the business system, and solidifying the audit model;
Step S105, end.
2. The audit and suspects library construction method based on unstructured file analysis according to claim 1, characterized in that:
The audit content of step S103 comprises the following specific steps:
Step S201, clarifying the audit objectives and the audit scope;
Step S202, formulating a detailed audit plan according to the audit objectives;
Step S203, collecting sufficient information: during the audit, sufficient information needs to be collected to support the audit conclusions;
Step S204, evaluating the internal control system: auditors need to evaluate the company's internal control system to determine its effectiveness and integrity;
Step S205, identifying potential risks: potential risks and problems need to be identified during the audit;
Step S206, applying data analysis tools: data analysis tools help auditors rapidly process and analyze large amounts of data;
Step S207, summarizing and reporting: after the audit is completed, the audit results need to be summarized and reported.
3. The audit and suspects library construction method based on unstructured file analysis according to claim 2, characterized in that:
The audit steps for non-compliant purchases of local specialty products and high-end tobacco and alcohol are as follows:
Step S301, identifying content: using XML-based unstructured data text recognition technology to identify the invoice issuer's unit and information, and commodity content including name, specification, quantity and unit price;
Step S302, extracting information: formulating keyword extraction rules for the different invoice formats, and extracting the information in the invoice using the XML-based unstructured data text recognition technology;
Step S303, data processing: cleaning and processing the extracted data table and deleting redundant and erroneous data, so that the data is clear and consistent with the original invoice content;
Step S304, extracting data: writing SQL scripts against the back-end data tables of the systems related to financial invoices according to business logic rules, and extracting the relevant system record data;
Step S305, constructing a model: constructing an audit model by writing SQL scripts according to the business rules linking the processed invoice data and the data stored in the business system;
Step S306, outputting results: checking whether the invoice issuer contains relevant information, and paying attention to whether the commodity name contains relevant information;
Step S307, executing the audit plan: after identifying, classifying and analyzing the original voucher and invoice information, auditors determine the nature of the real transaction and make an audit determination through peripheral investigation, inquiries and interviews, surprise physical inventory counts and observation.
4. The audit and suspects library construction method based on unstructured file analysis according to claim 1, characterized in that:
The unstructured data conversion is a process of analyzing, recognizing and processing an image file, obtaining the text and layout information, and translating them into computer text; the characters in the scanned document are recognized and then output and used in text form;
After entity and relationship extraction and recognition are completed, the entities and relationships are stored in JSON format, and entity data is requested from the back end when a user queries entity relationships; the steps for retrieving the unstructured text recognition results are as follows:
Step S400, start;
Step S401, the user selects a model and the model data is obtained, that is, all entity relationship data of the model selected by the user;
Step S402, judging whether entities and relationships exist in the current model;
If no entities or relationships exist in the currently selected model, go to step S403; otherwise go to step S404;
Step S403, reminding the user to reselect a model;
Step S404, obtaining the entity and relationship data;
Step S405, sending an Ajax request to the back end; the data returned by the back end to the front end is a JSON array in which each item is a relationship; the entity relationships are then queried;
Step S406, querying the entity relationship: entering the Chinese name of the entity;
Step S407, selecting the search type: a one-degree or two-degree query is selected, where a one-degree query shows only the nodes directly related to the target entity and a two-degree query shows all entities directly or indirectly related to the target entity; after all inputs are completed, clicking Query returns the corresponding result;
Step S408, establishing an adjacency matrix table;
Step S409, performing breadth-first traversal;
Step S410, determining whether all relationships are to be displayed; if so, go to step S411, otherwise go to step S412;
Step S411, obtaining the nodes and edges that have a direct or indirect relationship with the target entity, then go to step S413;
Step S412, obtaining all traversed nodes and edges, then go to step S413;
Step S413, displaying the search result;
Step S414, end.
5. The audit and suspects library construction method based on unstructured file analysis according to claim 4, characterized in that:
The model of step s201 is built according to business requirements and includes entity relationships, data table structures and field definitions;
The model data comprises entity and relationship data and is obtained through unstructured data conversion;
The unstructured data conversion converts an unstructured data document into structured data through an unstructured data processing and analysis tool, after which a structured data table is designed and constructed in the middle analysis layer according to the conversion result;
The model construction comprises the following specific steps:
Step S501, determining the audit scope and content: determining the business processes, risk areas and organizational units that require digital internal audit according to the enterprise's strategic objectives, risk assessment and audit plan;
Step S502, collecting and integrating audit data: obtaining relevant data from each business system and platform, and integrating the data onto a unified data analysis platform through data cleaning, conversion and standardization;
Step S503, data collation and analysis: sorting and analyzing the processed data and information to find the anomalies, trends and relationships in them;
Step S504, identifying entities and relationships by analyzing the data and information;
Step S505, establishing an audit model: establishing a corresponding audit model according to the identified entities and relationships;
Step S506, verification and testing: verifying and testing the established audit model to ensure its effectiveness, accuracy and reliability.
6. The method for constructing an audit doubt point library based on unstructured file parsing according to claim 4, wherein:
after extraction and identification of the entities and relationships are completed, the entities and relationships are stored in JSON format, and entity data is requested from the back end when a user needs to query an entity relationship;
the user selects a model and obtains the model data, namely all entity relationship data of the selected model; if the selected model has no entities or relationships, the user is prompted to select again; if the selected model has entity relationship data, an Ajax request is sent to the back end, the back end returns a JSON array to the front end in which each item is one relationship, and the entity relationship is then queried;
to query, the user first enters the Chinese name of the entity to be queried and selects either a first-degree query or a second-degree query, wherein a first-degree query shows only the nodes directly related to the target entity and a second-degree query shows all entities directly or indirectly related to the target entity; after all inputs are completed, clicking Query returns the corresponding result;
for the entity relationship query, an adjacency matrix table is established from the JSON data returned by the back end, a breadth-first traversal is performed using a queue, and all traversed nodes and edges are then stored in separate arrays, wherein the stored format must meet the format requirements of ECharts, all node information must be stored, and whether a relationship is stored is determined by whether the user has chosen to display all relationships.
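The query described in this claim can be sketched as follows. The sketch assumes that each item of the returned JSON array has `source`, `relation` and `target` keys, which the claim does not specify, and it treats relationships as undirected. The `max_depth` parameter is also an assumption: a value of 1 corresponds to a first-degree query, while `None` follows all direct and indirect relationships.

```python
import json
from collections import deque

# Hypothetical JSON array as the back end might return it: one relationship per item.
RESPONSE = json.loads("""[
  {"source": "Unit A", "relation": "signs", "target": "C-001"},
  {"source": "C-001", "relation": "awarded_to", "target": "Supplier X"},
  {"source": "Supplier X", "relation": "subcontracts", "target": "Supplier Y"}
]""")


def build_adjacency_matrix(relations):
    """Index every entity and mark related pairs with 1 (undirected)."""
    names = []
    for rel in relations:
        for name in (rel["source"], rel["target"]):
            if name not in names:
                names.append(name)
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for rel in relations:
        i, j = index[rel["source"]], index[rel["target"]]
        matrix[i][j] = matrix[j][i] = 1
    return names, index, matrix


def query(relations, target, max_depth=None):
    """Breadth-first traversal from `target` using a queue.

    Returns the visited node names and the relationships between them.
    """
    names, index, matrix = build_adjacency_matrix(relations)
    if target not in index:
        return [], []
    depth = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if max_depth is not None and depth[node] >= max_depth:
            continue  # do not expand beyond the requested degree
        for j, linked in enumerate(matrix[index[node]]):
            if linked and names[j] not in depth:
                depth[names[j]] = depth[node] + 1
                queue.append(names[j])
    nodes = list(depth)  # insertion order equals visiting order
    edges = [r for r in relations if r["source"] in depth and r["target"] in depth]
    return nodes, edges


first_degree = query(RESPONSE, "C-001", max_depth=1)
all_related = query(RESPONSE, "C-001")
print(first_degree[0])
print(all_related[0])
```

An adjacency matrix costs memory proportional to the square of the number of entities, so for large models an adjacency list would be a reasonable substitute with the same traversal logic.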
7. The method for constructing an audit doubt point library based on unstructured file parsing according to claim 4, wherein:
an adjacency matrix table is first established from the JSON data returned by the back end, a breadth-first traversal is then performed using a queue, and all traversed nodes and edges are then stored in separate arrays, wherein the stored format must meet the format requirements of ECharts, all node information must be stored, and whether a relationship is stored is determined by whether the user has chosen to display all relationships.
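As an illustration of the storage format, the sketch below converts traversed nodes and edges into an option object for an ECharts `graph` series, whose `data` array holds the nodes and whose `links` array holds the edges. The sample inputs, the function name `to_echarts_option` and the `show_all_relations` flag are hypothetical. The claim does not say which relationships are dropped when the user does not display all of them; keeping only those that touch the queried entity is an assumption made here.

```python
import json

# Hypothetical traversal result: node names plus relationship records.
NODES = ["C-001", "Unit A", "Supplier X", "Supplier Y"]
EDGES = [
    {"source": "Unit A", "relation": "signs", "target": "C-001"},
    {"source": "C-001", "relation": "awarded_to", "target": "Supplier X"},
    {"source": "Supplier X", "relation": "subcontracts", "target": "Supplier Y"},
]


def to_echarts_option(nodes, edges, target, show_all_relations):
    """Build an option dict containing one ECharts `graph` series."""
    data = [{"name": name} for name in nodes]  # every node is always stored
    if show_all_relations:
        kept = edges
    else:
        # Assumption: keep only relationships that involve the queried entity.
        kept = [e for e in edges if target in (e["source"], e["target"])]
    links = [
        {"source": e["source"], "target": e["target"], "value": e["relation"]}
        for e in kept
    ]
    return {
        "series": [
            {"type": "graph", "layout": "force", "data": data, "links": links}
        ]
    }


option = to_echarts_option(NODES, EDGES, "C-001", show_all_relations=False)
print(json.dumps(option, indent=2))
```

The resulting dictionary can be serialized to JSON and passed to `setOption` on the front end.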
8. A method for constructing an audit doubt point library based on unstructured file parsing, characterized by comprising: unstructured data acquisition, structuring of the unstructured data, data analysis and output of doubt points;
in the unstructured data acquisition, the construction contracts of each unit and the corresponding bid-winning notices are downloaded in batches from a transit system;
in the structuring of the unstructured data, the downloaded construction contracts and bid-winning notices are unstructured data files;
data extraction rules are configured in an unstructured data processing and analysis tool, and regular expressions are written for extracting data from the construction contracts and bid-winning notices;
the unstructured data files are then uploaded in batches, and the tool automatically identifies and extracts the data according to the configured rules;
finally, the structured data extracted from the construction contracts and bid-winning notices is output in table form;
after the structured data has been converted successfully, a contract detail table and a bid-winning notice information table are created in the database, and the unstructured data conversion results are inserted into them;
in the data analysis, an audit model is developed in SQL, and the structured data converted from the unstructured data is analyzed jointly with the structured data in the ERP system according to audit rules.
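The flow of this claim, from regular-expression extraction through table creation to SQL analysis, can be sketched end to end. Everything specific in the sketch is an assumption: the sample texts, the field labels, the table and column names, the `erp_payment` table standing in for ERP data, and the audit rule itself (flag a project when the contracting party differs from the winning bidder, the contract amount exceeds the bid amount, or payments exceed the contract amount). An in-memory SQLite database is used only to keep the example self-contained.

```python
import re
import sqlite3

# Hypothetical text already extracted from the downloaded files.
CONTRACT_TEXT = "Project No: P-001\nContractor: Supplier X\nContract amount: 1,200,000.00"
NOTICE_TEXT = "Project No: P-001\nWinning bidder: Supplier X\nBid amount: 1,000,000.00"

# Assumed extraction rules: one regular expression per field.
CONTRACT_RULES = {
    "project_no": r"Project No:\s*(\S+)",
    "party": r"Contractor:\s*(.+)",
    "amount": r"Contract amount:\s*([\d,.]+)",
}
NOTICE_RULES = {
    "project_no": r"Project No:\s*(\S+)",
    "party": r"Winning bidder:\s*(.+)",
    "amount": r"Bid amount:\s*([\d,.]+)",
}


def extract(text, rules):
    """Apply one regular expression per field and return a structured row."""
    row = {}
    for field, pattern in rules.items():
        match = re.search(pattern, text)
        row[field] = match.group(1).strip() if match else None
    if row["amount"] is not None:
        row["amount"] = float(row["amount"].replace(",", ""))
    return row


def find_doubts(contract, notice, erp_paid):
    """Load the rows into tables and run a hypothetical SQL audit rule."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE contract_detail (project_no TEXT, party TEXT, amount REAL);
        CREATE TABLE bid_notice (project_no TEXT, party TEXT, amount REAL);
        CREATE TABLE erp_payment (project_no TEXT, paid REAL);
    """)
    insert = "INSERT INTO {} VALUES (:project_no, :party, :amount)"
    db.execute(insert.format("contract_detail"), contract)
    db.execute(insert.format("bid_notice"), notice)
    db.execute(
        "INSERT INTO erp_payment VALUES (?, ?)", (contract["project_no"], erp_paid)
    )
    rows = db.execute("""
        SELECT c.project_no, c.amount, b.amount, p.paid
        FROM contract_detail c
        JOIN bid_notice b ON b.project_no = c.project_no
        JOIN erp_payment p ON p.project_no = c.project_no
        WHERE c.party <> b.party OR c.amount > b.amount OR p.paid > c.amount
    """).fetchall()
    db.close()
    return rows


contract = extract(CONTRACT_TEXT, CONTRACT_RULES)
notice = extract(NOTICE_TEXT, NOTICE_RULES)
doubts = find_doubts(contract, notice, erp_paid=500000.0)
print(doubts)
```

Each returned row is a candidate doubt point for an auditor to review rather than a confirmed finding.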
CN202311810937.0A 2023-12-26 Audit and doubt point library construction method based on unstructured file analysis Pending CN118312491A (en)

Publications (1)

Publication Number Publication Date
CN118312491A true CN118312491A (en) 2024-07-09

Legal Events

Date Code Title Description
PB01 Publication