CN115168575A - Subject supplement method applied to audit field and related equipment - Google Patents

Subject supplement method applied to audit field and related equipment

Info

Publication number
CN115168575A
Authority
CN
China
Prior art keywords
subject
missing
audit
candidate
training
Prior art date
Legal status
Pending
Application number
CN202210743520.6A
Other languages
Chinese (zh)
Inventor
王开志
王开向
王涌
Current Assignee
Beijing Zhizhen Cloud Intelligent Technology Co., Ltd.
Original Assignee
Beijing Zhizhen Cloud Intelligent Technology Co., Ltd.
Priority date: 2022-06-27
Filing date: 2022-06-27
Publication date: 2022-10-11
Application filed by Beijing Zhizhen Cloud Intelligent Technology Co., Ltd.
Priority to CN202210743520.6A
Publication of CN115168575A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a subject completion method applied to the audit field, and related equipment. Candidate subjects in an audit text are extracted by a named entity recognition model to form a candidate subject set, and a classification algorithm model determines the category of the subject missing from a sentence whose subject is to be completed, so that a corresponding candidate subject can be selected from the candidate subject set to complete the subject. With this method, the missing subjects of triples in an audit text can be effectively completed, so that a complete audit-domain knowledge graph can be constructed.

Description

Subject supplement method applied to audit field and related equipment
Technical Field
The application relates to the technical field of knowledge graphs, and in particular to a subject completion method applied to the audit field, and related equipment.
Background
With the development of artificial intelligence, the knowledge graph has emerged as a novel form of knowledge base, and constructing knowledge graphs for individual domains is a hotspot of current research. In the audit field, missing subjects in audit reports prevent complete SPO triples from being extracted effectively, which compromises the completeness of the audit-domain knowledge graph and makes it difficult to provide useful information to practitioners in the field.
Disclosure of Invention
In view of the above, an object of the present application is to provide a subject completion method applied to the audit field, and related equipment.
A first aspect of the present application provides a subject completion method applied to the audit field, including:
obtaining an audit text, wherein the audit text comprises a sentence whose subject is to be completed;
inputting the audit text into a named entity recognition model that has undergone first pre-training, and obtaining a candidate subject set through the named entity recognition model;
predicting the sentence whose subject is to be completed by using a classification algorithm model that has undergone second pre-training, to determine the subject-missing category corresponding to the sentence;
and selecting a candidate subject from the candidate subject set according to the subject-missing category, so as to complete the sentence whose subject is to be completed.
Optionally, the first pre-training includes:
acquiring a historical audit text;
performing sentence segmentation on the historical audit text;
marking all subjects in the sentence-segmented historical audit text;
performing the first pre-training on the named entity recognition model according to the marked historical audit text;
and stopping the first pre-training when a training cutoff condition is reached.
Optionally, the second pre-training includes:
marking the missing subject corresponding to each sentence whose subject is to be completed in the sentence-segmented historical audit text;
performing the second pre-training on the classification algorithm model according to the marked sentences whose subjects are to be completed in the historical audit text;
and stopping the second pre-training when the training cutoff condition is reached.
Optionally, the candidate subject set includes a plurality of auditing units and a plurality of audited units, the subject-missing category includes a missing auditing-unit subject and a missing audited-unit subject, and selecting a candidate subject from the candidate subject set according to the subject-missing category includes:
if the subject-missing category is a missing auditing-unit subject, selecting the candidate subject from the auditing units in the candidate subject set;
and if the subject-missing category is a missing audited-unit subject, selecting the candidate subject from the audited units in the candidate subject set.
Optionally, the named entity recognition model is a BERT-CRF model.
Optionally, the classification algorithm model is a FastText model.
A second aspect of the present application provides a subject completion apparatus applied to the audit field, including:
an obtaining module configured to obtain an audit text, wherein the audit text comprises a sentence whose subject is to be completed;
a candidate subject generating module configured to input the audit text into a named entity recognition model that has undergone first pre-training, and obtain a candidate subject set through the named entity recognition model;
a missing-category determining module configured to predict the sentence whose subject is to be completed by using a classification algorithm model that has undergone second pre-training, to determine the subject-missing category corresponding to the sentence;
and a completion module configured to select a candidate subject from the candidate subject set according to the subject-missing category, so as to complete the sentence whose subject is to be completed.
Optionally, the completion module is further configured to:
if the subject-missing category is a missing auditing-unit subject, select the candidate subject from the auditing units in the candidate subject set;
and if the subject-missing category is a missing audited-unit subject, select the candidate subject from the audited units in the candidate subject set.
A third aspect of the present application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, the processor implementing the method as described above when executing the computer program.
A fourth aspect of the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to the subject completion method applied to the audit field and the related equipment provided by the application, candidate subjects in the audit text are extracted by the named entity recognition model to form a candidate subject set, and the classification algorithm model determines the category of the subject missing from a sentence whose subject is to be completed, so that a corresponding candidate subject is selected from the candidate subject set to complete the subject. With this method, the missing subjects of triples in an audit text can be effectively completed, so that a complete audit-domain knowledge graph can be constructed.
Drawings
To illustrate the technical solutions of the present application or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. The drawings described below are merely embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the subject completion method applied to the audit field according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of the first pre-training according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of the second pre-training according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the subject completion apparatus applied to the audit field according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application shall have the ordinary meaning understood by those having ordinary skill in the art to which the present application belongs. "First", "second", and similar terms used in the embodiments of the present application do not denote any order, quantity, or importance, but are merely used to distinguish one element from another. The word "comprising", "comprises", or the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. "Connected", "coupled", and similar terms are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like indicate only relative positional relationships, which may change accordingly when the absolute position of the described object changes.
As described in the background, to improve the completeness of the audit-domain knowledge graph, the subject missing from a short sentence or local passage is predicted for reference while the personnel of the auditing unit build the knowledge graph. In the present application, named entity recognition is used to identify the subjects in the audit text, and a FastText model predicts and classifies the missing subject of a sentence or local passage, thereby realizing subject prediction for audit reports during the extraction of SPO triples (S denotes the subject, P the predicate, and O the object) and improving the completeness of the extracted SPO triples. As the completeness of the SPO triples improves, so does the completeness of the related information in the knowledge graph.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
The application provides a subject completion method applied to the audit field. Referring to FIG. 1, the method includes the following steps:
Step 102: obtain an audit text, wherein the audit text comprises a sentence whose subject is to be completed.
Specifically, the audit text in this embodiment is an audit report, i.e., a written document in which a certified public accountant issues audit opinions on the validity and compliance of financial statements. Many auditing units and audited units may appear in an audit report, and some short sentences or local passages may lack a subject, preventing auditors from constructing a complete audit-domain knowledge graph from the report. A sentence whose subject is to be completed is a sentence that lacks an auditing unit or an audited unit as its subject.
Step 104: input the audit text into a named entity recognition model that has undergone first pre-training, and obtain a candidate subject set through the named entity recognition model.
Specifically, to complete a sentence with a missing subject, all subjects appearing in the audit text need to be identified, because the missing subject usually appears elsewhere in the same text. The subjects in the audit text are identified by means of a named entity recognition model: after pre-training, the model can output the subjects found in the audit text, which are merged to form the candidate subject set.
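To make step 104 concrete, the following is a minimal sketch of assembling the candidate subject set from a trained NER model. It is an illustration under assumptions, not the patent's implementation: the predict_entities method and the "AUDITOR"/"AUDITEE" type labels are hypothetical names.

```python
# Sketch only: assumes a trained NER model exposing predict_entities(),
# returning (entity_text, entity_type) pairs; the method name and the
# "AUDITOR"/"AUDITEE" labels are hypothetical, not from the patent.
def build_candidate_subject_set(sentences, ner_model):
    """Collect every subject the NER model finds in the audit text and
    de-duplicate it into per-type subsets."""
    candidates = {"AUDITOR": set(), "AUDITEE": set()}
    for sentence in sentences:
        for entity_text, entity_type in ner_model.predict_entities(sentence):
            if entity_type in candidates:
                candidates[entity_type].add(entity_text)
    return candidates
```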
Step 106: predict the sentence whose subject is to be completed by using a classification algorithm model that has undergone second pre-training, to determine the subject-missing category corresponding to the sentence.
Specifically, different sentences may be missing different kinds of subjects: the missing subject may be an auditing unit or an audited unit. The category of the missing subject must therefore be confirmed for each sentence before it is completed. A classification algorithm model can reliably distinguish and predict the category of a sentence's missing subject, so each sentence whose subject is to be completed is input into the classification algorithm model, which outputs the category of the missing subject.
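As one way to realize this step with the FastText library (the classification algorithm model named in this application), the sketch below loads a trained classifier and maps a sentence to its subject-missing category. The model path and the __label__missing_auditor / __label__missing_auditee label names are illustrative assumptions.

```python
import fasttext

# Sketch only: the model file and label names are assumptions.
clf = fasttext.load_model("subject_missing_clf.bin")

def predict_missing_category(sentence):
    # fasttext predict() rejects newlines, so normalize the input first.
    labels, probabilities = clf.predict(sentence.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probabilities[0])
```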
Step 108: select a candidate subject from the candidate subject set according to the subject-missing category, so as to complete the sentence whose subject is to be completed.
Specifically, the candidate subject set obtained in step 104 includes all candidate subjects in the audit text; the sentence is completed by finding, in the candidate subject set, the candidate subject associated with the subject-missing category predicted by the classification algorithm model.
Through steps 102 to 108, the named entity recognition model and the classification algorithm model are introduced to complete audit texts with missing subjects, which effectively improves the completeness of triple extraction from audit texts and provides strong support for further constructing an audit-domain knowledge graph.
In some embodiments, the first pre-training, referring to FIG. 2, includes the following steps:
Step 202: obtain a historical audit text;
Step 204: perform sentence segmentation on the historical audit text;
Step 206: mark all subjects in the sentence-segmented historical audit text;
Step 208: perform the first pre-training on the named entity recognition model according to the marked historical audit text;
Step 210: stop the first pre-training when a training cutoff condition is reached.
Specifically, the named entity recognition model is pre-trained on a training set built from historical audit texts. The historical audit text is first split into sentences, for example with the jieba toolkit. The subjects in the segmented text are then marked: the subject and subject type appearing in each sentence are labeled to form annotated samples. The labeled audit text is divided into a training set and a test set; the named entity recognition model is trained on the training set, its parameters are adjusted, and it is then evaluated on the test set to judge whether it meets the pre-training standard. It should be noted that the sentences of the labeled audit text may also be divided into several pairs of training and test sets, and the named entity recognition model may be trained over multiple rounds, one pair per round, to further improve its accuracy. The training cutoff condition may be that the model's loss function converges, or that a predetermined number of training rounds has been reached, and so on. Stopping the first pre-training once the cutoff condition is reached prevents the model from overfitting.
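A minimal sketch of this data preparation follows: splitting the historical audit text into sentences and emitting character-level BIO tags for the marked subjects. The punctuation-based split rule and the BIO tagging scheme are common choices assumed here; the patent does not fix either.

```python
import re

def split_sentences(text):
    # Split on Chinese sentence-ending punctuation; keep non-empty parts.
    return [s.strip() for s in re.split(r"[。！？；]", text) if s.strip()]

def bio_tags(sentence, marked_subjects):
    """marked_subjects: (surface_string, subject_type) pairs annotated
    for this sentence, e.g. ("XX审计局", "AUDITOR") -- invented example."""
    tags = ["O"] * len(sentence)
    for surface, subject_type in marked_subjects:
        start = sentence.find(surface)
        if start == -1:
            continue  # subject not present in this sentence
        tags[start] = f"B-{subject_type}"
        for i in range(start + 1, start + len(surface)):
            tags[i] = f"I-{subject_type}"
    return list(zip(sentence, tags))  # (character, tag) training pairs
```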
In some embodiments, the second pre-training, referring to FIG. 3, includes the following steps:
Step 302: mark the missing subject corresponding to each sentence whose subject is to be completed in the sentence-segmented historical audit text;
Step 304: perform the second pre-training on the classification algorithm model according to the marked sentences whose subjects are to be completed in the historical audit text;
Step 306: stop the second pre-training when the training cutoff condition is reached.
Specifically, for the second pre-training of the classification algorithm model, the labeled historical audit text from the first pre-training serves as the training data. The subject-missing category of each sentence in the segmented historical audit text is marked to form annotated samples. During the second pre-training, the classification algorithm model learns the vector features of the annotated samples, so that it can accurately judge the category of a sentence's missing subject. As with the first pre-training, the labeled audit text is divided into a training set and a test set; the classification algorithm model is trained on the training set, its parameters are adjusted, and it is then evaluated on the test set to judge whether it meets the pre-training standard. The sentences of the labeled audit text may also be divided into several pairs of training and test sets for multiple rounds of training, one pair per round, to further improve accuracy. The training cutoff condition may be that the model's loss function converges, or that a predetermined number of training rounds has been reached, and so on. Stopping the second pre-training once the cutoff condition is reached prevents the model from overfitting.
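The sketch below shows the second pre-training with the FastText library's standard supervised format (one `__label__<category> <tokens>` line per sample). The toy samples, label names, and hyperparameters are illustrative assumptions; the jieba word segmentation only reflects that FastText expects space-separated tokens for Chinese text.

```python
import fasttext
import jieba

# Toy annotated samples: (sentence with missing subject, missing category).
# Sentences and category names are invented for illustration.
samples = [
    ("发现存在违规支出问题", "missing_auditor"),
    ("未按规定披露关联交易", "missing_auditee"),
]

with open("clf_train.txt", "w", encoding="utf-8") as f:
    for sentence, category in samples:
        tokens = " ".join(jieba.cut(sentence))  # space-separate Chinese words
        f.write(f"__label__{category} {tokens}\n")

clf = fasttext.train_supervised(
    input="clf_train.txt",
    epoch=25, lr=0.5,
    wordNgrams=2,  # n-gram features, matching the model description below
)
clf.save_model("subject_missing_clf.bin")
```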
In some embodiments, selecting the candidate subject from the candidate subject set according to the subject-missing category includes:
if the subject-missing category is a missing auditing-unit subject, selecting the candidate subject from the auditing units in the candidate subject set;
and if the subject-missing category is a missing audited-unit subject, selecting the candidate subject from the audited units in the candidate subject set.
Specifically, an audit report may involve one or more auditing units and one or more audited units. The candidate subject set accordingly comprises a candidate auditing-unit subset and a candidate audited-unit subset, and the subject-missing category is likewise defined as either a missing auditing-unit subject or a missing audited-unit subject. It can be understood that when the classification algorithm model determines the category to be a missing auditing-unit subject, the corresponding auditing-unit subject is selected from the candidate auditing-unit subset to complete the sentence. When the auditing-unit subset contains several auditing units, the context must be taken into account to choose the appropriate one as the candidate subject; when it contains only one auditing unit, that unit is the final candidate subject. Similarly, when the category is determined to be a missing audited-unit subject, the corresponding audited-unit subject is selected from the candidate audited-unit subset. When the audited-unit subset contains several audited units, the context must be taken into account to choose the appropriate one; when it contains only one, that unit is the final candidate subject.
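A sketch of this selection logic follows, reusing the candidate structure from the step-104 sketch above. The patent leaves the multi-candidate "context judgment" unspecified; the nearest-preceding-mention heuristic here is one plausible assumption, not the claimed method.

```python
def select_candidate(missing_category, candidates, preceding_text):
    """candidates: {"AUDITOR": set(...), "AUDITEE": set(...)} as built above;
    preceding_text: the report text before the incomplete sentence."""
    key = "AUDITOR" if missing_category == "missing_auditor" else "AUDITEE"
    subset = candidates[key]
    if len(subset) == 1:
        return next(iter(subset))  # single candidate: use it directly
    # Several candidates: assumed heuristic -- prefer the subject mentioned
    # closest before the sentence with the missing subject.
    best, best_position = None, -1
    for subject in subset:
        position = preceding_text.rfind(subject)
        if position > best_position:
            best, best_position = subject, position
    return best
```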
Therefore, by this method, the candidate subject can be accurately selected to complete sentences whose subjects are missing, providing a reference for auditors and improving the completeness of triple extraction from the audit text.
In some embodiments, the named entity recognition model is a BERT-CRF model.
Specifically, named entity recognition is one of the active research directions in natural language processing; it aims to recognize named entities in text and classify them into corresponding entity types. In this embodiment, the named entity recognition model is a BERT-CRF model, which trains faster and achieves higher accuracy than comparable models and can complete sequence labeling tasks with high quality.
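For concreteness, here is a compact BERT-CRF sketch consistent with the model named in this embodiment: a BERT encoder feeds a linear emission layer, and a CRF layer scores tag sequences (training returns the negative log-likelihood; inference returns the decoded tag paths). The checkpoint name and tag count are illustrative assumptions; this uses the transformers and pytorch-crf packages.

```python
import torch
from torch import nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertCrfNer(nn.Module):
    """Sketch of a BERT-CRF sequence labeler; hyperparameters are assumptions."""
    def __init__(self, num_tags, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.emission = nn.Linear(self.bert.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi-decoded best tag path per sequence.
        return self.crf.decode(emissions, mask=mask)
```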
In some embodiments, the classification algorithm model is a FastText model. FastText is a text classification tool proposed by Facebook; its advantage is that shallow training can reach accuracy close to that of some deep networks, while training very quickly. The FastText model has a very simple structure with three layers: an input layer, a hidden layer, and an output layer. The input layer embeds the document to obtain embedding vectors that include n-gram features; the hidden layer sums and averages the input vectors; and the output layer outputs the document's label. The FastText model can accurately judge the subject-missing category of a sentence whose subject is to be completed, ensuring the quality of triple completion.
It should be noted that the method of the embodiments of the present application may be executed by a single device, such as a computer or a server. The method of the embodiments may also be applied in a distributed scenario, where it is completed by multiple devices cooperating with one another. In such a distributed scenario, one of the multiple devices may perform only one or more steps of the method, and the devices interact to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The application further provides a subject completion apparatus applied to the audit field.
Referring to FIG. 4, the subject completion apparatus includes:
an obtaining module 402, configured to obtain an audit text, wherein the audit text includes a sentence whose subject is to be completed;
a candidate subject generating module 404, configured to input the audit text into a named entity recognition model that has undergone first pre-training, and obtain a candidate subject set through the named entity recognition model;
a missing-category determining module 406, configured to predict the sentence whose subject is to be completed by using a classification algorithm model that has undergone second pre-training, to determine the subject-missing category corresponding to the sentence;
and a completion module 408, configured to select a candidate subject from the candidate subject set according to the subject-missing category, so as to complete the sentence whose subject is to be completed.
In some embodiments, the apparatus further includes a first pre-training module 410, configured to:
obtain a historical audit text;
perform sentence segmentation on the historical audit text;
mark all subjects in the sentence-segmented historical audit text;
perform the first pre-training on the named entity recognition model according to the marked historical audit text;
and stop the first pre-training when a training cutoff condition is reached.
In some embodiments, the apparatus further includes a second pre-training module 412, configured to:
mark the missing subject corresponding to each sentence whose subject is to be completed in the sentence-segmented historical audit text;
perform the second pre-training on the classification algorithm model according to the marked sentences whose subjects are to be completed in the historical audit text;
and stop the second pre-training when the training cutoff condition is reached.
In some embodiments, the candidate subject set includes a plurality of auditing units and a plurality of audited units, and the subject-missing category includes a missing auditing-unit subject and a missing audited-unit subject; the completion module 408 is specifically configured to:
if the subject-missing category is a missing auditing-unit subject, select the candidate subject from the auditing units in the candidate subject set;
and if the subject-missing category is a missing audited-unit subject, select the candidate subject from the audited units in the candidate subject set.
In some embodiments, the named entity recognition model is a BERT-CRF model.
In some embodiments, the classification algorithm model is a FastText model.
For convenience of description, the above apparatus is described as being divided into modules by function. Of course, when the present application is implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware.
The apparatus of the above embodiments is used to implement the corresponding subject completion method applied to the audit field in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
The application further provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the subject completion method applied to the audit field according to any of the above embodiments.
Fig. 5 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static Memory device, a dynamic Memory device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
The bus 1050 includes a path to transfer information between various components of the device, such as the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding subject completion method applied to the audit field in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
The present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the subject completion method applied to the audit field according to any one of the above embodiments.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to cause the computer to execute the subject completion method applied to the audit field according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art will understand: the discussion of any embodiment above is merely exemplary, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the concept of the present application, features of the above embodiments or of different embodiments may also be combined, steps may be implemented in any order, and many other variations of the different aspects of the embodiments of the present application exist as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A subject completion method applied to the audit field, characterized by comprising:
obtaining an audit text, wherein the audit text comprises a sentence whose subject is to be completed;
inputting the audit text into a named entity recognition model that has undergone first pre-training, and obtaining a candidate subject set through the named entity recognition model;
predicting the sentence whose subject is to be completed by using a classification algorithm model that has undergone second pre-training, to determine the subject-missing category corresponding to the sentence;
and selecting a candidate subject from the candidate subject set according to the subject-missing category, so as to complete the sentence whose subject is to be completed.
2. The method of claim 1, wherein the first pre-training comprises:
acquiring a historical audit text;
performing sentence segmentation on the historical audit text;
marking all subjects in the sentence-segmented historical audit text;
performing the first pre-training on the named entity recognition model according to the marked historical audit text;
and stopping the first pre-training when a training cutoff condition is reached.
3. The method of claim 2, wherein the second pre-training comprises:
marking the missing subject corresponding to each sentence whose subject is to be completed in the sentence-segmented historical audit text;
performing the second pre-training on the classification algorithm model according to the marked sentences whose subjects are to be completed in the historical audit text;
and stopping the second pre-training when the training cutoff condition is reached.
4. The method of claim 1, wherein the candidate subject set includes a plurality of auditing units and a plurality of audited units, the subject-missing category includes a missing auditing-unit subject and a missing audited-unit subject, and selecting a candidate subject from the candidate subject set according to the subject-missing category comprises:
if the subject-missing category is a missing auditing-unit subject, selecting the candidate subject from the auditing units in the candidate subject set;
and if the subject-missing category is a missing audited-unit subject, selecting the candidate subject from the audited units in the candidate subject set.
5. The method of claim 1, wherein the named entity recognition model is a BERT-CRF model.
6. The method of claim 1, wherein the classification algorithm model is a FastText model.
7. A subject completion apparatus applied to the audit field, characterized by comprising:
an obtaining module configured to obtain an audit text, wherein the audit text comprises a sentence whose subject is to be completed;
a candidate subject generating module configured to input the audit text into a named entity recognition model that has undergone first pre-training, and obtain a candidate subject set through the named entity recognition model;
a missing-category determining module configured to predict the sentence whose subject is to be completed by using a classification algorithm model that has undergone second pre-training, to determine the subject-missing category corresponding to the sentence;
and a completion module configured to select a candidate subject from the candidate subject set according to the subject-missing category, so as to complete the sentence whose subject is to be completed.
8. The apparatus of claim 7, wherein the completion module is further configured to:
if the subject-missing category is a missing auditing-unit subject, select the candidate subject from the auditing units in the candidate subject set;
and if the subject-missing category is a missing audited-unit subject, select the candidate subject from the audited units in the candidate subject set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
CN202210743520.6A 2022-06-27 2022-06-27 Subject supplement method applied to audit field and related equipment Pending CN115168575A (en)

Priority Applications (1)

Application Number: CN202210743520.6A · Priority Date: 2022-06-27 · Filing Date: 2022-06-27 · Title: Subject supplement method applied to audit field and related equipment

Applications Claiming Priority (1)

Application Number: CN202210743520.6A · Priority Date: 2022-06-27 · Filing Date: 2022-06-27 · Title: Subject supplement method applied to audit field and related equipment

Publications (1)

Publication Number: CN115168575A · Publication Date: 2022-10-11

Family

ID=83488105

Family Applications (1)

Application Number: CN202210743520.6A (CN115168575A, pending) · Priority Date: 2022-06-27 · Filing Date: 2022-06-27 · Title: Subject supplement method applied to audit field and related equipment

Country Status (1)

CN: CN115168575A

Cited By (1)

* Cited by examiner, † Cited by third party
CN117874240A * · Priority Date: 2024-03-12 · Publication Date: 2024-04-12 · Assignee: 天津电力工程监理有限公司 · Title: Audit text classification method, system and equipment based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination