CN113536787A

CN113536787A - Method and equipment for establishing audit professional lexicon

Info

Publication number: CN113536787A
Application number: CN202110797261.0A
Authority: CN
Inventors: 王秋琳; 郑略省; 吕世雷; 张萍; 庄莉; 梁懿
Original assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2021-10-22

Abstract

The invention relates to a method for establishing an audit professional lexicon, comprising the following steps: obtaining audit related documents; preprocessing the audit related documents; performing word segmentation on the preprocessed audit related documents according to a non-audit professional lexicon and removing stop words to obtain a plurality of independent words; splicing the independent words through a 2-gram word segmentation algorithm to obtain a plurality of spliced words; calculating the word frequency and the degree of freedom of each spliced word; presetting a first threshold and a second threshold; and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold into an audit professional lexicon as new words.

Description

Method and equipment for establishing audit professional lexicon

Technical Field

The invention relates to a method for establishing an audit professional lexicon, belonging to the field of natural language processing.

Background

The extraction algorithms of the domain words are roughly divided into the following three categories:

(1) Rule-based extraction methods: rules are established according to the internal composition structure of words, their external context relations and the like, and domain words are extracted by pattern matching.

(2) Statistics-based methods: these rely on word frequency, likelihood ratio, hypothesis testing, mutual information and the like; their recognition performance for individual domain words and low-frequency domain words is not ideal.

(3) Methods combining statistics and rules: in view of the shortcomings of the two approaches above, these methods combine the strengths of both. They fall into three categories: using rules as a filtering step within a statistical method; blending specific rules into a statistical method; and performing statistics on the "rule" information of the context. Many statistical algorithms use a filter rule template to discard unqualified term combinations, and practice has shown this approach to be simple and feasible.

At present, no general and comprehensive professional lexicon exists for the audit field.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a technical scheme as follows:

The first technical scheme is as follows:

A method for establishing an audit professional lexicon comprises the following steps:

S1, obtaining audit related documents;

S2, preprocessing the audit related documents;

S3, performing word segmentation on the preprocessed audit related documents according to a non-audit professional lexicon, and removing stop words to obtain a plurality of independent words;

S4, splicing the independent words through a 2-gram word segmentation algorithm to obtain a plurality of spliced words;

S5, calculating the word frequency and the degree of freedom of each spliced word;

S6, presetting a first threshold and a second threshold, and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold into an audit professional lexicon as new words.

Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review into the audit professional lexicon.

Further, step S6 also includes storing the spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold into the non-audit professional lexicon as stop words.

Further, the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.

Further, in step S5, the degree of freedom of a spliced word is calculated as follows:

presetting a third threshold, and calculating the internal mutual information of each spliced word;

calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold, as follows:

presetting a fourth threshold;

calculating the left neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left neighbor information entropy does not exceed the fourth threshold, expanding leftward and calculating the next left neighbor information entropy, until the leftmost boundary is reached or a left neighbor information entropy exceeds the fourth threshold; and recording the left neighbor information entropy that exceeds the fourth threshold as a first scale value;

calculating the right neighbor information entropy of every spliced word for which a first scale value was recorded; if the right neighbor information entropy does not exceed the fourth threshold, expanding rightward and calculating the next right neighbor information entropy, until the rightmost boundary is reached or a right neighbor information entropy exceeds the fourth threshold; and recording the right neighbor information entropy that exceeds the fourth threshold as a second scale value;

and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
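The two quantities used above can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: it computes the pointwise mutual information of a two-word splice and the entropy of its immediate left and right neighbors, then takes the smaller entropy as the degree of freedom. The iterative leftward and rightward expansion and the comparison against the third and fourth thresholds are omitted, and the function names (`pmi`, `entropy`, `degree_of_freedom`) and the base-2 logarithm are assumptions made for the example.

```python
import math
from collections import Counter


def pmi(pair, words):
    """Pointwise mutual information (base 2) of an adjacent word pair."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    p_xy = bigrams[pair] / sum(bigrams.values())
    p_x = unigrams[pair[0]] / len(words)
    p_y = unigrams[pair[1]] / len(words)
    return math.log2(p_xy / (p_x * p_y))


def entropy(neighbor_counts):
    """Shannon entropy (base 2) of a neighbor-word distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())


def degree_of_freedom(pair, words):
    """Smaller of the left and right neighbor entropies of the pair.

    Simplification: only the immediate neighbors are considered; the
    expansion loop described in the text is not implemented here.
    """
    left, right = Counter(), Counter()
    for i in range(len(words) - 1):
        if (words[i], words[i + 1]) == pair:
            if i > 0:
                left[words[i - 1]] += 1
            if i + 2 < len(words):
                right[words[i + 2]] += 1
    return min(entropy(left), entropy(right))


# Hypothetical token stream: the pair ("x", "y") occurs twice,
# each time with a different left and right neighbor.
words = ["a", "x", "y", "b", "c", "x", "y", "d"]
score = degree_of_freedom(("x", "y"), words)
```

A splice that always appears between the same two neighbors gets an entropy of zero on that side, which suggests it is a fragment of a longer word rather than a word in its own right.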

The second technical scheme is as follows:

An audit professional lexicon establishing device comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:

S1, obtaining audit related documents;

S2, preprocessing the audit related documents;

presetting a fourth threshold;

The invention has the following beneficial effects:

1. The invention applies the 2-gram technique to splice word segmentation fragments, and discovers new domain words in text data through word frequency statistics and degree-of-freedom calculation on the spliced words.

2. According to the invention, only a small number of new words need to be screened manually, so that the workload of extracting professional words in the audit field from the document by pure manual work is reduced to a great extent, and the construction efficiency of the audit professional word bank is improved.

3. The invention initially uses an existing non-audit professional lexicon to filter stop words, and after a certain number of documents have been processed it accumulates a large number of non-new words (namely spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold). Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so that new words are found efficiently.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and the specific embodiments.

Example one

S1, obtaining audit related documents, including audit records, audit manuscripts, audit reports, audit correction reports and the like.

S2, preprocessing the audit related document;

S3, performing word segmentation on the preprocessed audit related documents according to a non-audit professional lexicon (the jieba lexicon tool) and removing stop words to obtain a plurality of independent words;
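Step S3 can be illustrated with a dependency-free Python sketch. The embodiment names the jieba tool; the code below does not use jieba and instead substitutes a plain forward-maximum-matching segmenter over a small lexicon, which is a different and much simpler algorithm. The lexicon, the stop word list and the sample sentence are all hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    lexicon entry, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def remove_stop_words(words, stop_words):
    """Drop every word that appears in the stop word set."""
    return [w for w in words if w not in stop_words]


# Hypothetical general-purpose (non-audit) lexicon and stop words.
lexicon = {"审计", "报告"}
stop_words = {"的"}

segmented = segment("任中审计的报告", lexicon)
independent_words = remove_stop_words(segmented, stop_words)
```

Because the general lexicon does not contain the domain term, the term is split into fragments; those fragments are what the later splicing step tries to reassemble.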

S4, splicing the independent words through a 2-gram word segmentation algorithm to obtain a plurality of spliced words. For example, word segmentation yields "ren", "zhong" and "audit", and the 2-gram word segmentation algorithm produces the spliced words "ren zhong", "zhong audit" and "ren zhong audit".
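The splicing step can be sketched in Python as below. The function name `splice` and its `max_n` and `sep` parameters are assumptions for illustration. With the default `max_n=2` only adjacent pairs are joined, which is the strict 2-gram case; the three-word splice listed in the example corresponds to `max_n=3` in this sketch.

```python
def splice(words, max_n=2, sep=" "):
    """Join every run of 2..max_n adjacent words into one candidate."""
    spliced = []
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            spliced.append(sep.join(words[i:i + n]))
    return spliced


words = ["ren", "zhong", "audit"]
pairs_only = splice(words)            # adjacent pairs
with_triples = splice(words, max_n=3)  # pairs plus the full three-word run
```

For Chinese text the separator would normally be the empty string, so that fragments are rejoined without spaces.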

S5, calculating the word frequency and the degree of freedom of each spliced word, where word frequency = (total number of occurrences of the spliced word in the document) / (total number of words in the document).
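The word frequency definition above can be written out as a short Python sketch. The function name `spliced_word_frequency` is an assumption, and the sketch covers two-word splices only.

```python
from collections import Counter


def spliced_word_frequency(words):
    """Occurrences of each adjacent pair divided by the total word count."""
    total = len(words)
    if total < 2:
        return {}
    counts = Counter(zip(words, words[1:]))
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical document of five words after segmentation.
words = ["internal", "audit", "report", "internal", "audit"]
freq = spliced_word_frequency(words)
```

Here the pair ("internal", "audit") occurs twice in a five-word document, so its word frequency is 2 / 5.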

S6, presetting a first threshold and a second threshold (in this embodiment, the first threshold is and the second threshold is); and storing the spliced words with the word frequency exceeding a first threshold and the degree of freedom exceeding a second threshold as new words into an audit professional word bank.
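The selection rule in S6 can be sketched as a simple two-way split. The threshold values used below are hypothetical placeholders chosen only to make the example run; the function name `select_new_words` and the statistics are likewise invented for illustration.

```python
def select_new_words(stats, freq_threshold, dof_threshold):
    """Split candidates into new words and rejected words.

    stats maps each spliced word to a (word_frequency, degree_of_freedom)
    pair. A candidate is accepted only if BOTH values exceed their
    thresholds.
    """
    new_words, rejected = [], []
    for word, (freq, dof) in stats.items():
        if freq > freq_threshold and dof > dof_threshold:
            new_words.append(word)
        else:
            rejected.append(word)
    return new_words, rejected


# Hypothetical statistics and thresholds.
stats = {
    "ren zhong audit": (0.02, 1.5),  # frequent and free: accepted
    "zhong audit": (0.02, 0.3),      # frequent but low freedom: rejected
    "audit said": (0.001, 2.0),      # free but rare: rejected
}
new_words, rejected = select_new_words(stats, freq_threshold=0.01,
                                       dof_threshold=1.0)
```

Returning the rejected candidates alongside the accepted ones makes it straightforward to pass them on to a stop word list afterwards.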

The advantage of this embodiment is that the 2-gram technique is applied to splice word segmentation fragments, and new domain words in the text data are discovered through word frequency statistics and degree-of-freedom calculation on the spliced words.

Example two

The advantage of this embodiment is that only a small number of new words need to be screened manually, which greatly reduces the workload of extracting audit domain vocabulary from documents purely by hand and improves the efficiency of building the audit professional lexicon.

Example three

The advantage of this embodiment is that an existing non-audit professional lexicon is used initially to filter stop words, and after a certain number of documents have been processed a large number of non-new words (namely spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold) are accumulated. Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so that new words are found efficiently.
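The feedback loop described here can be sketched in Python as follows. The set-based lexicon and the function names `record_rejected` and `drop_known_non_new_words` are assumptions for illustration; the source does not specify how the non-audit professional lexicon is stored.

```python
def record_rejected(non_audit_lexicon, rejected):
    """Add candidates that failed the thresholds to the non-audit lexicon."""
    non_audit_lexicon.update(rejected)
    return non_audit_lexicon


def drop_known_non_new_words(candidates, non_audit_lexicon):
    """Skip candidates already known to be non-new, before any scoring."""
    return [c for c in candidates if c not in non_audit_lexicon]


# Hypothetical state after a first batch of documents.
non_audit_lexicon = {"the", "of"}
record_rejected(non_audit_lexicon, ["report said"])

# In a later batch, the known non-new word is filtered out up front,
# so no frequency or degree-of-freedom calculation is spent on it.
second_batch = ["internal audit", "report said"]
to_score = drop_known_non_new_words(second_batch, non_audit_lexicon)
```

The saving grows with the number of processed documents, since each rejected splice is scored at most once.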

Example four

Further, a third threshold is preset (in this embodiment, the third threshold is); calculating mutual information inside the spliced word;

and calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold:

presetting a fourth threshold (in the embodiment, the fourth threshold is);

Example five

An audit professional thesaurus establishing device comprising a memory and a processor, the memory storing instructions adapted to be loaded by the processor and to perform the steps of:

S2, preprocessing the audit related document;

Example six

Example seven

Example eight

presetting a fourth threshold (in the embodiment, the fourth threshold is);

The above description is only an embodiment of the present invention and is not intended to limit its scope; all equivalent structures or equivalent process modifications made using the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims

1. A method for establishing an audit professional lexicon is characterized by comprising the following steps:

S1, obtaining audit related documents;

S2, preprocessing the audit related documents;

2. The method for establishing an audit professional lexicon according to claim 1, wherein step S6 further comprises manually reviewing the new words, and storing the new words that pass the review into the audit professional lexicon.

3. The method for establishing an audit professional lexicon according to claim 1, wherein step S6 further comprises storing the spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold into the non-audit professional lexicon as stop words.

4. The method for establishing an audit professional lexicon according to claim 3, wherein the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.

5. The method for establishing an audit professional lexicon according to claim 4, wherein in step S5 the degree of freedom of a spliced word is calculated as follows:

presetting a fourth threshold;

6. An audit professional thesaurus establishing device, comprising a memory and a processor, wherein the memory stores instructions adapted to be loaded by the processor and to perform a method of establishing an audit professional thesaurus as claimed in any one of claims 1 to 5.