CN113536787A - Method and equipment for establishing audit professional lexicon - Google Patents
- Publication number
- CN113536787A
- Authority
- CN
- China
- Prior art keywords
- threshold
- audit
- word
- words
- spliced
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a method for establishing an audit professional lexicon, comprising the following steps: obtaining audit-related documents; preprocessing the audit-related documents; segmenting the preprocessed documents into words according to a non-audit professional lexicon and removing stop words to obtain a plurality of independent words; splicing the independent words with a 2-gram word segmentation algorithm to obtain a plurality of spliced words; calculating the word frequency and the degree of freedom of each spliced word; presetting a first threshold and a second threshold; and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as new words.
Description
Technical Field
The invention relates to a method for establishing an audit professional lexicon, belonging to the field of natural language processing.
Background
Extraction algorithms for domain words fall roughly into three categories:
(1) Rule-based methods: rules are built from the internal composition of words, their external context and similar features, and domain words are extracted by pattern matching.
(2) Statistical methods: these rely on word frequency, likelihood ratio, hypothesis testing, mutual information and the like; their recognition of individual domain words and of low-frequency domain words is not ideal.
(3) Methods combining statistics and rules: given the shortcomings of the two approaches above, these methods combine the strengths of both. They fall into three kinds: using rules as a filtering step within a statistical method; blending specific rules into a statistical method; and computing statistics over the "rule" information of the context. Many statistical algorithms use a filter-rule template to discard unqualified word combinations, and practice has shown this approach to be simple and feasible.
At present, there is no general and comprehensive professional lexicon for the audit field.
Disclosure of Invention
To solve the problems in the prior art, the invention provides the following technical schemes.
The first technical scheme is as follows:
A method for establishing an audit professional lexicon comprises the following steps:
S1, obtaining audit-related documents;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon, and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words with a 2-gram word segmentation algorithm to obtain spliced words;
S5, calculating the word frequency and the degree of freedom of each spliced word;
S6, presetting a first threshold and a second threshold; and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as new words.
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
Further, step S6 also includes storing the spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as stop words.
Further, the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.
Further, in step S5, the degree of freedom of a spliced word is calculated as follows:
presetting a third threshold; calculating the mutual information inside each spliced word;
calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold;
calculating the left neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left neighbor information entropy does not exceed the fourth threshold, expanding leftward and calculating the next left neighbor information entropy, until the leftmost boundary is reached or a left neighbor information entropy exceeds the fourth threshold; and recording the left neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right neighbor information entropy of all spliced words having a first scale value; if the right neighbor information entropy does not exceed the fourth threshold, expanding rightward and calculating the next right neighbor information entropy, until the rightmost boundary is reached or a right neighbor information entropy exceeds the fourth threshold; and recording the right neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
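The text names mutual information and neighbor information entropy without giving formulas. One common formulation is written out below as an assumption, not as the definition used by the invention. Here a spliced word w is formed from adjacent words x and y, L(w) and R(w) are the sets of words seen immediately to the left and right of w, and s_1 and s_2 are the first and second scale values. All symbols are introduced here for illustration only.

```latex
\mathrm{MI}(x, y) = \log_2 \frac{p(xy)}{p(x)\,p(y)}

H_L(w) = -\sum_{a \in L(w)} p(a \mid w)\,\log_2 p(a \mid w),
\qquad
H_R(w) = -\sum_{b \in R(w)} p(b \mid w)\,\log_2 p(b \mid w)

\mathrm{DOF}(w) = \min(s_1, s_2)
```

Under this reading, a high MI indicates that x and y occur together more often than chance would predict, and a high neighbor entropy indicates that w appears in many different contexts, which suggests that w stands as a word in its own right.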
The second technical scheme is as follows:
A device for establishing an audit professional lexicon, comprising a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
S1, obtaining audit-related documents;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon, and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words with a 2-gram word segmentation algorithm to obtain spliced words;
S5, calculating the word frequency and the degree of freedom of each spliced word;
S6, presetting a first threshold and a second threshold; and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as new words.
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
Further, step S6 also includes storing the spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as stop words.
Further, the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.
Further, in step S5, the degree of freedom of a spliced word is calculated as follows:
presetting a third threshold; calculating the mutual information inside each spliced word;
calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold;
calculating the left neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left neighbor information entropy does not exceed the fourth threshold, expanding leftward and calculating the next left neighbor information entropy, until the leftmost boundary is reached or a left neighbor information entropy exceeds the fourth threshold; and recording the left neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right neighbor information entropy of all spliced words having a first scale value; if the right neighbor information entropy does not exceed the fourth threshold, expanding rightward and calculating the next right neighbor information entropy, until the rightmost boundary is reached or a right neighbor information entropy exceeds the fourth threshold; and recording the right neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
The invention has the following beneficial effects:
1. The invention uses the 2-gram technique to splice word segmentation fragments into words, and discovers new domain words in text data through word frequency statistics and degree-of-freedom calculation on the spliced words.
2. Only a small number of new words need to be screened manually, which greatly reduces the workload of extracting audit-domain professional words from documents by purely manual work and improves the efficiency of building the audit professional lexicon.
3. The invention first uses an existing non-audit professional lexicon to filter stop words; after a certain number of documents have been processed, a large number of non-new words (that is, spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold) are collected. Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so new words are found efficiently.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Example one
A method for establishing an audit professional lexicon comprises the following steps:
S1, obtaining audit-related documents, including audit records, audit working papers, audit reports, audit rectification reports and the like;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon (the jieba lexicon tool), and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words with a 2-gram word segmentation algorithm to obtain a plurality of spliced words. For example, if word segmentation yields 'ren', 'zhong' and 'audit', the 2-gram word segmentation algorithm produces the spliced words 'ren zhong', 'zhong audit' and 'ren zhong audit'.
S5, calculating the word frequency and the degree of freedom of each spliced word. The word frequency is the number of times the spliced word appears in the document divided by the total number of words in the document.
S6, presetting a first threshold and a second threshold (in this embodiment, the first threshold is and the second threshold is); and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as new words.
The advantage of this embodiment is that the 2-gram technique splices word segmentation fragments into words, and new domain words in text data are discovered through word frequency statistics and degree-of-freedom calculation on the spliced words.
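Steps S4 to S6 can be sketched in Python as follows. This is a minimal illustration under several assumptions rather than the implementation of the invention: the input is taken to be already segmented with stop words removed, the function names and the `sep` parameter are hypothetical, the threshold values are invented because the text does not state them, and the degree-of-freedom scores are supplied by hand instead of being computed.

```python
from collections import Counter


def splice_bigrams(words, sep=""):
    """Join every pair of adjacent words into one spliced candidate."""
    return [words[i] + sep + words[i + 1] for i in range(len(words) - 1)]


def word_frequencies(words, sep=""):
    """Occurrences of each spliced word divided by the total word count."""
    total = len(words)
    counts = Counter(splice_bigrams(words, sep))
    return {word: count / total for word, count in counts.items()}


def select_new_words(freqs, freedom, freq_threshold, freedom_threshold):
    """Keep spliced words that exceed both thresholds (step S6)."""
    return sorted(
        word
        for word, freq in freqs.items()
        if freq > freq_threshold and freedom.get(word, 0.0) > freedom_threshold
    )


# Hypothetical, already segmented input with stop words removed.
words = ["internal", "control", "defect", "internal", "control", "review"]
freqs = word_frequencies(words, sep=" ")

# Degree-of-freedom values are made up here purely for illustration.
freedom = {"internal control": 1.0, "control defect": 0.2}

# Threshold values are assumptions; the source does not state them.
new_words = select_new_words(freqs, freedom, freq_threshold=0.2, freedom_threshold=0.5)
print(new_words)
```

This sketch joins adjacent pairs only. For Chinese text the default empty separator would be the natural choice; the space separator is used here only to keep the English example readable.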
Example two
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
The advantage of this embodiment is that only a small number of new words need to be screened manually, which greatly reduces the workload of extracting audit-domain professional words from documents by purely manual work and improves the efficiency of building the audit professional lexicon.
Example three
Further, step S6 also includes storing the spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as stop words.
The advantage of this embodiment is that an existing non-audit professional lexicon is first used to filter stop words, and after a certain number of documents have been processed, a large number of non-new words (that is, spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold) are collected. Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so new words are found efficiently.
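The feedback loop described in this embodiment can be sketched as follows. The function names, the use of Python sets for the two lexicons and the threshold values are all assumptions made for illustration. The sketch shows how candidates rejected in one document are skipped when the next document is processed.

```python
def partition_candidates(freqs, freedom, freq_threshold, freedom_threshold):
    """Split spliced words into new words and non-new words."""
    new_words, non_new_words = set(), set()
    for word, freq in freqs.items():
        # A candidate with no degree-of-freedom score is treated as 0.0.
        if freq > freq_threshold and freedom.get(word, 0.0) > freedom_threshold:
            new_words.add(word)
        else:
            non_new_words.add(word)
    return new_words, non_new_words


def process_document(freqs, freedom, audit_lexicon, non_audit_lexicon,
                     freq_threshold, freedom_threshold):
    """Update both lexicons in place; return how many candidates were scored."""
    # Candidates rejected for earlier documents are filtered out up front.
    pending = {w: f for w, f in freqs.items() if w not in non_audit_lexicon}
    new_words, non_new_words = partition_candidates(
        pending, freedom, freq_threshold, freedom_threshold
    )
    audit_lexicon |= new_words
    non_audit_lexicon |= non_new_words
    return len(pending)


# Hypothetical scores for two spliced words; thresholds are assumptions.
audit_lexicon, non_audit_lexicon = set(), set()
freqs = {"internal control": 0.3, "control of": 0.05}
freedom = {"internal control": 1.2, "control of": 0.1}

first_pass = process_document(freqs, freedom, audit_lexicon, non_audit_lexicon, 0.1, 0.5)
second_pass = process_document(freqs, freedom, audit_lexicon, non_audit_lexicon, 0.1, 0.5)
print(first_pass, second_pass)
```

On the second pass the rejected candidate is never scored, which is the saving in computation that the embodiment describes.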
Example four
Further, a third threshold is preset (in this embodiment, the third threshold is); the mutual information inside each spliced word is calculated;
the degree of freedom is calculated for each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold (in the embodiment, the fourth threshold is);
calculating the left neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left neighbor information entropy does not exceed the fourth threshold, expanding leftward and calculating the next left neighbor information entropy, until the leftmost boundary is reached or a left neighbor information entropy exceeds the fourth threshold; and recording the left neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right neighbor information entropy of all spliced words having a first scale value; if the right neighbor information entropy does not exceed the fourth threshold, expanding rightward and calculating the next right neighbor information entropy, until the rightmost boundary is reached or a right neighbor information entropy exceeds the fourth threshold; and recording the right neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
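A simplified sketch of this calculation is given below. It is deliberately narrower than the procedure in the text: it gates on pointwise mutual information, computes one left and one right neighbor entropy, and returns the smaller value, but it omits the leftward and rightward expansion loop and the fourth threshold. The function names, the choice of pointwise mutual information and Shannon entropy, and the sample sentence are all assumptions.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (base 2) of a neighbor count distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def neighbor_entropies(words, first, second):
    """Left and right neighbor entropy of the spliced word (first, second)."""
    left, right = Counter(), Counter()
    for i in range(len(words) - 1):
        if words[i] == first and words[i + 1] == second:
            if i > 0:
                left[words[i - 1]] += 1
            if i + 2 < len(words):
                right[words[i + 2]] += 1
    return entropy(left), entropy(right)


def mutual_information(words, first, second):
    """Pointwise mutual information between the two halves of a spliced word."""
    n = len(words)
    unigrams = Counter(words)
    pair_count = sum(
        1 for i in range(n - 1) if words[i] == first and words[i + 1] == second
    )
    if pair_count == 0:
        return float("-inf")
    p_pair = pair_count / (n - 1)
    return math.log2(p_pair / ((unigrams[first] / n) * (unigrams[second] / n)))


def degree_of_freedom(words, first, second, mi_threshold):
    """Smaller of the two neighbor entropies, or None if the MI gate fails."""
    if mutual_information(words, first, second) <= mi_threshold:
        return None
    return min(neighbor_entropies(words, first, second))


# Hypothetical segmented text in which the candidate appears twice.
words = ["the", "internal", "control", "failed", "an", "internal", "control", "passed"]
dof = degree_of_freedom(words, "internal", "control", mi_threshold=0.0)
print(dof)
```

In the sample, the candidate has two distinct left neighbors and two distinct right neighbors, each seen once, so both entropies equal 1 bit and the degree of freedom is 1.0.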
Example five
A device for establishing an audit professional lexicon, comprising a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
S1, obtaining audit-related documents, including audit records, audit working papers, audit reports, audit rectification reports and the like;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon (the jieba lexicon tool), and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words with a 2-gram word segmentation algorithm to obtain a plurality of spliced words. For example, if word segmentation yields 'ren', 'zhong' and 'audit', the 2-gram word segmentation algorithm produces the spliced words 'ren zhong', 'zhong audit' and 'ren zhong audit'.
S5, calculating the word frequency and the degree of freedom of each spliced word. The word frequency is the number of times the spliced word appears in the document divided by the total number of words in the document.
S6, presetting a first threshold and a second threshold (in this embodiment, the first threshold is and the second threshold is); and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as new words.
The advantage of this embodiment is that the 2-gram technique splices word segmentation fragments into words, and new domain words in text data are discovered through word frequency statistics and degree-of-freedom calculation on the spliced words.
Example six
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
The advantage of this embodiment is that only a small number of new words need to be screened manually, which greatly reduces the workload of extracting audit-domain professional words from documents by purely manual work and improves the efficiency of building the audit professional lexicon.
Example seven
Further, step S6 also includes storing the spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as stop words.
The advantage of this embodiment is that an existing non-audit professional lexicon is first used to filter stop words, and after a certain number of documents have been processed, a large number of non-new words (that is, spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold) are collected. Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so new words are found efficiently.
Example eight
Further, a third threshold is preset (in this embodiment, the third threshold is); the mutual information inside each spliced word is calculated;
the degree of freedom is calculated for each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold (in the embodiment, the fourth threshold is);
calculating the left neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left neighbor information entropy does not exceed the fourth threshold, expanding leftward and calculating the next left neighbor information entropy, until the leftmost boundary is reached or a left neighbor information entropy exceeds the fourth threshold; and recording the left neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right neighbor information entropy of all spliced words having a first scale value; if the right neighbor information entropy does not exceed the fourth threshold, expanding rightward and calculating the next right neighbor information entropy, until the rightmost boundary is reached or a right neighbor information entropy exceeds the fourth threshold; and recording the right neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
The above is only an embodiment of the present invention and does not limit its scope; all equivalent structural or process modifications made using the contents of this specification and drawings, whether applied directly or indirectly in other related technical fields, fall within the scope of protection of the present invention.
Claims (6)
1. A method for establishing an audit professional lexicon, characterized by comprising the following steps:
S1, obtaining audit-related documents;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon, and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words with a 2-gram word segmentation algorithm to obtain spliced words;
S5, calculating the word frequency and the degree of freedom of each spliced word;
S6, presetting a first threshold and a second threshold; and storing the spliced words whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as new words.
2. The method for establishing an audit professional lexicon according to claim 1, wherein step S6 further comprises manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
3. The method for establishing an audit professional lexicon according to claim 1, wherein step S6 further comprises storing the spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as stop words.
4. The method for establishing an audit professional lexicon according to claim 3, wherein the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.
5. The method for establishing an audit professional lexicon according to claim 4, wherein in step S5, the degree of freedom of a spliced word is calculated as follows:
presetting a third threshold; calculating the mutual information inside each spliced word;
calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold;
calculating the left neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left neighbor information entropy does not exceed the fourth threshold, expanding leftward and calculating the next left neighbor information entropy, until the leftmost boundary is reached or a left neighbor information entropy exceeds the fourth threshold; and recording the left neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right neighbor information entropy of all spliced words having a first scale value; if the right neighbor information entropy does not exceed the fourth threshold, expanding rightward and calculating the next right neighbor information entropy, until the rightmost boundary is reached or a right neighbor information entropy exceeds the fourth threshold; and recording the right neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
6. A device for establishing an audit professional lexicon, comprising a memory and a processor, wherein the memory stores instructions adapted to be loaded by the processor to perform the method for establishing an audit professional lexicon as claimed in any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110797261.0A CN113536787A (en) | 2021-07-14 | 2021-07-14 | Method and equipment for establishing audit professional lexicon |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110797261.0A CN113536787A (en) | 2021-07-14 | 2021-07-14 | Method and equipment for establishing audit professional lexicon |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113536787A (en) | 2021-10-22 |
Family
ID=78099121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110797261.0A Pending CN113536787A (en) | 2021-07-14 | 2021-07-14 | Method and equipment for establishing audit professional lexicon |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113536787A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320960A (en) * | 2015-10-14 | 2016-02-10 | 北京航空航天大学 | Voting based classification method for cross-language subjective and objective sentiments |
CN108038119A (en) * | 2017-11-01 | 2018-05-15 | 平安科技(深圳)有限公司 | Utilize the method, apparatus and storage medium of new word discovery investment target |
CN108595433A (en) * | 2018-05-02 | 2018-09-28 | 北京中电普华信息技术有限公司 | A kind of new word discovery method and device |
-
2021
- 2021-07-14 CN CN202110797261.0A patent/CN113536787A/en active Pending
Non-Patent Citations (1)
Title |
---|
刘永芳: "中国英语新词语料库构建技术研究", 《COMPUTER ENGINEERING AND APPLICATIONS计算机工程与应用》, vol. 56, no. 16, pages 165 - 168 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106528532B (en) | Text error correction method, device and terminal | |
CN109710947B (en) | Electric power professional word bank generation method and device | |
CN107301244A (en) | Method, device, system and the trade mark memory of a kind of trade mark point card processing | |
WO2021073116A1 (en) | Method and apparatus for generating legal document, device and storage medium | |
CN108845982B (en) | Chinese word segmentation method based on word association characteristics | |
CN108573045A (en) | A kind of alignment matrix similarity retrieval method based on multistage fingerprint | |
CN105975491A (en) | Enterprise news analysis method and system | |
CN109582787B (en) | Entity classification method and device for corpus data in thermal power generation field | |
WO2022095353A1 (en) | Speech recognition result evaluation method, apparatus and device, and storage medium | |
CN107066541A (en) | The processing method and system of customer service question and answer data | |
CN110334343B (en) | Method and system for extracting personal privacy information in contract | |
CN110929520A (en) | Non-named entity object extraction method and device, electronic equipment and storage medium | |
CN114266256A (en) | Method and system for extracting new words in field | |
CN114997169B (en) | Entity word recognition method and device, electronic equipment and readable storage medium | |
CN114785606A (en) | Log anomaly detection method based on pre-training LogXLNET model, electronic device and storage medium | |
CN112199937A (en) | Short text similarity analysis method and system, computer equipment and medium | |
CN110413998B (en) | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof | |
KR101179613B1 (en) | Method of automatic patent document categorization adjusting association rules and frequent itemset | |
CN113536787A (en) | Method and equipment for establishing audit professional lexicon | |
CN111199801A (en) | Construction method and application of model for identifying disease types of medical records | |
CN109740147B (en) | Duplicate removal matching analysis method for large-number talent resume | |
CN111898378A (en) | Industry classification method and device for government and enterprise clients, electronic equipment and storage medium | |
CN110413985B (en) | Related text segment searching method and device | |
Hawas | Towards a new approach for Arabic root extraction: Exploit relations between the word letters and their placement in the word for Arabic root extraction | |
CN111341404B (en) | Electronic medical record data set analysis method and system based on ernie model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||