CN113536787A - Method and equipment for establishing audit professional lexicon - Google Patents


Publication number
CN113536787A
Authority
CN
China
Prior art keywords
threshold
audit
word
words
spliced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110797261.0A
Other languages
Chinese (zh)
Inventor
王秋琳
郑略省
吕世雷
张萍
庄莉
梁懿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Fujian Yirong Information Technology Co Ltd, Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority to CN202110797261.0A
Publication of CN113536787A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for establishing an audit professional lexicon, comprising the following steps: obtaining audit-related documents; preprocessing the audit-related documents; segmenting the preprocessed documents into words according to a non-audit professional lexicon and removing stop words to obtain a plurality of independent words; splicing the independent words with a 2-gram word segmentation algorithm to obtain a plurality of spliced words; calculating the word frequency and the degree of freedom of each spliced word; presetting a first threshold and a second threshold; and storing each spliced word whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as a new word.

Description

Method and equipment for establishing audit professional lexicon
Technical Field
The invention relates to a method for establishing an audit professional lexicon, belonging to the field of natural language processing.
Background
Algorithms for extracting domain words fall roughly into the following three categories:
(1) Rule-based methods: rules are established from the internal composition of words, their external context, and similar cues, and domain words are extracted by pattern matching.
(2) Statistical methods: these rely on word frequency, likelihood ratios, hypothesis testing, mutual information, and the like; their recognition of individual domain words and of low-frequency domain words is not ideal.
(3) Methods combining statistics and rules: given the shortcomings of the two kinds of method above, these algorithms combine the advantages of both. They can be divided into three types: rules used as a filtering step in a statistical method; specific rules blended into a statistical method; and statistics computed using the "rule" information of the context. Many statistical algorithms use a filter-rule template to discard unqualified term combinations, and practice has shown this approach to be simple and feasible.
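As an illustration of the first type in category (3), a rule used as a filtering step after a statistical step, the following minimal Python sketch counts adjacent token pairs and then discards pairs that begin or end with a banned token. The function name `extract_candidates`, the parameter `banned_edges`, and the sample tokens are hypothetical and are not taken from the invention.

```python
from collections import Counter


def extract_candidates(tokens, min_count, banned_edges):
    """Return (candidate, count) pairs that pass a count test and a rule filter.

    Hypothetical helper: the statistical step keeps adjacent pairs seen at
    least `min_count` times; the rule step drops pairs whose first or last
    token is in `banned_edges` (for example, function words).
    """
    pair_counts = Counter(zip(tokens, tokens[1:]))
    kept = []
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue  # statistical step: too rare
        if first in banned_edges or second in banned_edges:
            continue  # rule step: filtered by the template
        kept.append((first + " " + second, count))
    return sorted(kept)


sample = "the audit report of the audit report".split()
candidates = extract_candidates(sample, min_count=2, banned_edges={"the", "of"})
```

Here the pair "the audit" is frequent enough to pass the statistical step but is removed by the rule, which is the behavior the filter-rule template is meant to provide.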
At present, there is no general and comprehensive professional lexicon for the audit field.
Disclosure of Invention
To solve the problems in the prior art, the invention provides the following technical schemes:
The first technical scheme is as follows:
A method for establishing an audit professional lexicon comprises the following steps:
S1, obtaining audit-related documents;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon, and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words through a 2-gram word segmentation algorithm to obtain spliced words;
S5, calculating the word frequency and the degree of freedom of each spliced word;
S6, presetting a first threshold and a second threshold; and storing each spliced word whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as a new word.
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
Further, step S6 also includes storing each spliced word whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as a stop word.
Further, the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.
Further, in step S5, the degree of freedom of a spliced word is calculated as follows:
presetting a third threshold; calculating the mutual information inside the spliced word;
calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold;
calculating the left-neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left-neighbor information entropy does not exceed the fourth threshold, continuing to expand leftward and calculating the next left-neighbor information entropy, until the leftmost boundary is reached or a left-neighbor information entropy exceeds the fourth threshold; and recording the left-neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right-neighbor information entropy of all spliced words that have a first scale value; if the right-neighbor information entropy does not exceed the fourth threshold, continuing to expand rightward and calculating the next right-neighbor information entropy, until the rightmost boundary is reached or a right-neighbor information entropy exceeds the fourth threshold; and recording the right-neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
The second technical scheme is as follows:
A device for establishing an audit professional lexicon comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
S1, obtaining audit-related documents;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon, and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words through a 2-gram word segmentation algorithm to obtain spliced words;
S5, calculating the word frequency and the degree of freedom of each spliced word;
S6, presetting a first threshold and a second threshold; and storing each spliced word whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as a new word.
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
Further, step S6 also includes storing each spliced word whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as a stop word.
Further, the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.
Further, in step S5, the degree of freedom of a spliced word is calculated as follows:
presetting a third threshold; calculating the mutual information inside the spliced word;
calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold;
calculating the left-neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left-neighbor information entropy does not exceed the fourth threshold, continuing to expand leftward and calculating the next left-neighbor information entropy, until the leftmost boundary is reached or a left-neighbor information entropy exceeds the fourth threshold; and recording the left-neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right-neighbor information entropy of all spliced words that have a first scale value; if the right-neighbor information entropy does not exceed the fourth threshold, continuing to expand rightward and calculating the next right-neighbor information entropy, until the rightmost boundary is reached or a right-neighbor information entropy exceeds the fourth threshold; and recording the right-neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
The invention has the following beneficial effects:
1. The invention applies the 2-gram technique to splice segmented word fragments, and discovers new domain words in text data through word-frequency statistics and degree-of-freedom calculation on the spliced words.
2. Only a small number of new words need to be screened manually, which greatly reduces the workload of extracting audit professional vocabulary from documents purely by hand and improves the efficiency of building the audit professional lexicon.
3. The invention initially uses the existing non-audit professional lexicon to filter stop words; after a certain number of documents have been processed, a large number of non-new words (that is, spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold) are collected. Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so that new words can be found efficiently.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Example one
A method for establishing an audit professional lexicon comprises the following steps:
S1, obtaining audit-related documents, including audit records, audit manuscripts, audit reports, audit correction reports, and the like.
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon (the jieba lexicon tool), and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words through a 2-gram word segmentation algorithm to obtain a plurality of spliced words. For example, if word segmentation yields "ren", "zhong", and "audit", the 2-gram word segmentation algorithm produces the spliced words "ren zhong", "zhong audit", and "ren zhong audit".
S5, calculating the word frequency and the degree of freedom of each spliced word. The word frequency is the total number of times the spliced word appears in the document divided by the total number of words in the document.
S6, presetting a first threshold and a second threshold (in this embodiment, the first threshold is and the second threshold is); and storing each spliced word whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as a new word.
The advantage of this embodiment is that the 2-gram technique is applied to splice segmented word fragments, and new domain words in text data are discovered through word-frequency statistics and degree-of-freedom calculation on the spliced words.
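Steps S3 to S6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the input is assumed to be already segmented (the embodiment uses the jieba tool for that), the total word count is taken to be the number of tokens left after stop-word removal, the degree-of-freedom values are supplied by the caller, and all function names, sample tokens, and threshold values are hypothetical.

```python
from collections import Counter


def remove_stop_words(tokens, stop_words):
    """S3 (second half): drop tokens found in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def splice(tokens, sep=" "):
    """S4: join every pair of adjacent tokens into one spliced word."""
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]


def word_frequencies(spliced_words, total_words):
    """S5: occurrences of each spliced word divided by the total word count."""
    counts = Counter(spliced_words)
    return {word: count / total_words for word, count in counts.items()}


def select_new_words(frequencies, freedoms, freq_threshold, freedom_threshold):
    """S6: keep spliced words that exceed both thresholds."""
    return sorted(
        word
        for word, freq in frequencies.items()
        if freq > freq_threshold and freedoms.get(word, 0.0) > freedom_threshold
    )


# Hypothetical, pre-segmented input.
raw = ["the", "internal", "control", "audit", "of",
       "the", "internal", "control", "report"]
tokens = remove_stop_words(raw, stop_words={"the", "of"})
frequencies = word_frequencies(splice(tokens), total_words=len(tokens))
# Degree-of-freedom values are assumed to have been computed separately.
new_words = select_new_words(
    frequencies, {"internal control": 1.0},
    freq_threshold=0.2, freedom_threshold=0.5,
)
```

For Chinese text the separator would normally be an empty string, so `splice(tokens, sep="")` joins the fragments directly.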
Example two
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
The advantage of this embodiment is that only a small number of new words need to be screened manually, which greatly reduces the workload of extracting audit professional vocabulary from documents purely by hand and improves the efficiency of building the audit professional lexicon.
Example three
Further, step S6 also includes storing each spliced word whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as a stop word.
The advantage of this embodiment is that the existing non-audit professional lexicon is used to filter stop words at the start; after a certain number of documents have been processed, a large number of non-new words (that is, spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold) are collected. Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so that new words can be found efficiently.
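The feedback described in this embodiment can be sketched as follows. The sketch assumes that each batch of documents has already been reduced to a mapping from spliced word to a (word frequency, degree of freedom) pair; the function name `process_batch`, the sample words, and the threshold values are hypothetical.

```python
def process_batch(stats, non_audit_lexicon, freq_threshold, freedom_threshold):
    """Select new words from one batch and feed rejected words back.

    `stats` maps spliced word -> (word frequency, degree of freedom).
    `non_audit_lexicon` is a set that is updated in place with every
    spliced word that fails either threshold.
    Returns the new words and the number of candidates actually evaluated.
    """
    # Candidates already stored as stop words are skipped entirely.
    pending = {w: s for w, s in stats.items() if w not in non_audit_lexicon}
    new_words = set()
    for word, (freq, freedom) in pending.items():
        if freq > freq_threshold and freedom > freedom_threshold:
            new_words.add(word)
        else:
            non_audit_lexicon.add(word)  # stored as a stop word for later batches
    return new_words, len(pending)


lexicon = {"the"}
first = {"internal control": (0.30, 1.2), "control report": (0.05, 0.1)}
second = {"control report": (0.05, 0.1), "audit finding": (0.20, 0.9)}
new_1, evaluated_1 = process_batch(first, lexicon, 0.1, 0.5)
new_2, evaluated_2 = process_batch(second, lexicon, 0.1, 0.5)
```

In the second batch, "control report" is skipped because the first batch already stored it in the non-audit lexicon, which is where the saving in computation comes from.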
Example four
Further, a third threshold is preset (in this embodiment, the third threshold is); the mutual information inside the spliced word is calculated;
the degree of freedom is calculated for each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold (in this embodiment, the fourth threshold is);
calculating the left-neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left-neighbor information entropy does not exceed the fourth threshold, continuing to expand leftward and calculating the next left-neighbor information entropy, until the leftmost boundary is reached or a left-neighbor information entropy exceeds the fourth threshold; and recording the left-neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right-neighbor information entropy of all spliced words that have a first scale value; if the right-neighbor information entropy does not exceed the fourth threshold, continuing to expand rightward and calculating the next right-neighbor information entropy, until the rightmost boundary is reached or a right-neighbor information entropy exceeds the fourth threshold; and recording the right-neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
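The two quantities used in this embodiment can be sketched in Python as follows. This is a simplified illustration and not the procedure of the embodiment: mutual information is computed as pointwise mutual information of a two-token spliced word, the neighbor entropies are Shannon entropies over the tokens observed immediately to the left and right, and the iterative leftward and rightward expansion controlled by the fourth threshold is omitted. All function names and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens, left, right):
    """PMI (in bits) of the adjacent pair (left, right), estimated from counts."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(left, right)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs together
    p_pair = pair_count / (n - 1)
    p_left = unigrams[left] / n
    p_right = unigrams[right] / n
    return math.log2(p_pair / (p_left * p_right))


def neighbor_entropy(neighbors):
    """Shannon entropy (in bits) of a list of observed neighbor tokens."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def degree_of_freedom(tokens, left, right):
    """Smaller of the left- and right-neighbor entropies of the pair."""
    lefts, rights = [], []
    for i in range(len(tokens) - 1):
        if tokens[i] == left and tokens[i + 1] == right:
            if i > 0:
                lefts.append(tokens[i - 1])
            if i + 2 < len(tokens):
                rights.append(tokens[i + 2])
    return min(neighbor_entropy(lefts), neighbor_entropy(rights))


def freedom_if_cohesive(tokens, left, right, mi_threshold):
    """Compute the degree of freedom only when PMI exceeds the threshold."""
    if pointwise_mutual_information(tokens, left, right) <= mi_threshold:
        return None
    return degree_of_freedom(tokens, left, right)


sample = ["a", "x", "y", "b", "c", "x", "y", "d"]
freedom = freedom_if_cohesive(sample, "x", "y", mi_threshold=1.0)
```

A spliced word that always appears next to the same neighbor has a neighbor entropy of zero, which suggests that it is a fragment of a longer word; taking the smaller of the two entropies means that a word must vary on both sides to obtain a high degree of freedom.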
Example five
A device for establishing an audit professional lexicon comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
S1, obtaining audit-related documents, including audit records, audit manuscripts, audit reports, audit correction reports, and the like.
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon (the jieba lexicon tool), and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words through a 2-gram word segmentation algorithm to obtain a plurality of spliced words. For example, if word segmentation yields "ren", "zhong", and "audit", the 2-gram word segmentation algorithm produces the spliced words "ren zhong", "zhong audit", and "ren zhong audit".
S5, calculating the word frequency and the degree of freedom of each spliced word. The word frequency is the total number of times the spliced word appears in the document divided by the total number of words in the document.
S6, presetting a first threshold and a second threshold (in this embodiment, the first threshold is and the second threshold is); and storing each spliced word whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as a new word.
The advantage of this embodiment is that the 2-gram technique is applied to splice segmented word fragments, and new domain words in text data are discovered through word-frequency statistics and degree-of-freedom calculation on the spliced words.
Example six
Further, step S6 also includes manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
The advantage of this embodiment is that only a small number of new words need to be screened manually, which greatly reduces the workload of extracting audit professional vocabulary from documents purely by hand and improves the efficiency of building the audit professional lexicon.
Example seven
Further, step S6 also includes storing each spliced word whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as a stop word.
The advantage of this embodiment is that the existing non-audit professional lexicon is used to filter stop words at the start; after a certain number of documents have been processed, a large number of non-new words (that is, spliced words whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold) are collected. Storing these non-new words in the non-audit professional lexicon for filtering greatly reduces subsequent computation, so that new words can be found efficiently.
Example eight
Further, a third threshold is preset (in this embodiment, the third threshold is); the mutual information inside the spliced word is calculated;
the degree of freedom is calculated for each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold (in this embodiment, the fourth threshold is);
calculating the left-neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left-neighbor information entropy does not exceed the fourth threshold, continuing to expand leftward and calculating the next left-neighbor information entropy, until the leftmost boundary is reached or a left-neighbor information entropy exceeds the fourth threshold; and recording the left-neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right-neighbor information entropy of all spliced words that have a first scale value; if the right-neighbor information entropy does not exceed the fourth threshold, continuing to expand rightward and calculating the next right-neighbor information entropy, until the rightmost boundary is reached or a right-neighbor information entropy exceeds the fourth threshold; and recording the right-neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
The above is only an embodiment of the invention and does not limit its scope; all equivalent structures or equivalent processes derived from this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the invention.

Claims (6)

1. A method for establishing an audit professional lexicon, characterized by comprising the following steps:
S1, obtaining audit-related documents;
S2, preprocessing the audit-related documents;
S3, segmenting the preprocessed audit-related documents into words according to a non-audit professional lexicon, and removing stop words to obtain a plurality of independent words;
S4, splicing the independent words through a 2-gram word segmentation algorithm to obtain spliced words;
S5, calculating the word frequency and the degree of freedom of each spliced word;
S6, presetting a first threshold and a second threshold; and storing each spliced word whose word frequency exceeds the first threshold and whose degree of freedom exceeds the second threshold in the audit professional lexicon as a new word.
2. The method for establishing an audit professional lexicon according to claim 1, wherein step S6 further comprises manually reviewing the new words, and storing the new words that pass the review in the audit professional lexicon.
3. The method for establishing an audit professional lexicon according to claim 1, wherein step S6 further comprises storing each spliced word whose word frequency does not exceed the first threshold or whose degree of freedom does not exceed the second threshold in the non-audit professional lexicon as a stop word.
4. The method for establishing an audit professional lexicon according to claim 3, wherein the preprocessing specifically comprises: converting unstructured documents into structured documents using a POI tool or a Tika tool.
5. The method for establishing an audit professional lexicon according to claim 4, wherein in step S5, the degree of freedom of a spliced word is calculated as follows:
presetting a third threshold; calculating the mutual information inside the spliced word;
calculating the degree of freedom of each spliced word whose mutual information is greater than the third threshold:
presetting a fourth threshold;
calculating the left-neighbor information entropy of the spliced word whose mutual information is greater than the third threshold; if the left-neighbor information entropy does not exceed the fourth threshold, continuing to expand leftward and calculating the next left-neighbor information entropy, until the leftmost boundary is reached or a left-neighbor information entropy exceeds the fourth threshold; and recording the left-neighbor information entropy that exceeds the fourth threshold as a first scale value;
calculating the right-neighbor information entropy of all spliced words that have a first scale value; if the right-neighbor information entropy does not exceed the fourth threshold, continuing to expand rightward and calculating the next right-neighbor information entropy, until the rightmost boundary is reached or a right-neighbor information entropy exceeds the fourth threshold; and recording the right-neighbor information entropy that exceeds the fourth threshold as a second scale value;
and taking the smaller of the first scale value and the second scale value as the degree of freedom of the corresponding spliced word.
6. A device for establishing an audit professional lexicon, comprising a memory and a processor, wherein the memory stores instructions adapted to be loaded by the processor to perform the method for establishing an audit professional lexicon according to any one of claims 1 to 5.
CN202110797261.0A 2021-07-14 2021-07-14 Method and equipment for establishing audit professional lexicon Pending CN113536787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110797261.0A CN113536787A (en) 2021-07-14 2021-07-14 Method and equipment for establishing audit professional lexicon

Publications (1)

Publication Number Publication Date
CN113536787A (en) 2021-10-22

Family

ID=78099121

Country Status (1)

Country Link
CN (1) CN113536787A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320960A (en) * 2015-10-14 2016-02-10 北京航空航天大学 Voting based classification method for cross-language subjective and objective sentiments
CN108038119A (en) * 2017-11-01 2018-05-15 平安科技(深圳)有限公司 Utilize the method, apparatus and storage medium of new word discovery investment target
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘永芳 (Liu Yongfang): "Research on Construction Techniques for a China English Neologism Corpus" (中国英语新词语料库构建技术研究), Computer Engineering and Applications (计算机工程与应用), vol. 56, no. 16, pages 165-168 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination