CN113919342A - Method for constructing accounting term co-occurrence network diagram - Google Patents

Method for constructing accounting term co-occurrence network diagram

Info

Publication number
CN113919342A
Authority
CN
China
Prior art keywords
accounting
words
pagerank
term
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111096537.9A
Other languages
Chinese (zh)
Inventor
潘定
梁倬骞
叶迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202111096537.9A
Publication of CN113919342A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing an accounting term co-occurrence network graph. To extract semantic elements of the accounting domain, the method constructs a directed network graph from the words in an accounting dictionary, extracts semantic elements and describes domain knowledge using an improved PageRank algorithm, and finally merges the results based on the synonym forest to obtain a candidate set of semantic elements for accounting terms. Exploiting the characteristics of accounting domain knowledge, the invention designs a graph-theory-based semantic element extraction method for the corpus of an accounting dictionary. The accounting dictionary is an important professional corpus and an authoritative normative text of the accounting domain, and it systematically and comprehensively covers the relevant terms and definitions of the domain. If a computer can read accounting text with the aid of semantic elements extracted from an accounting dictionary, the large amount of information in the accounting domain can be used effectively. Term research based on the accounting dictionary therefore overcomes the limitations of subjective analysis and small sample sizes in semantic element extraction.

Description

Method for constructing accounting term co-occurrence network diagram
Technical Field
The invention relates to the technical field of machine readability of financial information, and in particular to a method for constructing an accounting term co-occurrence network diagram.
Background Art
At present, online financial reports in the accounting domain lack standardized knowledge description, which makes it difficult for computers to read financial information and limits the adoption and development prospects of online financial reporting formats such as XBRL (eXtensible Business Reporting Language). A few scholars have tried to extract semantic elements with currently popular machine learning algorithms. Although these methods reduce labor and time costs, the extracted terms contain considerable noise, their domain characteristics are weak, and their validity cannot be verified. The present invention addresses this gap in research on online financial reports: it studies the key problem of "core language extraction" in XBRL financial reports and introduces the concept of semantic elements, aiming to strengthen the semantic characteristics of knowledge expression in the accounting domain and thereby improve the accuracy and efficiency with which machines identify information.
Successful extraction of semantic elements helps improve the quality of the general classification standard for accounting, enhances the machine readability of financial information, improves the accuracy and efficiency with which stakeholders obtain financial information, lowers the technical barriers to applying and popularizing online financial reports, and encourages enterprises to adopt them. From a longer-term and broader perspective, the research of the invention can improve the accuracy and authenticity of information disclosure, can deter financial fraud by enterprises to a certain extent, helps protect the legal rights and interests of stakeholders, and maintains the quality of market information, and it therefore has practical significance.
In the prior art, few attempts have been made to solve the difficulty of ontology construction with currently popular machine learning algorithms. Although such methods reduce labor and time costs, the extracted terms are noisy, lack distinct domain characteristics and are of limited practical use. Semantic primitive extraction methods are generally classified as linguistics-based, statistics-based, machine-learning-based, graph-theory-based and so on, but each has limitations, specifically:
1. Current research stays at the vocabulary level and does not reach the semantic level.
Related work shows that most current research on this problem stays at the vocabulary level: the semantic material used to construct an ontology is regarded as a mixture of the concepts the ontology needs and redundant information, and researchers clean and filter this material to obtain keywords that fit the index system as the concepts of the ontology. This approach is limited by the selected semantic material. Linguistics-based extraction methods can only handle small-scale material, while statistics-based and machine-learning-based methods can process large-scale text but yield terms that are noisy, lack distinct domain characteristics and lack semantic characteristics.
2. The fit with domain knowledge is insufficient.
Research on semantic element extraction shows that the text material used often lacks professional authority or the participation of domain experts, so the extracted semantic elements fit domain knowledge poorly. The invention selects an accounting dictionary as an authoritative and comprehensive text material for the financial reporting domain and invites domain experts to take part in analyzing the effectiveness and superiority of the method, so as to ensure the closest possible match between the extracted semantic elements and domain knowledge.
3. The research perspective is narrow, and cross-domain results are few.
In existing research, concept extraction mostly means extracting keywords from text material, and the research process consists of removing redundant information from that material. The research perspective is therefore narrow, and cross-domain results are few.
Disclosure of Invention
In view of the defects of the prior art, the invention discloses a method for constructing an accounting term co-occurrence network diagram. The accounting dictionary is an authoritative normative text of the accounting domain that systematically and completely covers the domain's terms and definitions. If a computer can read and understand the accounting dictionary, the large amount of information in the accounting domain can be used effectively. Research based on the accounting dictionary therefore overcomes the limitations of subjective analysis and small sample sizes in semantic element extraction.
In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:
A method for constructing an accounting term co-occurrence network graph, in which semantic elements of the accounting domain are extracted by constructing a directed network graph from an accounting dictionary, extracting semantic elements and describing domain knowledge using an improved PageRank algorithm, and merging based on the synonym forest to finally obtain a candidate set of semantic elements.
Specifically, the method includes the following steps:
S1, manually extracting and organizing the definition text of the accounting terms, and collecting it in Excel;
S2, performing word segmentation, stop-word removal and de-duplication on the Excel summary from step S1;
S3, constructing an accounting term directed network graph;
S4, based on the accounting dictionary network graph constructed in step S3, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic primitive extraction;
S5, after identifying the words with higher PageRank values, merging semantic primitives based on the synonym forest to obtain the final candidate set of semantic primitives.
In step S2, word segmentation is performed with the jieba package for Python. To preserve the integrity of the accounting terms, the accounting terms in the accounting dictionary must be imported into the custom dictionary; a stop-word list is then built, and the words in the definition text of each term are de-duplicated.
In step S3, a directed loop graph is constructed from the text according to the word segmentation result. Each term and each word of its segmented definition text is taken as a node, and a directed edge runs from the term to each word in its definition text. In particular, if another term B appears in the definition text of a term A, there is a directed edge from A to B.
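The construction rule of step S3 can be sketched in Python as follows. This is a minimal illustration rather than the program used in the invention: the function name `build_term_graph`, the toy definitions and the choice to skip self-loops are all assumptions made for the example.

```python
from collections import defaultdict


def build_term_graph(definitions):
    """Map each headword to the set of words it points to (its definition words)."""
    graph = defaultdict(set)
    for term, words in definitions.items():
        graph[term]  # register the headword as a node even if its definition is empty
        for word in words:
            if word != term:      # assumption: a term does not point to itself
                graph[term].add(word)
                graph[word]       # definition words are nodes too
    return dict(graph)


# Toy, made-up definitions that are already segmented (not taken from the dictionary).
definitions = {
    "asset": ["resource", "income"],
    "income": ["asset", "inflow"],
    "rent": ["income", "payment"],
}

graph = build_term_graph(definitions)
out_degree = {node: len(targets) for node, targets in graph.items()}
```

In this toy graph, "asset" and "income" point to each other and so form a loop, while words that never act as headwords, such as "resource", end up with an out-degree of 0.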
In step S5, the extracted semantic elements are concentrated in the set of non-accounting terms. Because the accounting dictionary uses varied styles of expression, the extracted semantic elements include words that differ in form but have similar definitions. These words need to be merged to improve the expressive efficiency of the semantic elements.
It should be noted that, the core program in step S4 in the present invention is:
% Compute PageRank with a follow probability (damping factor) of 0.85
pr = centrality(G,'pagerank','FollowProbability',0.85);
G.Nodes.PageRank = pr;
G.Nodes.InDegree = indegree(G);
G.Nodes.OutDegree = outdegree(G);
G.Nodes   % view the PageRank score and degree information of each node
% Draw the graph using a force-directed layout
plot(G,'NodeLabel',{},'NodeColor',[0.93 0.78 0],'Layout','force');
title('PageRank')
% Recompute the PageRank score of G with at most 200 iterations and a damping
% factor of 0.85, and store the scores in the node table of the graph
pr = centrality(G,'pagerank','MaxIterations',200,'FollowProbability',0.85);
G.Nodes.PageRank = pr;
% List the nodes in descending order of PR value (a sorted copy of the node
% table, so that the node order of the graph itself is unchanged)
ranked = sortrows(G.Nodes,'PageRank','descend')
% Extract and draw the subgraph containing all nodes with a score greater than
% 0.005, coloring the nodes according to their PageRank score
H = subgraph(G,find(G.Nodes.PageRank > 0.005));
plot(H,'NodeLabel',{},'NodeCData',H.Nodes.PageRank,'Layout','force');
title('PageRank')
colorbar
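For readers without MATLAB, the PageRank step can be approximated by a plain power iteration in Python. The sketch below is not the MATLAB `centrality` implementation: the function name `pagerank`, the uniform redistribution of rank from nodes without outgoing edges, the convergence tolerance and the toy graph are assumptions. The follow probability of 0.85 and the limit of 200 iterations mirror the values in the program above.

```python
def pagerank(graph, follow_probability=0.85, max_iterations=200, tol=1e-10):
    """Power-iteration PageRank; `graph` maps every node to its set of successors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes with no outgoing edges is spread over all nodes (assumption).
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1.0 - follow_probability) / n + follow_probability * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            if graph[v]:
                share = follow_probability * rank[v] / len(graph[v])
                for w in graph[v]:
                    new_rank[w] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph (made up): "asset" and "income" form a loop; three words are leaves.
graph = {
    "asset": {"resource", "income"},
    "income": {"asset", "inflow"},
    "rent": {"income", "payment"},
    "resource": set(),
    "inflow": set(),
    "payment": set(),
}

scores = pagerank(graph)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because "rent" is never used in another definition, it receives no rank from other nodes and ends up last in `ranked`.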
The invention has the following advantages:
1. The expression characteristics of financial reports and financial information elements are analyzed, and the structural characteristics of financial information element terms are summarized. First, qualitative and quantitative methods are combined to analyze the structural and expressive characteristics of financial reports. Then, taking the element list of the XBRL general classification standard as the core corpus, the terms are segmented manually to obtain their structural regularities: each term consists of a core word, which carries the main information, and additional modifiers, which express the term's related attributes. This structural characteristic provides guidance and a basis for extracting semantic elements.
2. Both the comprehensiveness and the scientific soundness of semantic element extraction are considered. First, a directed graph of the accounting dictionary is constructed, and each node is found to be in exactly one of two situations: on a loop or not on a loop. Nodes on a loop are extracted according to their PageRank value, and among nodes not on a loop, those with an out-degree of 0 are selected, which ensures that the extraction is comprehensive and scientifically sound. In addition, the invention merges the preliminarily extracted semantic elements using the synonym forest, which improves their expressive efficiency and allows the largest range of domain knowledge to be expressed with the smallest set of semantic elements.
Drawings
FIG. 1 illustrates the construction of a directed loop graph from Table 1;
FIG. 2 is a loop diagram and an example of a PageRank value distribution according to the present invention.
Detailed Description of Embodiments
The invention is further described below with reference to the accompanying drawings. The present embodiment is based on the above technical solution and provides a detailed implementation and specific operating process, but the protection scope of the invention is not limited to this embodiment.
Examples
Simulation experiment
The invention uses the English-Chinese Modern Finance and Accounting Dictionary edited by Chen Jinchi (China Financial & Economic Publishing House, 2009) as experimental data, from which 4289 accounting terms and 32086 words were compiled as the accounting-domain text for the experiment.
The programs and software mainly used for data processing are Excel 2016, Python 3.7 and MATLAB R2016a: Excel is used to organize the accounting dictionary in structured form, the jieba package for Python is used to segment the term definitions, and MATLAB is used to draw the directed loop graph and calculate the PageRank values. The specific work is as follows:
(1) Manually extract and organize the definition text of the accounting terms.
Text analysis of the accounting dictionary shows that the entry for an accounting term contains not only a definitional description but also non-definitional material such as examples and calculation formulas. The latter is redundant for semantic primitive extraction, so the invention manually extracts and organizes the definitional text of each accounting term and collects it in Excel.
(2) Segment the text, remove stop words and de-duplicate.
Word segmentation is then performed with the jieba package for Python. To preserve the integrity of the accounting terms, the 4289 accounting terms in the accounting dictionary are imported into the custom dictionary; a stop-word list is then built, and the words in the definition text of each term are de-duplicated.
Table 1. Example of word segmentation results for part of the accounting dictionary
[Table 1 appears as an image in the original publication.]
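The stop-word removal and de-duplication of step (2) can be sketched as follows. The token list stands in for the output of a segmenter such as jieba, whose `load_userdict` and `lcut` functions would normally load the custom dictionary and segment the text. The function name `clean_tokens`, the stop-word set and the sample tokens are invented for illustration, and keeping the first occurrence of each word is an assumption.

```python
def clean_tokens(tokens, stop_words):
    """Drop stop words and repeated words, keeping first-occurrence order."""
    seen = set()
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words or token in seen:
            continue
        seen.add(token)
        cleaned.append(token)
    return cleaned


# Hypothetical stop-word list and segmented definition text.
stop_words = {"的", "是", "和", "，", "。"}
tokens = ["企业", "的", "资产", "和", "企业", "的", "负债", "。"]

cleaned = clean_tokens(tokens, stop_words)
```

Applied to every definition, this yields one de-duplicated word list per term, which is the input assumed by the graph construction in step (3).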
(3) Construct the accounting term directed network graph.
Based on the word segmentation result, a directed loop graph can be constructed from the texts, as shown in FIG. 1. Each term and each word of its segmented definition text is taken as a node, and a directed edge runs from the term to each word in its definition text. In particular, if another term B (such as "rent") appears in the definition text of a term A (such as "rent"), there is a directed edge from A to B. These relationships are then drawn as a graph.
(4) Calculate the PageRank values.
After the network graph has been constructed from the accounting dictionary, the PageRank value of each node is calculated using MATLAB R2016a and serves as the basis for semantic primitive extraction.
As can be seen from FIG. 2, the PR values of leaf nodes are generally higher. Because semantic primitives are used to explain other words and cannot themselves be explained, semantic primitives should be extracted at leaf nodes, which is consistent with the analysis above. For nodes on a loop, the method selects the node with the highest PR value as the semantic primitive.
Preliminary extraction results
Screening by PageRank yields the following ranking of semantic primitives:
Table 2. Top 20 accounting terms by PageRank value (example)
[Table 2 appears as an image in the original publication.]
After the words with higher PR values have been identified, the following processing is needed to balance the accuracy and the size of the extracted set:
(1) Semantic primitives are mainly found at the node with the maximum PR value within a loop and at leaf nodes outside loops.
The first case is nodes with an out-degree of 0, such as "construction project", "share", "total", "find", "record" and "provision" in the table above. An out-degree of 0 indicates that the node is a leaf node of the directed graph, and the PageRank values of these nodes are high. However, the nodes extracted in this way include synonyms, which inflates the size of the extracted set of semantic elements. The invention therefore merges terms with similar definitions based on the synonym forest; for example, of "share" and "total", only one needs to be kept.
The second case is the node with the maximum PR value within a loop. For example, "asset", "income" and "shares" lie on the same loop and PR(asset) > PR(income) > PR(shares), so "asset" is extracted as the semantic primitive.
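The two selection rules can be sketched in Python as follows. The sketch treats a "loop" as a strongly connected component with more than one node, which is an interpretation rather than something the text states; the function names, the toy graph and the PR values are likewise made up, with the values chosen so that PR(asset) > PR(income) > PR(shares) as in the example above.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; returns a list of node sets."""
    order, visited = [], set()

    def visit(node):  # recursive DFS, adequate for small graphs
        visited.add(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visit(nxt)
        order.append(node)

    for node in graph:
        if node not in visited:
            visit(node)

    reverse = {node: set() for node in graph}
    for node, targets in graph.items():
        for nxt in targets:
            reverse[nxt].add(node)

    components, assigned = [], set()
    for node in reversed(order):
        if node in assigned:
            continue
        component, stack = set(), [node]
        while stack:
            cur = stack.pop()
            if cur in assigned:
                continue
            assigned.add(cur)
            component.add(cur)
            stack.extend(reverse[cur] - assigned)
        components.append(component)
    return components


def select_primitives(graph, scores):
    """Leaf nodes (out-degree 0) plus the highest-scoring node of each loop."""
    leaves = {node for node, targets in graph.items() if not targets}
    loop_best = {
        max(component, key=scores.get)
        for component in strongly_connected_components(graph)
        if len(component) > 1
    }
    return leaves | loop_best


# Toy graph and made-up PR values.
graph = {
    "asset": {"income"},
    "income": {"shares"},
    "shares": {"asset", "total"},
    "total": set(),
}
scores = {"asset": 0.30, "income": 0.25, "shares": 0.20, "total": 0.25}

primitives = select_primitives(graph, scores)
```

Here "total" is selected as a leaf node and "asset" as the highest-ranked node of the loop.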
(2) Semantic primitive merging based on synonym forest
Semantic elements carry the ability to express domain knowledge, but if the set contains too many words, extracting them has little value. Both the accuracy and the efficiency of knowledge expression must therefore be ensured, so that the largest range of domain knowledge is expressed with the smallest set of semantic elements. The semantic elements extracted in this research are concentrated in the set of non-accounting terms, and because the accounting dictionary uses varied styles of expression, the extracted semantic elements include words that differ in form but have similar definitions. These words need to be merged to improve the expressive efficiency of the semantic elements.
Synonym forest merging
To improve the efficiency of knowledge expression, the preliminary extraction results are merged based on the synonym forest. The WordSimilarity module is installed to calculate similarity, all terms in the Excel file of preliminary extraction results are read, and 0.8 is chosen as the similarity threshold in the invention to maximize application efficiency. If the similarity of two terms is greater than 0.8, they are written to the same row of a new Excel file.
Part of the semantic primitives merged by the synonym forest are shown in Table 3.
Table 3. Synonym forest merging
[Table 3 appears as an image in the original publication.]
With this similarity measure, a high score may indicate either that two words are close in meaning or merely that they are strongly related. For example, with a threshold of 0.8, "other person" is grouped with "oneself" and "self", whereas with a threshold of 0.9 it is not. At 0.9, however, far fewer words can be grouped together.
Some terms are not merged in Excel, mainly for the following reasons:
(1) Most of them are proper nouns that are not in the word list encoded in WordSimilarity, so their similarity cannot be calculated.
(2) Many of them are compound words made up of two words, such as "employee fraud" and "buyer and seller", which makes similarity calculation difficult.
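The threshold-based merging can be sketched as follows. The similarity function here is a hypothetical lookup table standing in for the synonym-forest similarity, and the scores in it are invented. Grouping terms transitively with a union-find structure is also an assumption, since the text only says that two terms whose similarity exceeds the threshold are written to the same row.

```python
from itertools import combinations


def merge_synonyms(terms, similarity, threshold=0.8):
    """Group terms whose pairwise similarity exceeds the threshold."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if similarity(a, b) > threshold:
            parent[find(b)] = find(a)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())


# Invented similarity scores; unknown pairs default to 0.0.
TOY_SCORES = {
    frozenset(("share", "total")): 0.85,
    frozenset(("record", "find")): 0.30,
}


def toy_similarity(a, b):
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


terms = ["share", "total", "record", "find"]
groups_08 = merge_synonyms(terms, toy_similarity, threshold=0.8)
groups_09 = merge_synonyms(terms, toy_similarity, threshold=0.9)
```

With these invented scores, "share" and "total" are merged at a threshold of 0.8 but stay separate at 0.9, which illustrates why a stricter threshold merges fewer terms.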
Example 2
Verifying the effectiveness of the invention
The model proposed by the invention is used to extract semantic elements of the accounting domain: a directed network graph is constructed from the accounting dictionary, semantic elements are extracted and domain knowledge is described using the improved PageRank algorithm (the PRFR algorithm), and the final candidate set of semantic elements is obtained by merging based on the synonym forest. A word-frequency-based method and a TF-IDF-based method are used as baseline experiments for comparative analysis.
(1) Word frequency based method
The word-frequency-based method counts the occurrences of each term, ranks the terms by frequency, and takes the top 50 terms as the semantic elements of the accounting domain, as shown in Table 4.
Table 4. Semantic element extraction based on the word frequency method
[Table 4 appears as an image in the original publication.]
Among the candidate words obtained by the word frequency method, 8 of the Top 10 ("enterprise", "accounting", "company", "income", "commodity", "cash", "payment" and "amount") are semantically too broad: they are also high-frequency words in other disciplines and do not represent the foundations of the accounting domain well. Moreover, "enterprise" and "company" are often treated as synonyms. Only "asset" and "cost" denote accounting elements and can serve as semantic elements of the accounting domain. Expanding the scope to the Top 30 candidates, only a few words such as "accounting statement", "audit" and "fee" can serve as semantic elements characterizing the accounting domain. Likewise, in the Top 50 sample, semantic primitives and non-primitive terms alternate. Therefore, although the word frequency method finds words that are frequent and popular in the domain, these are often cross-domain hypernyms or irrelevant out-of-domain words with little power to characterize a specific domain. A method that relies on word frequency alone is thus not ideal for identifying semantic elements; in particular, when only a small set of semantic elements is needed, frequency ranking cannot extract the basic words required in practice.
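The word frequency baseline amounts to counting and sorting. A minimal sketch, assuming the segmented definitions are available as lists of words (the function name and the toy data are invented):

```python
from collections import Counter


def top_k_by_frequency(definitions, k=50):
    """Rank words by how often they occur across all definition texts."""
    counts = Counter(word for words in definitions.values() for word in words)
    return [word for word, _ in counts.most_common(k)]


# Toy, made-up definitions.
definitions = {
    "asset": ["enterprise", "resource"],
    "income": ["enterprise", "inflow", "asset"],
    "rent": ["payment", "asset", "enterprise"],
}

top2 = top_k_by_frequency(definitions, k=2)
```

Even in this toy corpus the generic word "enterprise" ranks first, which mirrors the weakness described above.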
(2) TF-IDF-based method
The TF-IDF algorithm is used to rank the accounting terms, and a candidate set of semantic primitives for the accounting domain is obtained from the ranking. The 50 candidates with the highest TF-IDF values are shown in Table 5.
Table 5. Semantic element extraction based on the TF-IDF method
Candidate primitive        Rank    Candidate primitive    Rank
Accountant                 1       Currency unit          26
Gain                       2       Property               27
Assets                     3       Debt affairs           28
Construction engineering   4       Stock certificate      29
Shares of stock            5       Issue                  30
Accounting statement       6       Invoice                31
Cash                       7       Creditor               32
Payment                    8       Income                 33
Amount of money            9       Production             34
Cost                       10      Auditor                35
Recording                  11      Labor affairs          36
Sale                       12      Shareholder            37
Audit reports              13      Bond                   38
Account                    14      Portion                39
Security document          15      Debt                   40
Administration             16      Money order            41
Auditing                   17      Expenditure            42
Income of business         18      Profit and loss        43
Finance affairs            19      Bill                   44
Cost                       20      Financial status       45
Data                       21      Decision making        46
Capital                    22      Contract               47
Interest information       23      Manager                48
Contractor                 24      Property right         49
Bank                       25      Registration           50
Among the Top 10 candidates obtained with TF-IDF, three terms that represent accounting domain knowledge are newly identified: "construction project", "accounting report" and "stock". Over the whole Top 50 sample, the candidate sets obtained by the two methods overlap by 78%; that is, 39 candidate primitives appear in both, and the difference lies mainly in the rank order of some terms. Overall, the TF-IDF ranking is therefore slightly better than the word frequency ranking, since TF-IDF can move some low-frequency but important nodes toward the top. However, the heavy overlap between the two candidate sets shows that the TF-IDF index remains linearly related to word frequency, and the resulting candidate primitives still have limited ability to characterize accounting domain knowledge.
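A TF-IDF ranking of this kind can be sketched as follows. The text does not say how per-document scores are combined into one score per word, so summing TF-IDF over all definition texts is an assumption, as are the function name, the use of the natural logarithm for IDF and the toy data.

```python
import math
from collections import Counter


def tfidf_ranking(documents):
    """Rank words by their TF-IDF score summed over all documents."""
    n_docs = len(documents)
    doc_freq = Counter(word for words in documents.values() for word in set(words))
    scores = Counter()
    for words in documents.values():
        if not words:
            continue
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))


# Toy, made-up definition texts, one document per term.
documents = {
    "asset": ["enterprise", "resource", "resource"],
    "income": ["enterprise", "inflow"],
    "rent": ["enterprise", "payment"],
}

ranking = tfidf_ranking(documents)
```

In this toy corpus "enterprise" occurs in every definition, so its IDF is zero and it drops to the bottom of the ranking, while the less widespread "resource" rises to the top.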
(3) Method based on the model of the invention
A directed network graph is constructed from the co-occurrence relations in the accounting dictionary, the nodes are ranked with the PRFR algorithm, and the candidate set of accounting-domain semantic primitives is obtained from the ranking, as shown in Table 6.
Table 6. Semantic element extraction based on this model
[Table 6 appears as an image in the original publication.]
Among the candidate elements obtained with the model-based method, "share", "asset", "rent", "cost" and others all represent basic research directions and techniques of the accounting domain and can be defined as semantic elements of the domain; in the Top 10, only "adopt" and "information" are not domain vocabulary. Expanding the scope to the Top 30 candidates, only a few terms such as "written", "efficiency" and "none" are not domain vocabulary, although they can still serve as semantic elements of the accounting domain. In the Top 50 sample, domain elements make up a larger share of the candidate vocabulary than non-primitive terms, and the important domain elements are ranked higher. In addition, because the method merges candidates with the synonym forest, words with the same sense do not appear twice, and the resulting semantic primitives cover a wider range of senses. The model-based semantic element extraction method therefore performs better than word frequency and TF-IDF: it can find important knowledge units that are not frequent but sit at core nodes of the network, and most of the top-ranked terms are semantic elements. The proposed model is thus effective and feasible, and it is especially advantageous in tasks that require extracting a small set of semantic elements.
Quantitative evaluation based on blind selection experiment
The analysis above discusses the experimental results qualitatively. To evaluate the results quantitatively as well, the invention designs, with reference to other literature, a quantitative evaluation method based on a blind selection experiment. The blind selection experiment evaluates the results of three methods: word frequency, TF-IDF, and the model of the invention. The procedure is as follows: the semantic primitive sets obtained by the three methods are mixed and shuffled, giving 87 distinct candidate terms, and participants are invited to select from these candidates the terms that can represent the accounting field. The three invited participants are researchers with many years of research experience in the accounting field.
For each participant, the number and proportion of selected words that belong to the semantic primitive set of each of the three methods are counted. Because the three methods contribute the same number of terms to the candidate set, the method from which participants select more words can be considered the more effective one. The results of the blind selection experiment are shown in Table 7. Methods 1 to 3 correspond to the word-frequency method, the TF-IDF method, and the semantic primitive extraction method based on the model of the invention, respectively.
TABLE 7 Blind selection experimental results
Figure BDA0003269235210000191
In the blind selection experiment, the coincidence proportions of the word-frequency and TF-IDF methods are almost the same, whereas the coincidence proportion of the model-based method is far higher than both, with an average accuracy of 66.71%. To a certain extent, the semantic primitive extraction method of the model therefore fits the results of manual screening by experts better.
In practical applications, only a small basic vocabulary usually needs to be screened, so the accuracy of the three methods at the N-th position is further examined using the P(N) index (N = 10, 20, 30, 40, 50); the results are shown in Table 8.
TABLE 8 accuracy of blind selection experiment
Method P(10) P(20) P(30) P(40) P(50)
Method 1 0.37 0.60 0.68 0.63 0.62
Method 2 0.43 0.62 0.60 0.62 0.62
Method 3 0.73 0.75 0.69 0.73 0.73
The accuracy of the model-based method is higher than that of the word-frequency and TF-IDF methods at every position, with the largest margins at P(10) and P(20), and its average accuracy over the five positions reaches 72.6%. P(10) and P(20) reach 73% and 75% respectively, that is, about 7 of the first 10 candidate words belong to the domain primitives and 15 of the first 20 belong to the basic vocabulary, which is a good recognition result. TF-IDF is slightly higher than the word-frequency method on P(10) and P(20), while the two differ little on P(30), P(40), and P(50). This indicates that TF-IDF is superior to the word-frequency method when extracting a small set of semantic primitives, and that the difference between the two is not obvious when more results are returned.
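The P(N) index can be sketched in Python as below. The source does not give an explicit formula, so the sketch assumes that P(N) is the fraction of a method's first N candidates chosen by a participant, averaged over the participants. The function names are hypothetical. The last lines recompute the 72.6% average from the Method 3 row of Table 8.

```python
def precision_at_n(ranked, selected, n):
    """Fraction of the first n ranked candidates that a participant selected."""
    return sum(1 for w in ranked[:n] if w in selected) / n

def mean_precision_at_n(ranked, selections, n):
    """Average of P(n) over all participants (assumed aggregation)."""
    return sum(precision_at_n(ranked, s, n) for s in selections) / len(selections)

# Method 3 row of Table 8: P(10), P(20), P(30), P(40), P(50).
method3 = [0.73, 0.75, 0.69, 0.73, 0.73]
average = round(sum(method3) / len(method3), 3)
print(average)  # 0.726
```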
Overall, when the model-based method identifies domain semantic primitives, PageRank ranking finds the highly important domain primitives more reliably, and merging based on the synonym forest gives the resulting primitives wider word-sense coverage. This avoids the situation, seen with word frequency and TF-IDF, in which many semantically broad and repeated words are ranked near the front, so the method performs better and has higher application value for discovering domain primitives.
Expressive power of semantic primitives for the element list
Analysis of the vocabulary of the element list shows that the elements have a certain structural regularity, summarized below. The structure of an element (G) is mainly composed of a core word, a time modifier, a space modifier, a cause-and-effect modifier, a general modifier, a state indicator, and the like.
Structural categories of XBRL general classification standard financial information elements:
Term + instance word
Term + general attribute
Term + general attribute + instance word
Term + causal attribute
Term + causal attribute + instance word
Term + time attribute + instance word
Term + spatial attribute + instance word
Term + time attribute
Term + spatial attribute
Term + instance word and term + instance word
Term + instance word + general attribute
Term + time attribute + instance word + causal attribute
Term + time attribute + causal attribute
Term + time attribute + causal attribute + instance word
Term + time attribute + causal attribute + term + instance word
Term + general attribute + time attribute + instance word
Term + general attribute + time attribute + causal attribute + instance word
General attribute + term + general attribute + instance word
General attribute + term + instance word
General attribute + term + time attribute + instance word
General attribute + term + spatial attribute
General attribute + term + general attribute
General attribute + term + general attribute + time attribute + instance word
General attribute + term + causal attribute
General attribute + term + time attribute + causal attribute
General attribute + term + instance word + general attribute
General attribute + time attribute + term
General attribute + time attribute + term + instance word
General attribute + time attribute + causal attribute + term + instance word
General attribute + time attribute + term + instance word
General attribute + term + time attribute + instance word
General attribute + time attribute + term + instance word
General attribute + term + time attribute + instance word + general attribute
Time attribute + term
Time attribute + term + instance word
Time attribute + causal attribute + term + instance word
Time attribute + general attribute + term + instance word
Time attribute + spatial attribute + term + instance word
Time attribute + general attribute + term
Time attribute + causal attribute + term + causal attribute
Time attribute + term + time attribute + instance word
Spatial attribute + term
Spatial attribute + term + instance word
Spatial attribute + general attribute + term + instance word
Causal attribute + term
Causal attribute + term + instance word
Causal attribute + term + general attribute
Causal attribute + term + time attribute + instance word
Causal attribute + general attribute + term + instance word
For example:
G: fixed asset current-period reduction = <Hx: fixed asset, Sj: current period, Zx: reduction>
Here "fixed asset" and "current period" are accounting terms and "reduction" is a well-defined non-accounting term, so the expression of "fixed asset current-period reduction" in terms of the extracted semantic primitives is:
G: fixed asset current-period reduction = benefit period + year + over + asset + plant and equipment + accounting period + reduction
As the primitive expression above shows, the understandability of the extensible term is enhanced, and the primitives used in the expression summarize the attributes of "fixed asset current-period reduction" from different perspectives.
From the above analysis, after word segmentation the terms in the XBRL general classification standard financial information element list can be divided into accounting terms and non-accounting terms. Accounting terms correspond to entries in the accounting dictionary and have corresponding semantic primitive expressions, while non-accounting terms have definite definitions. Accounting terms outnumber non-accounting terms in the element list, which indicates that the semantic primitives can effectively express the elements in the element list.
To measure how strong this expressive power is, the intersection of the segmented element terms and the accounting dictionary is computed. The statistics show that the terms in the accounting dictionary fully cover the words obtained by segmenting the elements, so the extracted semantic primitives have strong expressive power for the element list.
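The coverage check described above amounts to a set intersection. A minimal Python sketch follows; the function name and the toy word lists are hypothetical and do not come from the actual element list or dictionary.

```python
def coverage(element_words, dictionary_terms):
    """Share of distinct segmented element words found in the dictionary."""
    words = set(element_words)
    if not words:
        return 0.0
    return len(words & set(dictionary_terms)) / len(words)

# Hypothetical example: words from segmented elements vs. dictionary entries.
element_words = ["fixed asset", "current period", "fixed asset", "lease"]
dictionary_terms = ["fixed asset", "current period", "lease", "stock"]
print(coverage(element_words, dictionary_terms))  # 1.0 means full coverage
```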
Expressive power of semantic primitives for instances
Analysis of the discourse characteristics of financial reports shows that a financial report is a document with a clear hierarchy and structure. It discloses financial information under titles at each level, and these titles correspond to the items of the basic accounting standards for enterprises, so the whole report has a tree structure with a strict internal logic. The text under each subtitle explains, paragraph by paragraph, the information related to the disclosed matter. To represent the knowledge in a financial report, the invention therefore uses the phrase-type chapter titles and paragraph subtitles so that the report can be read by machine. The specific steps are as follows:
Step 1: carrying out hierarchical division on the unstructured annual report document to obtain chapter titles and paragraph subtitles;
Step 2: performing word segmentation and part-of-speech tagging on the chapter titles and the paragraph subtitles, and taking vocabularies as processing units;
Step 3: obtaining corresponding primitive attributes based on the semantic primitive set to be used as the knowledge representation of the subtitles.
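Step 3 can be sketched in Python under stated assumptions. The input is a title that has already been segmented (claim 3 names jieba for that step; it is left out here to keep the sketch dependency-free). The mapping from each accounting term to its primitives is an illustrative guess at how the primitives in the "fixed asset" example earlier in this description divide between the two accounting terms. The function name and the mapping are hypothetical.

```python
def represent(words, primitive_map):
    """Expand each segmented word into its semantic primitives.
    Words without an entry (well-defined non-accounting terms) stand for
    themselves. Order is kept and repeated primitives are dropped."""
    primitives = []
    for word in words:
        for p in primitive_map.get(word, [word]):
            if p not in primitives:
                primitives.append(p)
    return primitives

# Assumed split of the primitives between the two accounting terms.
primitive_map = {
    "fixed asset": ["benefit period", "year", "over", "asset", "plant and equipment"],
    "current period": ["accounting period"],
}
title_words = ["fixed asset", "current period", "reduction"]
print(" + ".join(represent(title_words, primitive_map)))
```

Keeping unknown words as their own primitive is one possible design choice; it follows the description's treatment of non-accounting terms as having definite definitions of their own.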
In summary, the model is verified as follows: its superiority is analyzed by comparison in a qualitative experiment that uses word frequency and TF-IDF as baselines, its effectiveness is evaluated quantitatively in a blind selection experiment, and the expression of financial report knowledge is then completed based on the extracted semantic primitives. The results show that, when the model-based method identifies domain primitives, PRFR ranking finds the highly important domain primitives more reliably, and merging based on the synonym forest gives the resulting primitives wider word-sense coverage. This avoids the situation, seen with word frequency and TF-IDF, in which many semantically broad and repeated words are ranked near the front. Because a basic expression of financial reports can be realized with the semantic primitives, the method performs better and has higher application value for expressing domain knowledge.
Various modifications may be made by those skilled in the art based on the above teachings and concepts, and all such modifications are intended to be included within the scope of the present invention as defined in the appended claims.

Claims (6)

1. A method for constructing an accounting term co-occurrence network graph is characterized by comprising the steps of extracting semantic elements of an accounting field, constructing a directed network graph through an accounting dictionary of important professional linguistic data of the accounting field, extracting the semantic elements and describing domain knowledge by using an improved PageRank algorithm, merging based on synonym forest, and finally obtaining a candidate set of the semantic elements.
2. The method for constructing an accounting term co-occurrence network diagram according to claim 1, wherein the method specifically comprises:
S1, manually extracting and sorting the definition text of the accounting terms, and summarizing the definition text in Excel;
S2, performing text word segmentation, stop word removal and duplicate removal on the summary Excel in the step S1;
S3, constructing an accounting term directed network graph;
S4, constructing a network graph based on the accounting dictionary of the step S3, and then calculating the PageRank value of each node by using MATLAB R2016a as a basis for semantic primitive extraction;
S5, after calculating the words with higher PageRank values, merging semantic primitives based on the synonym forest to obtain a final candidate set of semantic primitives.
3. The method as claimed in claim 2, wherein in the step S2, the word segmentation is performed using the jieba package for Python; to ensure the completeness of the accounting terms, the accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is established, and the words in the definition text of each term are de-duplicated.
4. The method for constructing an accounting term co-occurrence network diagram according to claim 2, wherein in the step S3, the text is constructed into a directed cyclic graph according to the word segmentation result; the terms and the words of their segmented definition texts are taken as nodes, and a directed edge points from each term to each word of its definition text; that is, if a word B appears in the definition text of a word A, there is a directed edge from word A to word B.
5. The method as claimed in claim 2, wherein in step S5, the extracted semantic elements are concentrated in the non-accounting term set, and because the accounting dictionary uses varied language expressions in its definitions, the extracted semantic elements include different words with similar meanings, so these words need to be merged to better ensure the expression efficiency of the semantic elements.
6. The method for constructing an accounting term co-occurrence network diagram according to claim 2, wherein the core procedure in the step S4 is:
pr = centrality(G, 'pagerank', 'FollowProbability', 0.85)
G.Nodes.PageRank = pr
G.Nodes.InDegree = indegree(G)
G.Nodes.OutDegree = outdegree(G)
G.Nodes % view the PageRank score and degree information of each node
plot(G, 'NodeLabel', {}, 'NodeColor', [0.93 0.78 0], 'Layout', 'force')
title('PageRank') % plot the graph using a force-directed layout
pr = centrality(G, 'pagerank', 'MaxIterations', 200, 'FollowProbability', 0.85)
% the PageRank score of G is calculated using 200 iterations and a damping factor of 0.85; the scores and degree information are added to the node table of the graph
G.Nodes = sortrows(G.Nodes, 'PageRank', 'descend')
% sort in descending order of PR value
H = subgraph(G, find(G.Nodes.PageRank > 0.005))
plot(H, 'NodeLabel', {}, 'NodeCData', H.Nodes.PageRank, 'Layout', 'force')
title('PageRank')
colorbar
% extract and plot the subgraph containing all nodes with a score greater than 0.005, coloring the nodes according to their PageRank score
CN202111096537.9A 2021-09-18 2021-09-18 Method for constructing accounting term co-occurrence network diagram Pending CN113919342A (en)

Publications (1)

Publication Number Publication Date
CN113919342A true CN113919342A (en) 2022-01-11

Family

ID=79235723

Country Status (1)

Country Link
CN (1) CN113919342A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797635A (en) * 2020-07-14 2020-10-20 暨南大学 Semantic element extraction method for XBRL field ontology
CN112183110A (en) * 2020-09-28 2021-01-05 贵州云腾志远科技发展有限公司 Artificial intelligence data application system and application method based on data center
US20210097238A1 (en) * 2017-08-29 2021-04-01 Ping An Technology (Shenzhen) Co., Ltd. User keyword extraction device and method, and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
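Ye Di (叶迪): "Semantic primitive extraction method for XBRL domain ontology", China Excellent Doctoral and Master's Dissertations Full-text Database (Master's), Economics and Management Sciences *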

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220111