CN113919342A

CN113919342A - Method for constructing accounting term co-occurrence network diagram

Info

Publication number: CN113919342A
Application number: CN202111096537.9A
Authority: CN
Inventors: 潘定; 梁倬骞; 叶迪
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2022-01-11

Abstract

The invention discloses a method for constructing an accounting term co-occurrence network graph. To extract semantic elements of the accounting field, the method constructs a directed network graph over the words of an accounting dictionary, extracts semantic elements and describes domain knowledge using an improved PageRank algorithm, and finally obtains a candidate set of semantic elements for accounting terms by merging based on a synonym forest. Exploiting the characteristics of knowledge in the accounting field, the invention designs a graph-theory-based semantic element extraction method for the corpus of an accounting dictionary. As an important professional corpus and an authoritative normative text of the accounting field, the accounting dictionary systematically and comprehensively covers the relevant terms and definitions of the field. If a computer can read accounting text with the aid of semantic elements extracted from an accounting dictionary, a large amount of information in the accounting field can be used effectively. Term research based on the accounting dictionary therefore overcomes the limitations of subjective analysis and small sample data in semantic element extraction.

Description

Method for constructing accounting term co-occurrence network diagram

Technical Field

The invention relates to the technical field of machine readability of financial information, and in particular to a method for constructing an accounting term co-occurrence network diagram.

Background Art

At present, online financial reports in the accounting field lack a standardized description of knowledge. This makes it difficult for computers to read financial information and limits the breadth of use and the development prospects of online financial reporting formats such as XBRL (eXtensible Business Reporting Language). A few scholars have tried to solve the problem of semantic element extraction with currently popular machine learning algorithms. Although these methods effectively reduce labor and time costs, the extracted terms contain considerable noise, their domain characteristics are weak, and their validity cannot be verified. The present invention fills a gap in research on online financial reports by addressing the key problem of "extraction of core language" in XBRL financial reports. It introduces the concept of semantic elements, aiming to strengthen the semantic characteristics of knowledge expression in the accounting field and thereby improve the accuracy and efficiency with which machines identify information.

Successful extraction of semantic elements helps to raise the quality of the accounting general classification standard, to improve the machine readability of financial information, to improve the accuracy and efficiency with which stakeholders obtain financial information, to lower the technical barriers to the application and popularization of online financial reports, and to encourage enterprises to adopt online financial reports on their own initiative. From a longer-term and macroscopic perspective, the research of the invention can improve the accuracy and authenticity of information disclosure, can deter financial fraud by enterprises to a certain extent, helps to protect the legal rights and interests of stakeholders, and maintains the information quality of the market, and therefore has practical significance.

In the prior art, a few researchers have tried to solve the difficulty of ontology construction with currently popular machine learning algorithms. Although these methods effectively reduce labor and time costs, the extracted terms contain considerable noise, have weak domain characteristics, and lack practicability. Semantic primitive extraction methods are generally classified as based on linguistics, statistics, machine learning, graph theory and so on, but each has limitations, specifically:

1. Current research stays at the vocabulary level and does not go deep into the semantic level.

Related research shows that most current work on this problem stays at the vocabulary level. That is, the semantic material used to construct the ontology is regarded as the set of concepts required by the ontology plus redundant information, and the researchers clean and screen this information to some extent and finally take the keywords that conform to the index system as the concepts required by the ontology. This approach is limited by the selected semantic material. Linguistics-based extraction methods can only process semantic material of small scale, whereas methods based on statistics and machine learning can process large-scale text, but the extracted terms contain considerable noise, have weak domain characteristics, and lack semantic characteristics.

2. The degree of conformity with domain knowledge is insufficient.

Related research on semantic element extraction methods shows that the text material used for extraction either lacks professional authority or was processed without the participation of domain experts, so the extracted semantic elements fit domain knowledge poorly. The invention selects the accounting dictionary as an authoritative and comprehensive text material in the field of financial reporting and invites domain experts to take part in analyzing the effectiveness and superiority of the method, so as to ensure, as far as possible, a high degree of fit between the extracted semantic elements and domain knowledge.

3. The research perspective is narrow, and there are few cross-field research results.

In existing research, concept extraction mostly means taking keywords from text material, and the research process consists of removing redundant information from that material. The research perspective is therefore narrow, and there are few cross-field research results.

Disclosure of Invention

In view of the defects of the prior art, the invention discloses a method for constructing an accounting term co-occurrence network diagram. The dictionary serves as an authoritative normative text of the accounting field and systematically and comprehensively covers the relevant terms and definitions of the field. If a computer can read and understand the accounting dictionary, a large amount of information in the accounting field can be used effectively. Research based on the accounting dictionary therefore overcomes the limitations of subjective analysis and small sample data in semantic element extraction.

In order to achieve this technical purpose, the technical scheme adopted by the invention is as follows:

A method for constructing an accounting term co-occurrence network graph comprises extracting semantic elements of the accounting field by constructing a directed network graph from an accounting dictionary, extracting the semantic elements and describing domain knowledge using an improved PageRank algorithm, merging based on a synonym forest, and finally obtaining a candidate set of semantic elements.

It should be noted that the method specifically includes:

S1, manually extracting and organizing the definition text of the accounting terms, and collecting it in Excel;

S2, performing word segmentation, stop-word removal and de-duplication on the text collected in Excel in step S1;

S3, constructing a directed network graph of accounting terms;

S4, after the network graph has been constructed from the accounting dictionary in step S3, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic primitive extraction;

S5, after the words with higher PageRank values have been identified, merging semantic primitives based on the synonym forest to obtain the final candidate set of semantic primitives.

It should be noted that, in step S2, word segmentation is performed with the jieba package for Python. To keep accounting terms intact, the accounting terms in the accounting dictionary are imported into the custom dictionary; a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
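The preprocessing of step S2 can be sketched as follows. This is only an illustration under stated assumptions, not the inventors' program: to stay dependency-free, a greedy forward longest-match segmenter stands in for jieba, and the function names `segment` and `preprocess`, the small term list, the stop words and the sample definition are all hypothetical.

```python
def segment(text, vocabulary):
    """Greedy forward longest-match segmentation (a simple stand-in for jieba)."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i first;
        # fall back to a single character when nothing matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(definition, vocabulary, stop_words):
    """Segment a definition, drop stop words, de-duplicate in first-seen order."""
    seen, result = set(), []
    for token in segment(definition, vocabulary):
        if token in stop_words or token in seen:
            continue
        seen.add(token)
        result.append(token)
    return result


# Hypothetical inputs: the accounting terms act as the custom dictionary.
vocabulary = {"承租人", "出租人", "支付", "款项"}  # lessee, lessor, pay, payment
stop_words = {"向", "的"}                          # "to", possessive particle
definition = "承租人向出租人支付的款项"            # invented definition text
words = preprocess(definition, vocabulary, stop_words)
print(words)
```

Treating the accounting terms as dictionary entries is what keeps a multi-character term such as 承租人 from being split into single characters, which is the purpose of the custom dictionary described above.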

In step S3, a directed loop graph is constructed from the text according to the word segmentation result. Each headword and each word of the segmented definition text is taken as a node, and a directed edge runs from a headword to each of the words in its definition text. That is, if a word B appears in the definition text of a word A, there is a directed edge from A to B.
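The construction rule of step S3 can be illustrated with a short sketch. It assumes the segmented definitions are available as a mapping from headword to a list of definition words; the function name `build_term_graph` and the English sample entries are hypothetical, and dropping self-references is a choice made for this sketch only.

```python
def build_term_graph(definitions):
    """Return adjacency sets: headword -> words used in its definition."""
    graph = {}
    for head, words in definitions.items():
        graph.setdefault(head, set())
        for word in words:
            if word == head:
                continue  # assumption of this sketch: ignore self-references
            graph[head].add(word)          # directed edge head -> word
            graph.setdefault(word, set())  # definition words are nodes too
    return graph


# Invented sample entries (headword -> segmented definition words).
definitions = {
    "rent": ["lessee", "lessor", "payment"],
    "lessee": ["lease", "party"],
    "lease": ["contract", "rent"],
}
graph = build_term_graph(definitions)

# Degree information for each node.
out_degree = {node: len(targets) for node, targets in graph.items()}
in_degree = {node: 0 for node in graph}
for targets in graph.values():
    for target in targets:
        in_degree[target] += 1
```

In this sample, "rent", "lessee" and "lease" define one another in a cycle, while "payment" only appears inside definitions and therefore has an out-degree of 0.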

It should be noted that, in step S5, the extracted semantic elements are concentrated in the set of non-accounting terms. Because the accounting dictionary was compiled with varied styles of language expression, the extracted semantic elements include words that differ in form but have similar definitions. These words need to be merged so that the expressive efficiency of the semantic elements is ensured to a greater extent.

It should be noted that, the core program in step S4 in the present invention is:

pr = centrality(G,'pagerank','FollowProbability',0.85)
G.Nodes.PageRank = pr
G.Nodes.InDegree = indegree(G)
G.Nodes.OutDegree = outdegree(G)
G.Nodes % view the PR score and degree information for each node
plot(G,'NodeLabel',{},'NodeColor',[0.93 0.78 0],'Layout','force')
title('PageRank') % draw the graph using a force-directed layout

pr = centrality(G,'pagerank','MaxIterations',200,'FollowProbability',0.85)
% The PageRank scores of G are calculated using up to 200 iterations and a damping factor of 0.85; the scores and degree information are added to the node table of the graph
G.Nodes = sortrows(G.Nodes,'PageRank','descend')
% sort in descending order of PR value
H = subgraph(G,find(G.Nodes.PageRank>0.005))
plot(H,'NodeLabel',{},'NodeCData',H.Nodes.PageRank,'Layout','force')
title('PageRank')
colorbar
% Extract and draw the subgraph containing all nodes with scores greater than 0.005, coloring them according to the PageRank scores of the graph nodes.
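For readers without MATLAB, the score computed by the listing above can be approximated with a plain power iteration. The sketch below is not the patented PRFR variant and is not guaranteed to reproduce MATLAB's `centrality` output digit for digit; it assumes the common convention that the rank of nodes without outgoing edges is redistributed uniformly, and the function name, tolerance and sample graph are hypothetical.

```python
def pagerank(graph, follow_probability=0.85, max_iterations=200, tol=1e-10):
    """Power-iteration PageRank over adjacency sets {node: {targets}}.

    Every node must appear as a key, including nodes with no outgoing edges.
    """
    nodes = sorted(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - follow_probability) / n + follow_probability * dangling / n
        new_rank = {node: base for node in nodes}
        for source in nodes:
            targets = graph[source]
            if targets:
                share = follow_probability * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Invented three-node example: both other words point to "payment".
graph = {"rent": {"lessee", "payment"}, "lessee": {"payment"}, "payment": set()}
scores = pagerank(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The word that is used in other definitions but has no definition words of its own ends up with the highest score, which matches the observation that leaf nodes tend to have high PR values.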

The invention has the advantages that:

1. The expression characteristics of financial reports and financial information elements are analyzed, and the structural characteristics of the terms of financial information elements are summarized. First, qualitative and quantitative methods are combined to analyze the characteristics of financial reports at the structural and expression levels. Then, taking the element list of the XBRL general classification standard as the core corpus, the terms in the element list are divided manually to obtain their structural regularity: each term consists of a core word, which carries the main information, and additional modifying components, which express related attributes of the term. This structural characteristic provides guidance and a basis for extracting semantic elements.

2. Both the comprehensiveness and the scalability of semantic element extraction are considered. First, a directed graph of the accounting dictionary is constructed. Each node is in exactly one of two situations: it either lies on a loop or it does not. Nodes on a loop are therefore selected by PageRank value, and among nodes not on a loop those with an out-degree of 0 are selected, which ensures that semantic element extraction is comprehensive and scientifically grounded. In addition, the invention merges the preliminarily extracted semantic elements using the synonym forest, which ensures the expressive efficiency of the semantic elements to a greater extent, so that the largest range of domain knowledge is expressed with the smallest set of semantic elements.

Drawings

FIG. 1 shows the directed loop graph constructed from Table 1;

FIG. 2 is a loop diagram and an example of a PageRank value distribution according to the present invention.

Detailed Description of the Embodiments

The present invention will be further described with reference to the accompanying drawings. It should be noted that the present embodiment is based on the above technical solution and provides a detailed implementation and a specific operating process, but the protection scope of the present invention is not limited to this embodiment.

The invention relates to a method for constructing an accounting term co-occurrence network graph, which comprises the steps of extracting semantic elements in an accounting field, constructing a directed network graph through an accounting dictionary, extracting the semantic elements and describing domain knowledge by using an improved PageRank algorithm, and merging based on synonym forest to finally obtain a candidate set of the semantic elements.

Examples

Simulation experiment

The experimental data are taken from the English-Chinese Dictionary of Modern Finance and Accounting (China Financial and Economic Publishing House, 2009), from which 4289 accounting terms and 32086 terms were compiled as the accounting-field text for the experiment.

The programs and software mainly used for processing the data are Excel 2016, Python 3.7 and MATLAB R2016a. Excel is used for the structured arrangement of the accounting dictionary, the jieba package for Python is used to segment the definitions of the terms, and MATLAB is used to draw the directed loop graph and calculate the PageRank values. The specific work is as follows:

(1) Manually extract and organize the definition text of the accounting terms.

Analysis of the accounting dictionary text shows that the entry for an accounting term contains not only a definitional description but also non-definitional material such as examples and calculation formulas. This material is redundant for the extraction of semantic primitives, so the invention manually extracts and organizes the definitional text of each accounting term and collects it in Excel.

(2) Segment the text, remove stop words and remove duplicates.

Word segmentation is then performed with the jieba package for Python. To keep accounting terms intact, the 4289 accounting terms in the accounting dictionary are imported into the custom dictionary; a stop-word list is then established, and the words in the definition text of each term are de-duplicated.

Table 1 Example of word segmentation for part of the accounting dictionary

(3) Construct the directed network graph of accounting terms.

According to the word segmentation result, a directed loop graph can be constructed from the texts, as shown in FIG. 1. Each headword and each word of the segmented definition text is taken as a node, and a directed edge runs from a headword to each of the words in its definition text. That is, if another word B (such as rent) appears in the definition text of a word A (such as rent), there is a directed edge from A to B. These relations are drawn as a graph.

(4) Calculate the PageRank values.

After the network graph has been constructed from the accounting dictionary, the PageRank value of each node is calculated using MATLAB R2016a and used as the basis for semantic primitive extraction.

As can be seen from FIG. 2, the PR values of leaf nodes are generally higher. Because semantic primitives are used to explain other words and are not themselves explained, semantic primitives should be extracted at the leaf nodes, which is consistent with the analysis above. For nodes on a loop, the method selects the node with the highest PR value as the semantic primitive.

Preliminary extraction results

The ranking of semantic primitives obtained by screening on PageRank is as follows:

Table 2 Example of the top 20 accounting terms by PageRank

After the words with higher PR values have been identified, the following processing is needed to achieve both accuracy and a suitable scale of extraction:

(1) Semantic primitives are found mainly at the node with the maximum PR value in a loop and at the leaf nodes outside loops.

The first case is nodes with an out-degree of 0, such as "construction project", "share", "total", "find", "record" and "provision" in the table above. An out-degree of 0 indicates that these nodes are leaf nodes of the directed graph, and their PageRank values are high. In this case, however, the extracted nodes include synonyms, which enlarges the set of extracted semantic elements. The invention therefore merges terms with similar definitions based on the synonym forest; for example, "share" and "total" can be represented by just one of the two terms.

The second case is the node with the maximum PR value in a loop. For example, "asset", "income" and "shares" lie in the same loop with PR(asset) > PR(income) > PR(shares), so "asset" is extracted as the semantic primitive.
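The two selection rules above can be sketched in code. The sketch assumes that "loop" can be read as a strongly connected component with more than one node, found here with Tarjan's algorithm; that reading, the function names, and the PR values in the example are illustrative assumptions rather than details given in the text.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm (recursive, so suited to small illustrative graphs)."""
    index, low, on_stack, stack, components = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for target in graph[node]:
            if target not in index:
                visit(target)
                low[node] = min(low[node], low[target])
            elif target in on_stack:
                low[node] = min(low[node], index[target])
        if low[node] == index[node]:  # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def select_primitives(graph, scores):
    """Leaf nodes (out-degree 0) plus the highest-scoring node of each loop."""
    leaves = {node for node, targets in graph.items() if not targets}
    loop_heads = {
        max(component, key=scores.get)
        for component in strongly_connected_components(graph)
        if len(component) > 1
    }
    return leaves | loop_heads


# Invented graph and PR values mirroring the example in the text.
graph = {
    "asset": {"income"},
    "income": {"shares"},
    "shares": {"asset", "record"},
    "record": set(),
}
scores = {"asset": 0.30, "income": 0.25, "shares": 0.20, "record": 0.25}
primitives = select_primitives(graph, scores)
```

With these inputs, "asset" is chosen as the representative of the loop and "record" is chosen as a leaf node.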

(2) Semantic primitive merging based on the synonym forest

Semantic elements carry the ability to express domain knowledge, but if the set of semantic elements contains too many words, extracting them has little value. Both the accuracy and the efficiency of knowledge expression must therefore be ensured, so that the largest range of domain knowledge is expressed with the smallest set of semantic elements. The semantic elements extracted in this research are concentrated in the set of non-accounting terms. Because the accounting dictionary was compiled with varied styles of language expression, the extracted semantic elements include words that differ in form but have similar definitions. These words need to be merged so that the expressive efficiency of the semantic elements is ensured to a greater extent.

Synonym forest merger

To improve the efficiency of knowledge expression, the preliminary extraction results are merged based on the synonym forest. The WordSimilarity module is installed for calculating similarity, all terms in the Excel file of preliminary extraction results are read, and 0.8 is chosen as the similarity threshold in the invention to ensure maximum application efficiency. If the similarity of two terms is greater than 0.8, they are written to the same row of a new Excel file.

Some of the semantic primitives merged using the synonym forest are shown in Table 3.

Table 3 Synonym forest merging

With this similarity calculation method, a high score may indicate either high similarity or merely high relatedness. For example, with the threshold set to 0.8, "other person" is grouped with "oneself" and "self", whereas with 0.9 it is not. With 0.9, however, far fewer terms can be grouped together.
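The effect of the threshold can be illustrated with a small grouping routine. The sketch takes the similarity function as a parameter, since the scoring interface of the module used in the invention is not described here; the lookup table of scores is invented, and merging transitively with a union-find structure is an assumption of this sketch.

```python
def merge_synonyms(terms, similarity, threshold=0.8):
    """Group terms whose pairwise similarity exceeds the threshold."""
    parent = {term: term for term in terms}

    def find(term):
        # Follow parent links to the group representative (with path halving).
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for i, first in enumerate(terms):
        for second in terms[i + 1:]:
            if similarity(first, second) > threshold:
                parent[find(second)] = find(first)

    groups = {}
    for term in terms:
        groups.setdefault(find(term), []).append(term)
    return list(groups.values())


# Hypothetical scores standing in for a synonym-forest similarity function.
toy_scores = {
    frozenset(("share", "total")): 0.95,
    frozenset(("other person", "oneself")): 0.85,
}


def toy_similarity(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)


terms = ["share", "total", "other person", "oneself", "record"]
groups_08 = merge_synonyms(terms, toy_similarity, threshold=0.8)
groups_09 = merge_synonyms(terms, toy_similarity, threshold=0.9)
```

Each resulting group corresponds to one row of the merged output. Raising the threshold from 0.8 to 0.9 separates the loosely related pair while keeping the closer pair together.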

Some terms are not merged in Excel, mainly for the following reasons:

(1) Most of them are proper nouns that are not included in the lexicon encoded by WordSimilarity, so their similarity cannot be calculated.

(2) Many of them are compound words made up of two words, such as "employee fraud" and "buyer and seller", which makes the similarity difficult to calculate.

Example 2

Verifying the effectiveness of the invention

The model provided by the invention is used to extract semantic elements of the accounting field: a directed network graph is constructed for an accounting dictionary, an improved PageRank algorithm, namely the PRFR algorithm, is used to extract the semantic elements and describe the domain knowledge, and the final candidate set of semantic elements is obtained by merging based on the synonym forest. A word-frequency-based method and a TF-IDF-based method are used as reference experiments for comparative analysis.

(1) Word-frequency-based method

The word-frequency-based method counts the occurrences of each term, ranks the terms by frequency, and takes the top 50 terms as semantic elements of the accounting field, as shown in Table 4.

Table 4 Semantic element extraction based on the word-frequency method

Among the candidate words obtained by the word-frequency method, 8 of the Top 10 ("enterprise", "accounting", "company", "income", "commodity", "cash", "payment" and "amount") are semantically too broad: they are also high-frequency words in other disciplines and do not represent the foundations of the accounting field well, and "enterprise" and "company" are often treated as synonyms. Only "asset" and "cost" denote accounting elements of the field and can serve as its semantic elements. Expanding the scope to the Top 30 candidates, only words such as "accounting statement", "audit" and "fee" can serve as semantic elements characterizing the accounting field. Likewise, in the Top 50 sample, semantic primitives and non-primitive terms alternate. Thus, although the word-frequency method finds words that are frequent and heavily studied in the field, these are often cross-domain hypernyms or irrelevant words from outside the field, and they characterize the foundations of a specific field poorly. Relying on word frequency alone is therefore not ideal for identifying and studying semantic elements. In particular, when only a small set of semantic elements is to be selected as the research object, word-frequency ranking cannot extract the basic vocabulary needed in practice.
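A word-frequency baseline of this kind can be sketched in a few lines. The sketch assumes the input is a list of already segmented definition texts and counts every occurrence; the function name and the sample data are hypothetical, and how ties were broken in the original experiment is not stated.

```python
from collections import Counter


def top_by_frequency(tokenized_definitions, k):
    """Rank words by total number of occurrences and return the top k."""
    counts = Counter()
    for tokens in tokenized_definitions:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]


# Invented sample of segmented definitions.
sample = [
    ["enterprise", "asset", "cost"],
    ["enterprise", "income"],
    ["enterprise", "asset"],
]
top_two = top_by_frequency(sample, 2)
print(top_two)
```

Even in this tiny sample, the broad word "enterprise" outranks the accounting element "asset", which is the weakness of frequency ranking discussed above.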

(2) TF-IDF-based method

The TF-IDF algorithm is used to rank the accounting terms, and a candidate set of semantic primitives for the accounting field is obtained from the ranking. The top 50 candidate terms by TF-IDF value are shown in Table 5.

Candidate primitives	Ranking	Candidate primitives	Ranking
Accountant	1	Currency unit	26
Gain of	2	Property and its use	27
Assets	3	Debt affairs	28
Construction engineering	4	Stock certificate	29
Shares of stock	5	Issue(s)	30
Accounting statement	6	Invoice	31
Cash money	7	Creditor	32
Payment	8	Income (R)	33
Amount of money	9	Production of	34
Cost of	10	Audit teacher	35
Recording	11	Labor affairs	36
Sale	12	Shareholder	37
Audit reports	13	Bond and its making method	38
Account	14	Portion(s) of	39
Security document	15	Debt	40
Administration	16	Money order	41
Auditing	17	Expenditure of	42
Income of business	18	Profit and loss	43
Finance affairs	19	Bill	44
Cost of	20	Financial status	45
Data of	21	Decision making	46
Capital	22	Contract (contract)	47
Interest information	23	Manager	48
Contractor	24	Property right	49
Bank	25	Registration	50

In the Top 10 candidates obtained by TF-IDF, 3 terms that can represent knowledge of the accounting field are newly identified: "construction project", "accounting report" and "stock". In the overall Top 50 sample, the overlap between the candidate sets of semantic primitives obtained by the two methods is 78%; that is, 39 candidate primitives appear in both sets, and the difference lies in the changed rank order of some terms. Overall, the TF-IDF ranking is therefore slightly better than the word-frequency ranking, since TF-IDF can move some low-frequency but more important nodes toward the front. At the same time, the heavy overlap between the two candidate sets shows that the TF-IDF index is still linearly related to word frequency, so the ability of the resulting candidate primitives to characterize knowledge in the accounting field remains limited.
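The text does not state which TF-IDF variant was used. The sketch below assumes one common form, in which each definition text is a document and a word's score is its term frequency times log(N / document frequency), summed over all documents; the function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Sum over documents of tf(word, doc) * log(N / df(word))."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = Counter()
    for tokens in documents:
        counts = Counter(tokens)
        for word, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    return dict(scores)


# Invented sample of segmented, non-empty definitions.
documents = [
    ["enterprise", "asset", "asset"],
    ["enterprise", "income"],
    ["enterprise", "cost", "asset"],
]
scores = tfidf_scores(documents)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under this variant a word that occurs in every definition, such as "enterprise" here, receives a score of zero, so it drops to the bottom of the ranking, whereas a word that is frequent but not ubiquitous still ranks high.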

(3) Method based on the model of the invention

A directed network graph is constructed from the co-occurrence relations in the accounting dictionary, the nodes are ranked using the PRFR algorithm, and a candidate set of semantic primitives for the accounting field is obtained from the ranking, as shown in Table 6.

Table 6 Semantic element extraction based on the model of the invention

Among the candidate elements obtained by the model-based method, "share", "asset", "rent", "cost" and others all represent basic research directions and techniques of the accounting field and can be defined as semantic elements of the field; in the Top 10, only "adopt" and "information" are not domain vocabulary. Expanding the scope to the Top 30 candidates, only a few terms such as "written", "efficiency" and "none" are not domain vocabulary, and even these can serve as semantic elements of the accounting field. In the Top 50 sample, domain elements make up a larger share of the candidate vocabulary than non-primitive terms, and the important domain elements are ranked higher. In addition, because the method of the invention merges candidates using the synonym forest, words with the same sense do not appear twice, and the resulting semantic primitives cover a wider range of senses. The semantic element extraction method based on the model therefore performs better than word frequency and TF-IDF. It can find important knowledge units that are not frequent but sit at core nodes of the network, and most of the top-ranked terms are semantic elements. The model method provided by the invention is thus effective and feasible, and it shows a greater advantage in tasks that require extracting a small set of semantic elements.

Quantitative evaluation based on a blind selection experiment

The analysis above discusses the experimental results qualitatively. To evaluate the results of the method quantitatively, the invention designs, with reference to other literature, a quantitative evaluation method based on a blind selection experiment. The blind selection experiment evaluates the results of three experiments: word frequency, TF-IDF and the model method. The evaluation proceeds as follows: the semantic element sets obtained by the three experiments are mixed and shuffled to obtain 87 distinct candidate terms, and the experimenters are invited to select from these candidates the terms that can represent the accounting field. The invitees are three researchers with many years of research experience who are engaged in related research in the accounting field.

For the words selected by each experimenter, the number and proportion of semantic elements attributable to each of the three methods are counted. Since the three methods contribute an equal number of terms to the candidate set, a method can be considered to perform better the more of its words the experimenters select. The results of the blind selection experiment are shown in Table 7. Methods 1 to 3 correspond to the word-frequency-based method, the TF-IDF-based method and the semantic element extraction method based on the model, respectively.

Table 7 Blind selection experimental results

Among the semantic elements obtained through the blind selection experiment, the overlap proportions of the traditional word-frequency and TF-IDF methods are almost the same, while the overlap proportion of the model-based method is far higher than both. The average accuracy of the model method reaches 66.71%, so the semantic element extraction method of the model fits the results of manual screening by experts better to a certain extent.

In practical applications, only a small part of the basic vocabulary needs to be screened, so the accuracy of the three methods within the top N positions is further observed using the P(N) index (N = 10, 20, 30, 40, 50). The results are shown in Table 8.

Table 8 Accuracy of the blind selection experiment

Method	P(10)	P(20)	P(30)	P(40)	P(50)
Method 1	0.37	0.60	0.68	0.63	0.62
Method 2	0.43	0.62	0.60	0.62	0.62
Method 3	0.73	0.75	0.69	0.73	0.73

The accuracy of the model-based method is higher than that of the word-frequency method and the TF-IDF method at every position, although only marginally so at P(30) (0.69 against 0.68 for the word-frequency method), and its average accuracy reaches 72.6%. Its P(10) and P(20) values reach 73% and 75% respectively; that is, about 7 of the first 10 candidate words are domain elements and 15 of the first 20 candidate words belong to the basic vocabulary, which is a good recognition result. TF-IDF is slightly higher than the word-frequency method on P(10) and P(20), while on P(30), P(40) and P(50) it offers no advantage. This indicates that TF-IDF is superior to the word-frequency method when extracting a small set of semantic elements, and that the difference between the two methods is not obvious when the number of returned results is large.
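The P(N) index and the average quoted above can be checked with a short calculation. The precision function below is a generic sketch (its name and the five-word sample ranking are hypothetical); the three lists of P(N) values are copied from Table 8.

```python
def precision_at_n(ranked_terms, relevant_terms, n):
    """Share of the first n ranked terms that the judges marked as relevant."""
    top = ranked_terms[:n]
    return sum(1 for term in top if term in relevant_terms) / len(top)


# Invented ranking and judgment set, for illustration only.
ranked = ["share", "asset", "rent", "cost", "adopt"]
relevant = {"share", "asset", "rent", "cost"}
p5 = precision_at_n(ranked, relevant, 5)  # 4 of 5 terms are relevant

# P(10) ... P(50) for each method, as listed in Table 8.
table_8 = {
    "Method 1": [0.37, 0.60, 0.68, 0.63, 0.62],
    "Method 2": [0.43, 0.62, 0.60, 0.62, 0.62],
    "Method 3": [0.73, 0.75, 0.69, 0.73, 0.73],
}
averages = {name: sum(values) / len(values) for name, values in table_8.items()}
for name, value in averages.items():
    print(name, round(value, 3))
```

The mean of the Method 3 row is 0.726, which agrees with the 72.6% average accuracy stated in the text; the corresponding means for Methods 1 and 2 are 0.580 and 0.578.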

In a whole view, when the method based on the model identifies the domain semantic elements, the domain elements with high importance can be found better through PageRank ranking, the semantic element word meaning coverage obtained based on synonym forest combination is larger, the condition that a large number of words with wide semantics and repetition in the result obtained by depending on word frequency and TF-IDF are ranked ahead is avoided, and the method has better expression and higher application value in finding the domain elements.

Semantic primitive to element manifest expressiveness

Based on the analysis of the vocabulary characteristics of the element list, the elements are found to have certain structural regularity, and the specific structure is summarized as follows. The structure of the element (G) is mainly composed of a core word, a time modifier, a space modifier, a cause-and-effect modifier, a general modifier, a state indicator and the like.

Structural categories of the financial information elements in the XBRL general classification standard

Term + instance word

Term + general attribute

Term + general attribute + instance word

Term + causal attribute

Term + causal attribute + instance word

Term + time attribute + instance word

Term + spatial attribute + instance word

Term + time attribute

Term + spatial attribute

Term + instance word and term + instance word

Term + instance word + general attribute

Term + time attribute + instance word + causal attribute

Term + time attribute + causal attribute

Term + time attribute + causal attribute + instance word

Term + time attribute + causal attribute + term + instance word

Term + general attribute + time attribute + instance word

Term + general attribute + time attribute + causal attribute + instance word

General attribute + term + general attribute + instance word

General attribute + term + instance word

General attribute + term + time attribute + instance word

General attribute + term + spatial attribute

General attribute + term + general attribute

General attribute + term + general attribute + time attribute + instance word

General attribute + term + causal attribute

General attribute + term + time attribute + causal attribute

General attribute + term + instance word + general attribute

General attribute + time attribute + term

General attribute + time attribute + term + instance word

General attribute + time attribute + causal attribute + term + instance word

General attribute + time attribute + term + instance word

General attribute + term + time attribute + instance word

General attribute + time attribute + term + instance word

General attribute + term + time attribute + instance word + general attribute

Time attribute + term

Time attribute + term + instance word

Time attribute + causal attribute + term + instance word

Time attribute + general attribute + term + instance word

Time attribute + spatial attribute + term + instance word

Time attribute + general attribute + term

Time attribute + causal attribute + term + causal attribute

Time attribute + term + time attribute + instance word

Spatial attribute + term

Spatial attribute + term + instance word

Spatial attribute + general attribute + term + instance word

Causal attribute + term

Causal attribute + term + instance word

Causal attribute + term + general attribute

Causal attribute + term + time attribute + instance word

Causal attribute + general attribute + term + instance word

For example:

G: current-period reduction of fixed assets = <Hx: fixed asset, Sj: current period, Zx: reduction>

Here "fixed asset" and "current period" are accounting terms, and "reduction" is a non-accounting term with a clear definition. Based on the extracted semantic primitives, "current-period reduction of fixed assets" is therefore expressed in primitives as:

G: current-period reduction of fixed assets = benefit period + year + over + asset + factory building and equipment + accounting period + reduction

As the primitive expression above shows, the expanded term becomes easier to understand, and the primitives used in the expression summarize the attributes of the example element from different perspectives.
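The expansion in this example can be sketched in Python as a lookup that replaces each accounting term by its primitives and keeps non-accounting words unchanged. The dictionary below is hypothetical: its entries are taken from the worked example, and the way the primitives are split between the two accounting terms is an assumption made for illustration, not the extracted data.

```python
# Hypothetical primitive dictionary: accounting term -> semantic primitives.
# The split of primitives between the two terms is assumed for illustration.
PRIMITIVES = {
    "fixed asset": ["benefit period", "year", "over", "asset",
                    "factory building and equipment"],
    "current period": ["accounting period"],
}


def expand_element(components):
    """Replace accounting terms by their primitives; keep other words as-is."""
    expanded = []
    for _role, word in components:
        expanded.extend(PRIMITIVES.get(word, [word]))
    return expanded


# Structural roles as tagged in the example: Hx, Sj, Zx.
element = [("Hx", "fixed asset"), ("Sj", "current period"), ("Zx", "reduction")]
expression = " + ".join(expand_element(element))
```

Here `expression` reproduces the right-hand side of the primitive expression given in the example, and a word without a dictionary entry, such as "reduction", passes through unchanged.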

The analysis above shows that, after word segmentation, the words in the financial information element list of the XBRL general classification standard fall into accounting terms and non-accounting terms. The accounting terms correspond to entries in the accounting dictionary and therefore have semantic primitive expressions, while the non-accounting terms have clear definitions. The accounting terms in the element list also outnumber the non-accounting terms, which indicates that the semantic primitives can effectively express the elements in the element list.

To measure the strength of this expression, the intersection of the segmented element words and the dictionary terms is computed. Statistics show that the terms in the accounting dictionary fully cover the words obtained by segmenting the elements, so the extracted semantic primitives have a strong capability to express the element list.
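One way to read this measurement is as a coverage ratio: the size of the intersection divided by the number of distinct segmented words. The Python sketch below assumes that reading. The function name and the toy word lists are hypothetical and do not reproduce the statistics of the invention.

```python
def coverage(element_words, dictionary_terms):
    """Share of distinct element words that also occur in the dictionary."""
    words = set(element_words)
    if not words:
        return 0.0
    return len(words & set(dictionary_terms)) / len(words)


# Hypothetical toy data.
dictionary_terms = {"fixed asset", "current period", "depreciation"}
covered = coverage(["fixed asset", "current period"], dictionary_terms)
partial = coverage(["fixed asset", "current period", "reduction"], dictionary_terms)
```

A ratio of 1.0 corresponds to the full coverage described above; a lower value would point to segmented words that have no dictionary entry.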

Expressiveness of semantic primitives for instances

An analysis of the discourse characteristics of financial reports shows that a financial report is a text with clear levels and a clear structure. It discloses financial information under titles at each level, and these titles correspond to the items of the basic accounting standards for enterprises, so the whole report forms a tree structure with a strict internal logic. The text under each subtitle is organized in paragraphs that explain the information related to the disclosed event. Therefore, to realize knowledge representation of financial reports, the present invention uses the chapter titles and the phrase-type subtitles within paragraphs so that a machine can read the report. The specific steps are as follows:

Step 1: divide the unstructured annual report document hierarchically to obtain the chapter titles and paragraph subtitles;

Step 2: perform word segmentation and part-of-speech tagging on the chapter titles and paragraph subtitles, taking words as the processing unit;

Step 3: obtain the corresponding primitive attributes from the semantic primitive set and use them as the knowledge representation of the subtitles.
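The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: titles are assumed to carry a numeric prefix such as "1" or "1.1" (a hypothetical convention), whitespace splitting stands in for a real word segmenter, part-of-speech tagging is omitted, and the primitive set is toy data.

```python
import re

# Hypothetical numbering convention: "1 Title", "1.1 Subtitle", ...
# A body line that happens to start with a number would also match.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def extract_headings(lines):
    """Step 1: return (level, title) pairs for numbered title lines."""
    headings = []
    for line in lines:
        match = HEADING.match(line.strip())
        if match:
            level = match.group(1).count(".") + 1
            headings.append((level, match.group(2)))
    return headings


def represent(title, primitive_set, segment=str.split):
    """Steps 2 and 3: segment a title and look up primitives for each word."""
    tokens = segment(title.lower())
    return {tok: primitive_set[tok] for tok in tokens if tok in primitive_set}


# Hypothetical toy report and primitive set.
report = [
    "1 Assets",
    "The company holds the following assets.",
    "1.1 Inventory valuation",
    "Inventory is measured at cost.",
]
PRIMS = {"assets": ["resource", "control", "benefit"],
         "inventory": ["goods", "sale"]}

headings = extract_headings(report)
knowledge = {title: represent(title, PRIMS) for _level, title in headings}
```

The `segment` parameter is the point where a segmenter with a custom accounting dictionary would be plugged in; the level returned by step 1 preserves the tree structure of the report.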

Finally, the effectiveness of the model is verified as described above: a qualitative experiment with word frequency and TF-IDF as baselines analyzes the advantages of the model by comparison, a blind selection experiment evaluates its effectiveness quantitatively, and the knowledge in financial reports is then expressed with the extracted semantic primitives. The results show that, when the model-based method identifies domain primitives, PRFR ranking finds the most important domain primitives more reliably, and merging on the basis of the synonym forest widens the word-sense coverage of the semantic primitives. This avoids the situation, seen in the results of word frequency and TF-IDF, in which many semantically broad and repetitive words rank near the top. A basic expression of financial reports can also be achieved with the semantic primitives, so the method performs better and has higher application value in expressing domain knowledge.

Various modifications may be made by those skilled in the art based on the above teachings and concepts, and all such modifications are intended to be included within the scope of the present invention as defined in the appended claims.

Claims

1. A method for constructing an accounting term co-occurrence network graph, characterized by comprising: extracting semantic elements of the accounting field by constructing a directed network graph from an accounting dictionary, which is an important professional corpus of the accounting field; extracting the semantic elements and describing domain knowledge by using an improved PageRank algorithm; and merging based on a synonym forest to finally obtain a candidate set of the semantic elements.

2. The method for constructing an accounting term co-occurrence network diagram according to claim 1, wherein the method specifically comprises:

S3, constructing an accounting term directed network graph;

3. The method as claimed in claim 2, wherein in the step S2, word segmentation is performed using the jieba package for Python; to ensure the completeness of the accounting terms, the accounting terms in the accounting dictionary are imported into a custom dictionary, a stop-word list is established, and the words in the definition text of each term are de-duplicated.

4. The method for constructing an accounting term co-occurrence network diagram according to claim 2, wherein in the step S3, the text is constructed into a directed cyclic graph according to the word segmentation result; the terms and the words of the segmented definition texts are taken as nodes, and a directed edge points from each term to each word of its definition text; in particular, if another term B appears in the definition text of a term A, there is a directed edge pointing from term A to term B.

5. The method as claimed in claim 2, wherein in step S5, the extracted semantic elements are concentrated in the set of non-accounting terms; because the accounting dictionary was compiled with varied wording, the extracted semantic elements include different words with similar meanings, and these words are merged to ensure the expression efficiency of the semantic elements to a greater extent.

6. The method for constructing an accounting term co-occurrence network diagram according to claim 2, wherein the core procedure in the step S4 is:

pr = centrality(G,'pagerank','FollowProbability',0.85)

G.Nodes.PageRank = pr

G.Nodes.InDegree = indegree(G)

G.Nodes.OutDegree = outdegree(G)

G.Nodes % view the PR score and degree information of each node

title('PageRank') % draw the chart using a force layout

G.Nodes = sortrows(G.Nodes,'PageRank','descend')

% sort by PR value in descending order

H = subgraph(G,find(G.Nodes.PageRank>0.005))

title('PageRank')

colorbar
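For readers without MATLAB, the graph construction of step S3 and the ranking of step S4 can be approximated in plain Python. The sketch below is an assumption-laden illustration: it implements standard PageRank by power iteration with a follow probability of 0.85, not the improved variant of the invention, and the function names and toy definitions are hypothetical.

```python
def build_graph(definitions):
    """Directed graph: each term points to the words of its definition text."""
    graph = {}
    for term, words in definitions.items():
        graph.setdefault(term, set())
        for word in words:
            if word != term:
                graph[term].add(word)
                graph.setdefault(word, set())
    return graph


def pagerank(graph, follow_probability=0.85, iterations=100):
    """Standard PageRank; rank of nodes without out-edges is spread uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1 - follow_probability) / n + follow_probability * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if graph[node]:
                share = follow_probability * rank[node] / len(graph[node])
                for target in graph[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy definitions (term -> segmented, de-duplicated definition words).
definitions = {
    "fixed asset": ["asset", "benefit period", "year"],
    "current asset": ["asset", "year"],
    "asset": ["resource", "enterprise"],
}
graph = build_graph(definitions)
ranks = pagerank(graph)
ranked_words = sorted(ranks, key=ranks.get, reverse=True)
```

Words that appear in many definition texts accumulate rank from the terms that point to them, which is why definition words tend to rise above the terms they define.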