CN111797635A

CN111797635A - Semantic element extraction method for XBRL field ontology

Info

Publication number: CN111797635A
Application number: CN202010677371.9A
Authority: CN
Inventors: 潘定; 叶迪; 梁倬骞
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2020-07-14
Filing date: 2020-07-14
Publication date: 2020-10-20

Abstract

The invention discloses a semantic element extraction method for an XBRL domain ontology, comprising the following steps: step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary; step 2, performing word segmentation, stop-word removal and de-duplication on the texts; step 3, constructing a directed network graph of accounting terms; step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction. The method addresses a shortcoming of existing approaches that attempt semantic element extraction with currently popular machine learning algorithms: although those approaches effectively reduce labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.

Description

Semantic element extraction method for XBRL domain ontology

Technical Field

The invention relates to the technical field of XBRL domain ontologies, and in particular to a semantic element extraction method for XBRL domain ontologies.

Background

A domain ontology is a formal specification of a shared conceptual model in a specific domain. It reflects the knowledge structure of the domain by representing concepts and the relations between them, and it helps improve human-computer interaction and information exchange between machines. In the financial reporting domain, the XBRL domain ontology is a system of financial reporting terms and related instances built on the principles of sharing and formalization, and is therefore also called a formal ontology. The required classification standard (taxonomy) can be generated automatically from the XBRL domain ontology, and reasoning and checking over financial data are supported, so research on the XBRL domain ontology is very meaningful. At present, however, no systematic and complete ontology has been built for the financial reporting domain; ontology-based financial reporting research mostly concentrates on discussing and simply verifying a theoretical process and has not been realized as a working system. The main reasons are that no professional concept system guides the application of tags in the XBRL domain, and that the semantics of concepts in XBRL financial reports are weak, which affects the production and data sharing of XBRL financial reports.

The XBRL domain currently lacks a standardized knowledge description, which makes it difficult for computers to interpret XBRL financial information and hinders the breadth of use and the development prospects of XBRL. Existing work attempts to solve the semantic element extraction problem with currently popular machine learning algorithms; although this effectively reduces labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.

Disclosure of Invention

In view of the deficiencies of the prior art, the invention provides a semantic element extraction method for an XBRL domain ontology. It addresses the problem that approaches based on currently popular machine learning algorithms, although they effectively reduce labor and time costs, yield extracted terms that contain a large amount of noise, lack prominent domain characteristics, and cannot be verified for validity.

To achieve the above purpose, the invention adopts the following technical scheme: a semantic element extraction method for an XBRL domain ontology, comprising the following steps:

step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;

step 2, performing word segmentation, stop-word removal and de-duplication on the texts;

step 3, constructing a directed network graph of accounting terms;

step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction;

step 5, merging the semantic elements based on the synonym forest.

Preferably, in step 1, the definition texts of the accounting terms are manually extracted and collated, and are compiled in Excel.

Preferably, in step 2, word segmentation is performed with the jieba package for Python: 4 accounting terms from the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
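The preprocessing in step 2 can be pictured with a minimal Python sketch. The function name `preprocess_definition`, the toy stop-word set and the English example sentence are illustrative assumptions, not taken from the patent. The segmenter is passed in as a parameter so that the sketch runs without third-party packages.

```python
from typing import Callable, Iterable, List, Set


def preprocess_definition(
    text: str,
    segment: Callable[[str], Iterable[str]],
    stop_words: Set[str],
) -> List[str]:
    """Segment a definition text, drop stop words, and remove duplicates.

    The order of first occurrence is preserved.
    """
    seen = set()
    result = []
    for token in segment(text):
        token = token.strip()
        # Skip empty tokens, stop words, and words already collected.
        if not token or token in stop_words or token in seen:
            continue
        seen.add(token)
        result.append(token)
    return result


# Toy demonstration with a whitespace segmenter (hypothetical data).
toy_stop_words = {"the", "of", "a", "an", "and", "that", "is"}
tokens = preprocess_definition(
    "the excess of assets over liabilities of the entity",
    str.split,
    toy_stop_words,
)
print(tokens)
```

With jieba installed, `jieba.lcut` could be passed as `segment`, after the accounting terms have been registered with `jieba.add_word(term)` or `jieba.load_userdict(path)` so that multi-character terms are not split.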

Preferably, in step 3, the graph is constructed as follows: the terms and the words of their segmented definition texts serve as nodes, and directed edges run from each term to the words of its definition text. Specifically, if a word B appears in the definition text of a term A, there is a directed edge from A to B.
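The construction rule of step 3 can be sketched in Python as an adjacency map. The function name `build_term_graph` and the three toy definitions are hypothetical; the input is assumed to be the output of step 2, that is, one de-duplicated word list per term.

```python
from typing import Dict, List, Set


def build_term_graph(definitions: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Return an adjacency map: node -> set of nodes it points to.

    An edge A -> B is added whenever word B appears in the definition of
    term A. Every word becomes a node, even if it has no definition.
    """
    graph: Dict[str, Set[str]] = {}
    for term, words in definitions.items():
        graph.setdefault(term, set())
        for word in words:
            graph[term].add(word)
            graph.setdefault(word, set())
    return graph


# Hypothetical, already segmented definitions.
toy_definitions = {
    "equity": ["assets", "liabilities"],
    "assets": ["resource", "entity"],
    "liabilities": ["obligation", "entity"],
}
graph = build_term_graph(toy_definitions)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Words such as "entity" that are used in definitions but are not themselves defined end up with no outgoing edges, which is what makes them leaf nodes in the later steps.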

Preferably, in step 4, the semantic elements are found at the node with the largest PageRank value in a loop and at the leaf nodes that are not in a loop.
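One possible reading of this selection rule is sketched below in Python. It rests on two assumptions that the patent does not spell out: a "loop" is taken to be a strongly connected component that contains a cycle, and a "leaf node" is taken to be a node with no outgoing edges. The PageRank values are supplied as a plain dictionary with hypothetical numbers.

```python
from typing import Dict, List, Set


def strongly_connected_components(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Tarjan's algorithm (recursive, adequate for small example graphs)."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[List[str]] = []

    def visit(node: str) -> None:
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:
            # node is the root of a component: pop it off the stack.
            component = []
            while True:
                top = stack.pop()
                on_stack.discard(top)
                component.append(top)
                if top == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components


def select_semantic_elements(
    graph: Dict[str, Set[str]], scores: Dict[str, float]
) -> Set[str]:
    """Pick the top-scoring node of each loop plus all leaf nodes."""
    selected: Set[str] = set()
    for component in strongly_connected_components(graph):
        first = component[0]
        is_loop = len(component) > 1 or first in graph.get(first, ())
        if is_loop:
            selected.add(max(component, key=lambda n: scores.get(n, 0.0)))
    # Leaf nodes: no outgoing edges, so they cannot lie on a loop.
    selected.update(node for node, out in graph.items() if not out)
    return selected


# Hypothetical graph: A and B define each other, C is a leaf.
toy_graph = {"A": {"B"}, "B": {"A", "C"}, "C": set(), "D": {"C"}}
toy_scores = {"A": 0.2, "B": 0.3, "C": 0.4, "D": 0.1}
elements = select_semantic_elements(toy_graph, toy_scores)
print(sorted(elements))
```

In this toy graph, D is neither on a loop nor a leaf, so it is not selected; its meaning is assumed to be expressible through the words it points to.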

Preferably, in step 5, words with different definitions in similar forms that exist among the extracted semantic elements are merged.
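As a rough illustration of step 5, the Python sketch below assumes the synonym forest has already been loaded into a hypothetical mapping from word to group code; the actual file format of the thesaurus is not reproduced here. The group codes, the example words and the choice of the alphabetically first word as representative are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def merge_semantic_elements(
    elements: Iterable[str],
    synonym_group: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group elements that share a thesaurus group code.

    Returns a map from a representative word (alphabetically first, an
    arbitrary choice for this sketch) to all members of its group.
    """
    buckets = defaultdict(set)
    for word in elements:
        # Words absent from the thesaurus stay in a singleton bucket.
        if word in synonym_group:
            key = ("group", synonym_group[word])
        else:
            key = ("word", word)
        buckets[key].add(word)
    merged: Dict[str, List[str]] = {}
    for members in buckets.values():
        ordered = sorted(members)
        merged[ordered[0]] = ordered
    return merged


# Hypothetical thesaurus groups and extracted elements.
toy_groups = {"cost": "G01", "expense": "G01", "income": "G02", "revenue": "G02"}
merged = merge_semantic_elements(["cost", "expense", "revenue", "asset"], toy_groups)
print(merged)
```

Merging in this way reduces the number of semantic elements without reducing the range of concepts they cover.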

Preferably, Excel is used for the structured arrangement of the accounting dictionary.

Advantageous effects

The invention provides a semantic element extraction method for an XBRL field ontology. The method has the following beneficial effects:

With this semantic element extraction method for an XBRL domain ontology, the semantic elements are merged based on the synonym forest, which largely guarantees their expressive efficiency: the largest range of domain knowledge is expressed with the smallest set of semantic elements. This addresses the problem that existing machine-learning-based approaches, although they effectively reduce labor and time costs, yield terms that contain a large amount of noise, lack prominent domain characteristics, and cannot be verified for validity.

Drawings

FIG. 1 is a flowchart of the semantic element extraction method oriented to XBRL domain ontology according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the present invention provides a technical solution: a semantic element extraction method for an XBRL domain ontology, comprising the following steps:

Step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary.

Step 2, performing word segmentation, stop-word removal and de-duplication on the texts.

Step 3, constructing a directed network graph of accounting terms.

Step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction.

Step 5, merging the semantic elements based on the synonym forest.

Further, in step 1, the definition texts of the accounting terms are manually extracted and collated, and are compiled in Excel.

Further, in step 2, word segmentation is performed with the jieba package for Python: 4 accounting terms from the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.

Further, in step 3, the graph is constructed as follows: the terms and the words of their segmented definition texts serve as nodes, and directed edges run from each term to the words of its definition text. Specifically, if a word B appears in the definition text of a term A, there is a directed edge from A to B.

Further, in step 4, the semantic elements are found at the node with the largest PageRank value in a loop and at the leaf nodes that are not in a loop.
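The patent computes the PageRank values with MATLAB R2016a. Purely as an illustration of what that computation involves, the Python sketch below uses a plain power iteration as a stand-in. The damping factor of 0.85, the convergence tolerance and the uniform redistribution of rank from nodes without outgoing edges are conventional choices assumed here, not parameters stated in the patent. The graph is assumed to be an adjacency map in which every node appears as a key.

```python
from typing import Dict, Set


def pagerank(
    graph: Dict[str, Set[str]],
    damping: float = 0.85,
    tol: float = 1e-10,
    max_iter: int = 200,
) -> Dict[str, float]:
    """Power-iteration PageRank on an adjacency map (node -> successors)."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            successors = graph[v]
            if successors:
                share = damping * rank[v] / len(successors)
                for w in successors:
                    new_rank[w] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical chain: A is defined via B, B via C, C is a leaf.
chain = {"A": {"B"}, "B": {"C"}, "C": set()}
chain_scores = pagerank(chain)
for node in sorted(chain_scores, key=chain_scores.get, reverse=True):
    print(node, round(chain_scores[node], 4))
```

Because edges point from a term to the words that define it, rank accumulates on the words that many definitions depend on, which is why high PageRank values and leaf nodes are natural candidates for semantic elements.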

Further, in step 5, words with different definitions in similar forms that exist among the extracted semantic elements are merged.

Further, Excel is used for the structured arrangement of the accounting dictionary.

In this embodiment, the semantic element extraction method for an XBRL domain ontology proceeds as follows. In step 1, the definition texts of accounting terms are extracted and collated from an accounting dictionary; the definition texts are extracted and collated manually and compiled in Excel, which is used for the structured arrangement of the accounting dictionary.

In step 2, word segmentation, stop-word removal and de-duplication are performed on the texts. Specifically, word segmentation is performed with the jieba package for Python, 4 accounting terms from the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.

In step 3, a directed network graph of accounting terms is constructed. The terms and the words of their segmented definition texts serve as nodes, and directed edges run from each term to the words of its definition text. Specifically, if a word B appears in the definition text of a term A, there is a directed edge from A to B.

In step 4, after the network graph has been constructed from the accounting dictionary, the PageRank value of each node is calculated using MATLAB R2016a as the basis for semantic element extraction. The semantic elements are found at the node with the largest PageRank value in a loop and at the leaf nodes that are not in a loop.

In step 5, the semantic elements are merged based on the synonym forest: words with different definitions in similar forms that exist among the extracted semantic elements are merged.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

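1. A semantic element extraction method for an XBRL domain ontology, comprising the following steps:

step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;

step 2, performing word segmentation, stop-word removal and de-duplication on the texts;

step 3, constructing a directed network graph of accounting terms;

step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction;

step 5, merging the semantic elements based on the synonym forest.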

2. The semantic element extraction method for an XBRL domain ontology according to claim 1, characterized in that: in step 1, the definition texts of the accounting terms are manually extracted and collated, and are compiled in Excel.

3. The semantic element extraction method for an XBRL domain ontology according to claim 1, characterized in that: in step 2, word segmentation is performed with the jieba package for Python, 4 accounting terms from the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.

4. The semantic element extraction method for an XBRL domain ontology according to claim 1, characterized in that: in step 3, the terms and the words of their segmented definition texts serve as nodes, and directed edges run from each term to the words of its definition text; specifically, if a word B appears in the definition text of a term A, there is a directed edge from A to B.

5. The semantic element extraction method for an XBRL domain ontology according to claim 1, characterized in that: in step 4, the semantic elements are found at the node with the largest PageRank value in a loop and at the leaf nodes that are not in a loop.

6. The semantic element extraction method for an XBRL domain ontology according to claim 1, characterized in that: in step 5, words with different definitions in similar forms that exist among the extracted semantic elements are merged.

7. The semantic element extraction method for an XBRL domain ontology according to claim 2, characterized in that: Excel is used for the structured arrangement of the accounting dictionary.