CN111797635A - Semantic element extraction method for XBRL field ontology - Google Patents

Info

Publication number
CN111797635A
CN111797635A CN202010677371.9A CN202010677371A CN111797635A CN 111797635 A CN111797635 A CN 111797635A CN 202010677371 A CN202010677371 A CN 202010677371A CN 111797635 A CN111797635 A CN 111797635A
Authority
CN
China
Prior art keywords
xbrl
accounting
extraction method
semantic
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010677371.9A
Other languages
Chinese (zh)
Inventor
潘定
叶迪
梁倬骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202010677371.9A
Publication of CN111797635A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic element extraction method for an XBRL domain ontology, comprising the following steps: step 1, extracting and organizing the definition texts of accounting terms from an accounting dictionary; step 2, performing word segmentation, stop-word removal and de-duplication on the texts; step 3, constructing a directed network graph of accounting terms; and step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a and using it as the basis for semantic element extraction. The method addresses a shortcoming of existing attempts to extract semantic elements with currently popular machine learning algorithms: although those approaches effectively reduce labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.

Description

Semantic element extraction method for XBRL field ontology
Technical Field
The invention relates to the technical field of XBRL domain ontologies, and in particular to a semantic element extraction method for an XBRL domain ontology.
Background
A domain ontology is a formal specification of a shared conceptual model in a specific domain. It reflects the knowledge structure of the domain by representing concepts and the relations between them, and it helps to improve human-computer interaction and information exchange between machines. Applied to the financial reporting domain, the XBRL domain ontology is a system of financial reporting terms and related instances built on the principles of sharing and formalization, and is therefore also called a formal ontology. The required taxonomy (classification standard) can be generated automatically from the XBRL domain ontology, and reasoning over and checking of financial data are supported, so research on the XBRL domain ontology is highly worthwhile. At present, however, no systematic and complete ontology has been built for the financial reporting domain, and ontology-based financial reporting research mostly focuses on discussion and simple verification of the theoretical process and has not been implemented as a system. The main reasons are that no professional concept system guides the application of tags in the XBRL domain and that the semantics of the concepts in XBRL financial reports are weak, which affects the production of XBRL financial reports and the sharing of their data.
The XBRL domain currently lacks a standardized knowledge description, which makes it difficult for computers to read XBRL financial information and hinders the breadth of use and the development prospects of XBRL. Attempts have been made to extract semantic elements with currently popular machine learning algorithms. Although this effectively reduces labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.
Disclosure of Invention
To address the deficiencies of the prior art, the invention provides a semantic element extraction method for an XBRL domain ontology. It solves the problem that, when semantic elements are extracted with currently popular machine learning algorithms, labor and time costs are effectively reduced but the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.
To achieve this purpose, the invention adopts the following technical solution: a semantic element extraction method for an XBRL domain ontology, comprising the following steps:
step 1, extracting and organizing the definition texts of accounting terms from an accounting dictionary;
step 2, performing word segmentation, stop-word removal and de-duplication on the texts;
step 3, constructing a directed network graph of accounting terms;
step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a as the basis for semantic element extraction;
step 5, merging the semantic elements based on the synonym forest.
Preferably, in step 1, the definition texts of the accounting terms are manually extracted, organized, and compiled in Excel.
Preferably, in step 2, word segmentation is performed with the jieba package for Python: the 4 accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
Preferably, in step 3, the graph is constructed by taking the terms and the words of their segmented definition texts as nodes, with a directed edge from each term to each of the words in its definition text. That is, if another term B appears in the definition text of a term A, there is a directed edge from A to B.
Preferably, in step 4, the semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside loops.
Preferably, in step 5, the extracted semantic elements contain words that are similar to one another despite differing in form or definition, and these words are merged.
Preferably, Excel is used to organize the accounting dictionary in structured form.
Advantageous effects
The invention provides a semantic element extraction method for an XBRL domain ontology. The method has the following beneficial effects:
In this semantic element extraction method for an XBRL domain ontology, the semantic elements are merged based on the synonym forest, which largely preserves the expressive efficiency of the semantic elements: the widest range of domain knowledge is expressed with the smallest set of semantic elements. This solves the problem that extraction based on currently popular machine learning algorithms, although it effectively reduces labor and time costs, yields terms that contain a large amount of noise, lack prominent domain characteristics, and cannot be verified for validity.
Drawings
FIG. 1 is a flowchart of the semantic element extraction method oriented to XBRL domain ontology according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention provides a technical solution: a semantic element extraction method for an XBRL domain ontology, comprising the following steps:
step 1, extracting and organizing the definition texts of accounting terms from an accounting dictionary;
step 2, performing word segmentation, stop-word removal and de-duplication on the texts;
step 3, constructing a directed network graph of accounting terms;
step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a as the basis for semantic element extraction;
step 5, merging the semantic elements based on the synonym forest.
Further, in step 1, the definition texts of the accounting terms are manually extracted, organized, and compiled in Excel.
Further, in step 2, word segmentation is performed with the jieba package for Python: the 4 accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
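By way of non-limiting illustration, the processing of step 2 may be sketched in Python as follows. The function names, the sample text and the stop-word list are hypothetical and are not taken from the invention. The segmenter is passed in as a parameter so that the sketch runs without third-party packages; `make_jieba_segmenter` shows how jieba's `add_word` and `lcut` functions could supply it, assuming the jieba package is installed.

```python
from typing import Callable, Iterable, List, Set


def preprocess_definition(
    text: str,
    segment: Callable[[str], Iterable[str]],
    stop_words: Set[str],
) -> List[str]:
    """Segment one definition text, drop stop words, and de-duplicate.

    The first occurrence of each word is kept, so the original order survives.
    """
    seen: Set[str] = set()
    words: List[str] = []
    for token in segment(text):
        token = token.strip()
        if not token or token in stop_words or token in seen:
            continue
        seen.add(token)
        words.append(token)
    return words


def make_jieba_segmenter(custom_terms: Iterable[str]) -> Callable[[str], List[str]]:
    """Return jieba's list-returning cutter after registering accounting terms.

    Requires the third-party jieba package (pip install jieba).
    """
    import jieba  # imported lazily so the rest of the sketch runs without it

    for term in custom_terms:
        jieba.add_word(term)  # keep multi-character accounting terms in one piece
    return jieba.lcut


# Stand-in demonstration: a whitespace segmenter and a hypothetical English
# stop-word list. With jieba installed, pass make_jieba_segmenter([...]) as
# the `segment` argument instead of str.split.
demo_stop_words = {"is", "a", "by", "the"}
demo_words = preprocess_definition(
    "asset is a resource controlled by the entity is", str.split, demo_stop_words
)
print(demo_words)
```

Keeping the first occurrence of each word, rather than converting the tokens to a set, is a design choice of this sketch: it makes the output reproducible and keeps the definition readable when it is later inspected by hand.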
Further, in step 3, the graph is constructed by taking the terms and the words of their segmented definition texts as nodes, with a directed edge from each term to each of the words in its definition text. That is, if another term B appears in the definition text of a term A, there is a directed edge from A to B.
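By way of non-limiting illustration, the graph construction of step 3 may be sketched in Python as an adjacency mapping from each node to the set of nodes it points to. The function name and the sample definitions are hypothetical. Skipping a term that appears in its own definition is an assumption of this sketch, not something the description specifies.

```python
from typing import Dict, List, Set


def build_term_graph(definitions: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build a directed graph with an edge A -> B when B occurs in A's definition.

    `definitions` maps each term to the words of its processed definition text.
    Every term and every definition word becomes a node; words that have no
    definition of their own end up with an empty successor set.
    """
    graph: Dict[str, Set[str]] = {}
    for term, words in definitions.items():
        graph.setdefault(term, set())
        for word in words:
            if word == term:
                continue  # assumption: ignore self-references
            graph[term].add(word)
            graph.setdefault(word, set())
    return graph


# Hypothetical, already segmented definitions.
demo_definitions = {
    "asset": ["resource", "entity"],
    "resource": ["asset", "right"],
    "liability": ["asset"],
}
demo_graph = build_term_graph(demo_definitions)
for node in sorted(demo_graph):
    print(node, "->", sorted(demo_graph[node]))
```

In this sample, "asset" and "resource" define each other and therefore form a loop, while "entity" and "right" are leaf nodes because they have no outgoing edges.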
Further, in step 4, the semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside loops.
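By way of non-limiting illustration, the selection rule of step 4 may be sketched as follows. The invention computes PageRank with MATLAB R2016a; this sketch substitutes a plain power iteration in Python, with a damping factor of 0.85 chosen as an assumption. It also assumes that a "loop" means a strongly connected component with more than one node and that a leaf node is a node with no outgoing edges. All names and the sample graph are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # node -> successors; every successor is also a key


def pagerank(graph: Graph, damping: float = 0.85, iterations: int = 100,
             tol: float = 1e-12) -> Dict[str, float]:
    """Power-iteration PageRank; rank of nodes without out-edges is spread evenly."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def strongly_connected_components(graph: Graph) -> List[Set[str]]:
    """Tarjan's algorithm (recursive, adequate for small dictionary graphs)."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[Set[str]] = []

    def visit(v: str) -> None:
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component: Set[str] = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def select_semantic_elements(graph: Graph, rank: Dict[str, float]) -> Set[str]:
    """Pick the top-ranked node of each loop plus every leaf node outside loops."""
    selected: Set[str] = set()
    for component in strongly_connected_components(graph):
        if len(component) > 1:
            selected.add(max(sorted(component), key=rank.__getitem__))
        else:
            (node,) = component
            if not graph[node]:
                selected.add(node)
    return selected


demo_graph: Graph = {
    "asset": {"resource", "entity"},
    "resource": {"asset", "right"},
    "liability": {"asset"},
    "entity": set(),
    "right": set(),
}
demo_rank = pagerank(demo_graph)
demo_selected = select_semantic_elements(demo_graph, demo_rank)
print(sorted(demo_selected))
```

In the sample graph, "asset" and "resource" form a loop; "asset" receives an additional incoming edge from "liability", so it carries the larger PageRank value and represents the loop. The term "liability" is neither in a loop nor a leaf, so it is not selected.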
Further, in step 5, the extracted semantic elements contain words that are similar to one another despite differing in form or definition, and these words are merged.
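By way of non-limiting illustration, the merging of step 5 may be sketched as grouping the extracted semantic elements by a synonym-group identifier looked up in a thesaurus. The mapping `group_of`, its group codes and the sample words are hypothetical placeholders; the actual format of the synonym forest is not specified here. Keeping the first word of each group as its representative is likewise an assumption of this sketch.

```python
from typing import Dict, Iterable, List


def merge_by_synonym_group(
    elements: Iterable[str],
    group_of: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group semantic elements that share a synonym-group identifier.

    Words that the thesaurus does not cover form a group of their own,
    keyed by the word itself. Insertion order is preserved.
    """
    groups: Dict[str, List[str]] = {}
    for word in elements:
        key = group_of.get(word, word)
        members = groups.setdefault(key, [])
        if word not in members:
            members.append(word)
    return groups


def representatives(groups: Dict[str, List[str]]) -> List[str]:
    """Assumption: the first member encountered stands for the whole group."""
    return [members[0] for members in groups.values()]


# Hypothetical thesaurus lookup: word -> synonym-group code.
demo_group_of = {"income": "G1", "revenue": "G1", "expense": "G2", "cost": "G2"}
demo_elements = ["income", "cost", "revenue", "entity", "expense"]

demo_groups = merge_by_synonym_group(demo_elements, demo_group_of)
demo_merged = representatives(demo_groups)
print(demo_groups)
print(demo_merged)
```

Merging in this way reduces five extracted words to three semantic elements in the sample, which reflects the stated aim of covering the domain with as small a set of semantic elements as possible.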
Further, Excel is used to organize the accounting dictionary in structured form.
In this embodiment, the semantic element extraction method for an XBRL domain ontology proceeds as follows. Step 1: the definition texts of accounting terms are extracted and organized from an accounting dictionary. Specifically, the definition texts are manually extracted, organized, and compiled in Excel, where Excel is used to organize the accounting dictionary in structured form.
Step 2: word segmentation, stop-word removal and de-duplication are performed on the texts. Specifically, word segmentation is performed with the jieba package for Python; the 4 accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
Step 3: a directed network graph of accounting terms is constructed. Specifically, the terms and the words of their segmented definition texts are taken as nodes, with a directed edge from each term to each of the words in its definition text. That is, if another term B appears in the definition text of a term A, there is a directed edge from A to B.
Step 4: after the network graph has been constructed from the accounting dictionary, the PageRank value of each node is calculated with MATLAB R2016a as the basis for semantic element extraction. The semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside loops.
Step 5: the semantic elements are merged based on the synonym forest. Specifically, the extracted semantic elements contain words that are similar to one another despite differing in form or definition, and these words are merged.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A semantic element extraction method for an XBRL domain ontology, comprising the following steps:
step 1, extracting and organizing the definition texts of accounting terms from an accounting dictionary;
step 2, performing word segmentation, stop-word removal and de-duplication on the texts;
step 3, constructing a directed network graph of accounting terms;
step 4, after the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a as the basis for semantic element extraction;
step 5, merging the semantic elements based on the synonym forest.
2. The semantic element extraction method for an XBRL domain ontology according to claim 1, wherein in step 1 the definition texts of the accounting terms are manually extracted, organized, and compiled in Excel.
3. The semantic element extraction method for an XBRL domain ontology according to claim 1, wherein in step 2 word segmentation is performed with the jieba package for Python, the 4 accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
4. The semantic element extraction method for an XBRL domain ontology according to claim 1, wherein in step 3 the terms and the words of their segmented definition texts are taken as nodes, with a directed edge from each term to each of the words in its definition text, such that if another term B appears in the definition text of a term A, there is a directed edge from A to B.
5. The semantic element extraction method for an XBRL domain ontology according to claim 1, wherein in step 4 the semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside loops.
6. The semantic element extraction method for an XBRL domain ontology according to claim 1, wherein in step 5 the extracted semantic elements contain words that are similar to one another despite differing in form or definition, and these words are merged.
7. The semantic element extraction method for an XBRL domain ontology according to claim 2, wherein Excel is used to organize the accounting dictionary in structured form.
CN202010677371.9A 2020-07-14 2020-07-14 Semantic element extraction method for XBRL field ontology Pending CN111797635A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010677371.9A CN111797635A (en) 2020-07-14 2020-07-14 Semantic element extraction method for XBRL field ontology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010677371.9A CN111797635A (en) 2020-07-14 2020-07-14 Semantic element extraction method for XBRL field ontology

Publications (1)

Publication Number Publication Date
CN111797635A (en) 2020-10-20

Family

ID=72806991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010677371.9A Pending CN111797635A (en) 2020-07-14 2020-07-14 Semantic element extraction method for XBRL field ontology

Country Status (1)

Country Link
CN (1) CN111797635A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919342A (en) * 2021-09-18 2022-01-11 暨南大学 Method for constructing accounting term co-occurrence network diagram

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DI YE et al.: "Semantic Primitives Extraction for XBRL Domain Ontology", Modern Economy, vol. 11, no. 3, pages 686-700 *
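冯丽 (Feng Li): "The connotation of word-sense primitives and their role in the construction of synonym groups" (词义基元的内涵及其在同义词群建构中的作用), no. 3, pages 113-116 *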

Similar Documents

Publication Publication Date Title
WO2022022045A1 (en) Knowledge graph-based text comparison method and apparatus, device, and storage medium
CN102591988B (en) Short text classification method based on semantic graphs
Deshwal et al. Twitter sentiment analysis using various classification algorithms
CN110851598B (en) Text classification method and device, terminal equipment and storage medium
CN107229668A (en) A kind of text extracting method based on Keywords matching
CN101127042A (en) Sensibility classification method based on language model
CN109471933A (en) A kind of generation method of text snippet, storage medium and server
Nagamanjula et al. A novel framework based on bi-objective optimization and LAN2FIS for Twitter sentiment analysis
CN111460170B (en) Word recognition method, device, terminal equipment and storage medium
CN104899230A (en) Public opinion hotspot automatic monitoring system
CN112101971B (en) Sensitive client identification method, system and storage medium
CN108021582B (en) Internet public opinion monitoring method and device
CN110909540B (en) Method and device for identifying new words of short message spam and electronic equipment
CN113641833B (en) Service demand matching method and device
Bharathi et al. Sentiment Analysis of Twitter and RSS News Feeds and Its Impact on Stock Market Prediction.
CN110222192A (en) Corpus method for building up and device
CN115186650B (en) Data detection method and related device
CN112328657A (en) Feature derivation method, feature derivation device, computer equipment and medium
CN111190873B (en) Log mode extraction method and system for log training of cloud native system
CN106372237A (en) Fraudulent mail identification method and device
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN111797635A (en) Semantic element extraction method for XBRL field ontology
US20200097605A1 (en) Machine learning techniques for automatic validation of events
CN115495587A (en) Alarm analysis method and device based on knowledge graph
Nguyen et al. Structural reranking models for named entity recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination