CN111797635A - Semantic element extraction method for XBRL field ontology - Google Patents
Semantic element extraction method for XBRL field ontology
- Publication number
- CN111797635A (application number CN202010677371.9A)
- Authority
- CN
- China
- Prior art keywords
- xbrl
- accounting
- extraction method
- semantic
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a semantic element extraction method for an XBRL field ontology, comprising the following steps: step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary; step 2, performing word segmentation, stop-word removal, and de-duplication on the texts; step 3, constructing a directed network graph of accounting terms; and step 4, once the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction. The method addresses a shortcoming of existing attempts to extract semantic elements with currently popular machine learning algorithms: although those approaches effectively reduce labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.
Description
Technical Field
The invention relates to the technical field of XBRL domain ontologies, and in particular to a semantic element extraction method for an XBRL field ontology.
Background
A domain ontology is a formal specification of a shared conceptual model in a specific domain. It reflects the knowledge structure of the domain by representing concepts and the relations among them, and it helps to enhance human-computer interaction and information exchange between machines. Applied to the financial reporting domain, the XBRL domain ontology is a system of financial reporting terms and related instances built on the principles of sharing and formalization, and is therefore also called a formal ontology. The XBRL domain ontology can be used to generate the required taxonomy (classification standard) automatically and supports reasoning over and checking of financial data, so research on it is highly valuable. At present, however, no systematic and complete ontology has been built for the financial reporting domain; ontology-based financial reporting research mostly concentrates on discussion and simple verification of the theoretical process and has not been realized as a working system. The main reasons are that no professional concept system guides the application of tags in the XBRL domain, and that the concepts in XBRL financial reports carry weak semantics, which hinders the production and data sharing of XBRL financial reports.
The XBRL domain currently lacks a standardized knowledge description, which makes it difficult for computers to read XBRL financial information and limits the breadth of use and the development prospects of XBRL. Attempts have been made to extract semantic elements with currently popular machine learning algorithms; although these approaches effectively reduce labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.
Disclosure of Invention
To address the defects of the prior art, the invention provides a semantic element extraction method for an XBRL field ontology. It solves the problem that, although existing attempts to extract semantic elements with currently popular machine learning algorithms effectively reduce labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.
To achieve the above purpose, the invention adopts the following technical solution: a semantic element extraction method for an XBRL field ontology, comprising the following steps:
step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;
step 2, performing word segmentation, stop-word removal, and de-duplication on the texts;
step 3, constructing a directed network graph of accounting terms;
step 4, once the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction;
step 5, merging the semantic elements based on the synonym forest.
Preferably, in step 1, the definition texts of the accounting terms are manually extracted and collated, and are compiled in Excel.
Preferably, in step 2, word segmentation is performed with the jieba package for Python; 4 accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
Preferably, in step 3, the graph is constructed by taking each term and the words of its segmented definition text as nodes, with a directed edge from the term to each word of its definition text. In particular, if another term B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.
Preferably, in step 4, the semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside the loops.
Preferably, in step 5, words among the extracted semantic elements that have different definitions but similar forms are merged.
Preferably, Excel is used for the structured collation of the accounting dictionary.
Advantageous effects
The invention provides a semantic element extraction method for an XBRL field ontology. The method has the following beneficial effects:
In this semantic element extraction method for an XBRL field ontology, the semantic elements are merged based on the synonym forest, which largely guarantees their expressive efficiency, so that the largest range of domain knowledge is expressed with the smallest set of semantic elements. This solves the problem that, although approaches based on currently popular machine learning algorithms effectively reduce labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.
Drawings
FIG. 1 is a flowchart of the semantic element extraction method oriented to XBRL domain ontology according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention provides the following technical solution: a semantic element extraction method for an XBRL field ontology, comprising the following steps:
step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;
step 2, performing word segmentation, stop-word removal, and de-duplication on the texts;
step 3, constructing a directed network graph of accounting terms;
step 4, once the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction;
step 5, merging the semantic elements based on the synonym forest.
Further, in step 1, the definition texts of the accounting terms are manually extracted and collated, and are compiled in Excel.
Further, in step 2, word segmentation is performed with the jieba package for Python; 4 accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
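As an illustration of step 2, the following Python sketch segments a definition text, removes stop words, and de-duplicates the remaining words while preserving their order. The function name `preprocess_definition`, the toy English definition, and the stop-word set are assumptions made for illustration only. Whitespace splitting stands in for the tokenizer so that the sketch runs without extra packages; with jieba installed, `jieba.lcut` could be passed as `tokenize` after the accounting terms have been registered through `jieba.add_word` or `jieba.load_userdict`.

```python
from typing import Callable, Iterable, List, Set


def preprocess_definition(
    text: str,
    stop_words: Set[str],
    tokenize: Callable[[str], Iterable[str]],
) -> List[str]:
    """Segment a definition, drop stop words, and de-duplicate the rest."""
    seen: Set[str] = set()
    kept: List[str] = []
    for token in tokenize(text):
        token = token.strip()
        # Skip empty tokens, stop words, and words already kept.
        if not token or token in stop_words or token in seen:
            continue
        seen.add(token)
        kept.append(token)
    return kept


# Toy data (assumed, not from the patent); str.split stands in for jieba.lcut.
stop_words = {"a", "an", "the", "of", "by", "or", "is"}
definition = "a resource controlled by the entity or owned by the entity"
tokens = preprocess_definition(definition, stop_words, str.split)
print(tokens)
```

Passing the tokenizer as a parameter keeps the cleaning logic independent of the segmentation library, which is a design choice of this sketch rather than something the source prescribes.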
Further, in step 3, the graph is constructed by taking each term and the words of its segmented definition text as nodes, with a directed edge from the term to each word of its definition text. In particular, if another term B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.
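The construction rule of step 3 can be sketched as a small adjacency-set builder in Python. The name `build_term_graph`, the toy terms, and the decision to skip self-loops (a term that appears in its own definition) are assumptions of this sketch; the source does not say how self-references are handled.

```python
from typing import Dict, List, Set


def build_term_graph(definitions: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every node to the set of nodes it points to.

    A term points to each word of its (already segmented) definition,
    so an edge A -> B exists whenever B appears in the definition of A.
    """
    graph: Dict[str, Set[str]] = {}
    for term, words in definitions.items():
        graph.setdefault(term, set())
        for word in words:
            if word == term:
                continue  # assumption: ignore self-references
            graph[term].add(word)
            # Definition words are nodes too, even if they are never defined.
            graph.setdefault(word, set())
    return graph


# Toy segmented definitions (assumed, not from the patent).
definitions = {
    "asset": ["resource", "entity"],
    "liability": ["obligation", "entity"],
    "equity": ["asset", "liability"],
}
graph = build_term_graph(definitions)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy graph, "equity" points to "asset" and "liability" because both terms occur in its definition, while "resource", "obligation", and "entity" have no outgoing edges.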
Further, in step 4, the semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside the loops.
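The selection rule of step 4 can be illustrated with the Python sketch below. The source computes PageRank in MATLAB R2016a; this sketch swaps in a plain power iteration so that it is self-contained. It also rests on two interpretive assumptions: a "loop" is taken to be a strongly connected component with more than one node, and a "leaf node" is taken to be a node with no outgoing edges. All function names, the damping factor of 0.85, and the toy graph are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]


def pagerank(graph: Graph, damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 200) -> Dict[str, float]:
    """Power-iteration PageRank; dangling nodes spread rank uniformly."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def strongly_connected_components(graph: Graph) -> List[Set[str]]:
    """Tarjan's algorithm (recursive, adequate for small graphs)."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[Set[str]] = []

    def visit(v: str) -> None:
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component: Set[str] = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def select_semantic_elements(graph: Graph, rank: Dict[str, float]) -> Set[str]:
    """Top-ranked node of each loop, plus leaf nodes outside loops."""
    selected: Set[str] = set()
    for component in strongly_connected_components(graph):
        if len(component) > 1:
            selected.add(max(sorted(component), key=lambda v: rank[v]))
        else:
            (v,) = component
            if not graph[v]:
                selected.add(v)
    return selected


# Toy graph (assumed): "asset" and "resource" define each other.
graph: Graph = {
    "asset": {"resource", "entity"},
    "resource": {"asset"},
    "equity": {"asset", "entity"},
    "entity": set(),
}
rank = pagerank(graph)
elements = select_semantic_elements(graph, rank)
print(sorted(elements))
```

In the toy graph, "asset" outranks "resource" inside their mutual-definition loop, and "entity" is the only leaf, so those two nodes are selected; "equity" is neither in a loop nor a leaf and is left out.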
Further, in step 5, words among the extracted semantic elements that have different definitions but similar forms are merged.
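One possible reading of the merging in step 5 is sketched below in Python: extracted semantic elements that fall into the same group of a synonym thesaurus are collected into one merged entry. The source does not describe the lookup or the merge rule in detail, so the function name `merge_by_synonym_group`, the group codes, and the toy word-to-group mapping are all hypothetical and are not real synonym forest entries.

```python
from typing import Dict, List, Tuple


def merge_by_synonym_group(
    elements: List[str],
    group_of: Dict[str, str],
) -> List[List[str]]:
    """Group elements that share a synonym-group code.

    Words without a group code stay in a group of their own.
    Input order is preserved within and across groups.
    """
    buckets: Dict[Tuple[str, str], List[str]] = {}
    for word in elements:
        if word in group_of:
            key = ("group", group_of[word])
        else:
            key = ("word", word)  # ungrouped word forms its own bucket
        bucket = buckets.setdefault(key, [])
        if word not in bucket:
            bucket.append(word)
    return list(buckets.values())


# Toy data (assumed): three words share the hypothetical group code "G1".
elements = ["entity", "enterprise", "resource", "company"]
group_of = {"entity": "G1", "enterprise": "G1", "company": "G1"}
merged = merge_by_synonym_group(elements, group_of)
print(merged)
```

The first word of each merged group could serve as its representative; choosing the representative by PageRank value instead would be an equally plausible variant.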
Further, Excel is used for the structured collation of the accounting dictionary.
In this embodiment, the semantic element extraction method for an XBRL field ontology proceeds as follows. In step 1, the definition texts of accounting terms are extracted and collated from an accounting dictionary; the extraction and collation are done manually, and the texts are compiled in Excel, which is used for the structured collation of the accounting dictionary.
In step 2, word segmentation, stop-word removal, and de-duplication are performed on the texts. Specifically, word segmentation is performed with the jieba package for Python, 4 accounting terms in the accounting dictionary are imported into the custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
In step 3, a directed network graph of accounting terms is constructed. Each term and the words of its segmented definition text are taken as nodes, with a directed edge from the term to each word of its definition text; in particular, if another term B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.
In step 4, once the network graph has been constructed from the accounting dictionary, the PageRank value of each node is calculated using MATLAB R2016a as the basis for semantic element extraction; the semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside the loops.
In step 5, the semantic elements are merged based on the synonym forest; words among the extracted semantic elements that have different definitions but similar forms are merged.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (7)
1. A semantic element extraction method for an XBRL field ontology, comprising the following steps:
step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;
step 2, performing word segmentation, stop-word removal, and de-duplication on the texts;
step 3, constructing a directed network graph of accounting terms;
step 4, once the network graph has been constructed from the accounting dictionary, calculating the PageRank value of each node using MATLAB R2016a as the basis for semantic element extraction;
step 5, merging the semantic elements based on the synonym forest.
2. The semantic element extraction method for an XBRL field ontology according to claim 1, wherein in step 1 the definition texts of the accounting terms are manually extracted and collated, and are compiled in Excel.
3. The semantic element extraction method for an XBRL field ontology according to claim 1, wherein in step 2 word segmentation is performed with the jieba package for Python, 4 accounting terms in the accounting dictionary are imported into a custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.
4. The semantic element extraction method for an XBRL field ontology according to claim 1, wherein in step 3 each term and the words of its segmented definition text are taken as nodes, with a directed edge from the term to each word of its definition text, and wherein, if another term B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.
5. The semantic element extraction method for an XBRL field ontology according to claim 1, wherein in step 4 the semantic elements are found at the node with the largest PageRank value in each loop and at the leaf nodes outside the loops.
6. The semantic element extraction method for an XBRL field ontology according to claim 1, wherein in step 5 words among the extracted semantic elements that have different definitions but similar forms are merged.
7. The semantic element extraction method for an XBRL field ontology according to claim 2, wherein Excel is used for the structured collation of the accounting dictionary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010677371.9A CN111797635A (en) | 2020-07-14 | 2020-07-14 | Semantic element extraction method for XBRL field ontology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010677371.9A CN111797635A (en) | 2020-07-14 | 2020-07-14 | Semantic element extraction method for XBRL field ontology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111797635A (en) | 2020-10-20 |
Family
ID=72806991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010677371.9A Pending CN111797635A (en) | 2020-07-14 | 2020-07-14 | Semantic element extraction method for XBRL field ontology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111797635A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113919342A (en) * | 2021-09-18 | 2022-01-11 | 暨南大学 | Method for constructing accounting term co-occurrence network diagram |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111178045A (en) * | 2019-10-14 | 2020-05-19 | 深圳软通动力信息技术有限公司 | Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium |
-
2020
- 2020-07-14 CN CN202010677371.9A patent/CN111797635A/en active Pending
Non-Patent Citations (2)
Title |
---|
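Di Ye et al.: "Semantic Primitives Extraction for XBRL Domain Ontology", Modern Economy, vol. 11, no. 3, pages 686-700 * |
冯丽 (Feng Li): "词义基元的内涵及其在同义词群建构中的作用" [The connotation of word-sense primitives and their role in the construction of synonym groups], no. 3, pages 113-116 * |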
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022022045A1 (en) | Knowledge graph-based text comparison method and apparatus, device, and storage medium | |
CN102591988B (en) | Short text classification method based on semantic graphs | |
Deshwal et al. | Twitter sentiment analysis using various classification algorithms | |
CN110851598B (en) | Text classification method and device, terminal equipment and storage medium | |
CN107229668A (en) | A kind of text extracting method based on Keywords matching | |
CN101127042A (en) | Sensibility classification method based on language model | |
CN109471933A (en) | A kind of generation method of text snippet, storage medium and server | |
Nagamanjula et al. | A novel framework based on bi-objective optimization and LAN2FIS for Twitter sentiment analysis | |
CN111460170B (en) | Word recognition method, device, terminal equipment and storage medium | |
CN104899230A (en) | Public opinion hotspot automatic monitoring system | |
CN112101971B (en) | Sensitive client identification method, system and storage medium | |
CN108021582B (en) | Internet public opinion monitoring method and device | |
CN110909540B (en) | Method and device for identifying new words of short message spam and electronic equipment | |
CN113641833B (en) | Service demand matching method and device | |
Bharathi et al. | Sentiment Analysis of Twitter and RSS News Feeds and Its Impact on Stock Market Prediction. | |
CN110222192A (en) | Corpus method for building up and device | |
CN115186650B (en) | Data detection method and related device | |
CN112328657A (en) | Feature derivation method, feature derivation device, computer equipment and medium | |
CN111190873B (en) | Log mode extraction method and system for log training of cloud native system | |
CN106372237A (en) | Fraudulent mail identification method and device | |
CN111985212A (en) | Text keyword recognition method and device, computer equipment and readable storage medium | |
CN111797635A (en) | Semantic element extraction method for XBRL field ontology | |
US20200097605A1 (en) | Machine learning techniques for automatic validation of events | |
CN115495587A (en) | Alarm analysis method and device based on knowledge graph | |
Nguyen et al. | Structural reranking models for named entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||