CN114742068A - Multi-sentence correlation analysis method and system for ISO19650 standard text

Info

Publication number: CN114742068A
Authority: CN (China)
Prior art keywords: sentence, sentences, iso19650, standard, words
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202210355791.4A
Other languages: Chinese (zh)
Inventors: 吴冰, 刘伟军, 宋元斌, 胡锡燎, 诸言涵, 曹金浩, 张波, 陈科技, 王淑红, 王婷婷, 张琳琳, 杨嘉睿, 陈赛慧, 杨铁涵, 黄江倩, 林贺
Current Assignee: Shanghai Jiaotong University; Economic and Technological Research Institute of State Grid Zhejiang Electric Power Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Shanghai Jiaotong University; Economic and Technological Research Institute of State Grid Zhejiang Electric Power Co Ltd
Application filed by: Shanghai Jiaotong University; Economic and Technological Research Institute of State Grid Zhejiang Electric Power Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Abstract

The invention provides a multi-sentence correlation analysis method and system for ISO19650 standard texts, relating to the technical field of information processing. The method comprises the following steps. Step S1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences. Step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence. Step S3: reasoning over those dependency relationships according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence. Step S4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among multiple sentences. The method can overcome the difficulty of extracting semantic information from the Chinese text of the ISO19650 standard caused by the shortage of corpora, and also helps with the automatic analysis of associations and references among ISO19650 sentences.

Description

Multi-sentence correlation analysis method and system for ISO19650 standard text
Technical Field
The invention relates to the technical field of information processing, in particular to a method for analyzing association between multiple sentences in ISO19650 standard series texts based on NLP and an ontology model.
Background
Engineering project development requires all participants to convey explicit information in a timely manner. In addition to the IFC file format, they also require an information management framework to support their collaboration. The ISO19650 family of standards provides such a framework to establish a reliable source of information. Since the ISO19650 family of standards, consisting of 5 parts, constitutes a complex system, the construction industry desires to be able to capture the semantic information in these standards.
However, manually extracting semantic information from the ISO19650 standard series is both time-consuming and costly. A semantic information extraction method based on NLP therefore needs to be developed specifically for this purpose, so that the association and reference relationships between standard clauses can be analyzed automatically with the help of an ontology model of the ISO19650 standard.
The invention patent with publication number CN110096692B discloses a semantic information processing method and a device, the semantic information processing method comprises dividing a question stem into two parts of a known condition and a conclusion according to the obtained question stem; extracting the explicit semantic information in the known conditions and the conclusions according to the obtained known conditions and the conclusions; when implicit semantic information exists in the known conditions and/or conclusions, extracting the implicit semantic information in the known conditions and/or conclusions; and combining the extracted explicit semantic information and the extracted implicit semantic information to obtain the semantic information of the question stem.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a multi-sentence correlation analysis method and system of ISO19650 standard texts.
The scheme of the multi-sentence correlation analysis method and system for ISO19650 standard text provided by the invention is as follows:
In a first aspect, a multi-sentence correlation analysis method for ISO19650 standard text is provided, the method comprising:
step S1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;
step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;
step S3: reasoning over the dependency relationships among the words in each sentence according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence;
step S4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among multiple sentences.
Preferably, step S1 comprises:
step S1.1: acquiring a text file of the Chinese edition of the ISO19650 standard series;
step S1.2: extracting sentences entry by entry from the standard, and performing word segmentation on the sentences;
step S1.3: replacing words obtained by word segmentation, so that professional terms are replaced with their hypernyms.
Preferably, the text file is a docx file. The docx file is decompressed into a set of XML files using the open-source ZLib library; the decompressed XML files are then parsed according to the entry coding rule of the ISO19650 standard series, the content of each standard entry is extracted, all font and paragraph formatting is deleted, and finally a plain text file containing a list of sentences is generated.
Preferably, step S2 comprises: performing syntax tree analysis on the sentence with a dependency parser, tagging each word in the sentence with its part of speech, finding the head word of the sentence, determining the non-head words associated with the head word, then taking each such non-head word as a head word to start the next round of searching for its associated non-head words, and finally obtaining a multi-layer dependency syntax tree.
Preferably, step S3 comprises: semantic relation reasoning, namely designing mapping rules from dependency relationships to semantic relationships and converting the dependency syntax tree into binary semantic relations according to those mapping rules.
In a second aspect, a multi-sentence correlation analysis system for ISO19650 standard text is provided, the system comprising:
module M1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;
module M2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;
module M3: reasoning over the dependency relationships among the words in each sentence according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence;
module M4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among multiple sentences.
Preferably, module M1 comprises:
module M1.1: acquiring a text file of the Chinese edition of the ISO19650 standard series;
module M1.2: extracting sentences entry by entry from the standard, and performing word segmentation on the sentences;
module M1.3: replacing words obtained by word segmentation, so that professional terms are replaced with their hypernyms.
Preferably, the text file is a docx file. The docx file is decompressed into a set of XML files using the open-source ZLib library; the decompressed XML files are then parsed according to the entry coding rule of the ISO19650 standard series, the content of each standard entry is extracted, all font and paragraph formatting is deleted, and finally a plain text file containing a list of sentences is generated.
Preferably, module M2 comprises: performing syntax tree analysis on the sentence with a dependency parser, tagging each word in the sentence with its part of speech, finding the head word of the sentence, determining the non-head words associated with the head word, then taking each such non-head word as a head word to start the next round of searching for its associated non-head words, and finally obtaining a multi-layer dependency syntax tree.
Preferably, module M3 comprises: semantic relation reasoning, namely designing mapping rules from dependency relationships to semantic relationships and converting the dependency syntax tree into binary semantic relations according to those mapping rules.
Compared with the prior art, the invention has the following beneficial effects:
With this method, domain semantic relationships are deduced from syntactic relationships through mapping rules, which largely overcomes the difficulty caused by the shortage of corpora; the feasibility and practicability of the proposed information extraction method have been verified through experiments.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a block diagram of a dependency syntax tree;
FIG. 3 is an example of a dependency tree for a sentence;
FIG. 4 is a diagram of exemplary semantic relationships;
FIG. 5 is a diagram illustrating the conversion from syntactic to semantic relationships;
FIG. 6 is a knowledge-graph stored in a Neo4J graph database;
FIG. 7 is a case of inferring associations between standard statements.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the concept of the invention, and all of these fall within the scope of protection of the present invention.
The embodiment of the invention provides an NLP-based method for analyzing associations between multiple sentences of standard text, which specifically comprises the following steps:
Step S1: performing word segmentation and word replacement on the sentences in ISO19650 to obtain preprocessed sentences.
Step S1 specifically includes:
Step S1.1: obtaining the files of the Chinese translation of the ISO19650 standard series.
Step S1.2: extracting sentences entry by entry from the standard, and segmenting the sentences into words.
Step S1.3: replacing the words obtained by word segmentation (the named entities to which the words in the sentence correspond), so that professional terms are replaced with their hypernyms, where each replacement hypernym is a word that DDParser recognizes and analyzes with high accuracy.
Here the text file is a docx file. The docx file is decompressed into a set of XML files using the open-source ZLib library; the decompressed XML files are parsed according to the entry coding rule of ISO19650, the content of each standard entry is extracted, all font and paragraph formatting is deleted, and finally a plain text file containing a list of sentences is generated.
Step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence. A dependency parser performs syntax tree analysis on the sentence: each word in the sentence is tagged with its part of speech, the head word of the sentence is found, the non-head words associated with the head word are determined, each such non-head word is then taken as a head word to start the next round of searching for its associated non-head words, and a multi-layer dependency syntax tree is finally obtained.
Step S3: reasoning according to the conversion rules from dependency relationships to semantic relationships to obtain the semantic relationships between the words in a single sentence. For this semantic relation reasoning, mapping rules from dependency relationships to semantic relationships are designed, and the dependency syntax tree is converted into binary semantic relations according to those mapping rules.
Step S4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence (the named entities to which the words correspond) and the words in the ontology model, and inferring the association relationships among multiple sentences.
The invention also provides an NLP-based system for analyzing associations between multiple sentences of standard text, which specifically comprises the following modules:
Module M1: performing word segmentation and word replacement on the sentences in ISO19650 to obtain preprocessed sentences.
Module M1 specifically includes:
Module M1.1: obtaining the files of the Chinese translation of the ISO19650 standard series.
Module M1.2: extracting sentences entry by entry from the standard, and segmenting the sentences into words.
Module M1.3: replacing the words obtained by word segmentation (the named entities to which the words in the sentence correspond), so that professional terms are replaced with their hypernyms, where each replacement hypernym is a word that DDParser recognizes and analyzes with high accuracy.
Here the text file is a docx file. The docx file is decompressed into a set of XML files using the open-source ZLib library; the decompressed XML files are parsed according to the entry coding rule of ISO19650, the content of each standard entry is extracted, all font and paragraph formatting is deleted, and finally a plain text file containing a list of sentences is generated.
Module M2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence. A dependency parser performs syntax tree analysis on the sentence: each word in the sentence is tagged with its part of speech, the head word of the sentence is found, the non-head words associated with the head word are determined, each such non-head word is then taken as a head word to start the next round of searching for its associated non-head words, and a multi-layer dependency syntax tree is finally obtained.
Module M3: reasoning according to the conversion rules from dependency relationships to semantic relationships to obtain the semantic relationships between the words in a single sentence. For this semantic relation reasoning, mapping rules from dependency relationships to semantic relationships are designed, and the dependency syntax tree is converted into binary semantic relations according to those mapping rules.
Module M4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among multiple sentences.
Next, the present invention will be described in more detail.
The invention provides a method, based on NLP and an ontology model, for extracting semantic information from the Chinese translation of the ISO19650 standard series and analyzing the associations between its sentences, which comprises the following:
1. Preprocessing of ISO19650 texts:
FIG. 1 shows the preprocessing flow for sentences in the ISO19650 standard series: each sentence is first segmented into words, and professional terms are then replaced with their hypernyms.
The original standard text, which is in English, is first translated into Chinese and stored in Microsoft docx format. All sentences in each part of the ISO standard are then extracted using the preprocessing procedure shown in FIG. 1.
The docx file is actually a compressed package of a collection of XML files, containing textual content and markup for formatting or layout definitions. The docx file is decompressed into a set of XML files using the open source ZLib library, and then the standard text content is extracted from these decompressed files, deleting all fonts and paragraph layouts. In this way, the program generates a plain text file containing a list of sentences.
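The extraction step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source names the ZLib library, whereas this sketch swaps in Python's standard `zipfile` module (which relies on zlib for DEFLATE). The helper names `extract_paragraphs` and `make_demo_docx` and the demo content are assumptions introduced only for illustration, and the entry-coding rules of ISO19650 are not modeled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx package.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(docx_bytes: bytes) -> list[str]:
    """Return the plain text of each non-empty paragraph, dropping all formatting."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Concatenate the text runs (<w:t>) and ignore font and layout markup.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs


def make_demo_docx() -> bytes:
    """Build a tiny in-memory docx-like package (hypothetical demo content)."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>3.3.15 </w:t></w:r>"
        "<w:r><w:t>公共数据环境</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>   </w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", body)
    return buffer.getvalue()


sentences = extract_paragraphs(make_demo_docx())
```

Only the text runs survive; the bold-run properties and the whitespace-only paragraph in the demo package are discarded, which mirrors the deletion of fonts and paragraph layout described above.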
Unlike English sentences, Chinese sentences have no spaces separating words, which makes it difficult to evaluate relationships between words. The Jieba library is therefore used to split each sentence in the ISO19650 standard text into an array of Chinese words, each containing one or more Chinese characters. Because Jieba has limited ability to recognize new words, an ISO19650 professional vocabulary library is used to supplement it. For example, without the professional vocabulary library, Jieba often segments the professional term "public data environment" (the Chinese rendering of the ISO19650 term "common data environment") into the three words "public", "data" and "environment", which is clearly not the intended result. Once the ISO19650 professional vocabulary containing the term "public data environment" is incorporated into Jieba's custom dictionary, the correct segmentation is obtained.
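The effect of adding domain terms to the segmenter's dictionary can be illustrated with a small stand-in. The sketch below does not call Jieba; it uses greedy forward maximum matching, a much simpler technique, purely to show why a dictionary entry for the whole term changes the result. The vocabularies and the example sentence are assumptions. With Jieba itself, a term would typically be registered through `jieba.add_word` or `jieba.load_userdict`.

```python
def segment(sentence: str, vocabulary: set[str]) -> list[str]:
    """Greedy forward maximum matching: take the longest vocabulary entry at each position."""
    longest = max((len(word) for word in vocabulary), default=1)
    words, pos = [], 0
    while pos < len(sentence):
        for size in range(min(longest, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical vocabularies: "公共" (public), "数据" (data), "环境" (environment),
# "是" (is), "信息源" (information source).
general_vocabulary = {"公共", "数据", "环境", "是", "信息源"}
domain_terms = {"公共数据环境"}  # assumed entry of the ISO19650 professional vocabulary

sentence = "公共数据环境是信息源"
without_terms = segment(sentence, general_vocabulary)
with_terms = segment(sentence, general_vocabulary | domain_terms)
```

Without the domain entry the term is split into three words; with it, the term is kept whole, which is the behavior the professional vocabulary library is meant to secure.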
After word segmentation, each professional term is replaced by its hypernym, chosen to be a word that DDParser recognizes and analyzes with high accuracy. To implement this replacement, each term in the ISO19650 vocabulary library is defined together with its hypernym. For example, the hypernym of "public data environment" is "information source". Because many of the professional terms in ISO19650 would make the subsequent dependency parsing difficult, they are replaced with hypernyms that are easier to recognize and analyze.
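A minimal sketch of the hypernym replacement follows. The single mapping entry comes from the example in the text; the function name, the example sentence, and the idea of recording a trace of the original terms (so they can be restored after parsing) are assumptions added for illustration.

```python
# Term -> hypernym, as in the text: "public data environment" -> "information source".
HYPERNYMS = {"公共数据环境": "信息源"}


def replace_with_hypernyms(words, hypernyms):
    """Replace professional terms by their hypernyms and remember what was replaced."""
    replaced, trace = [], {}
    for index, word in enumerate(words):
        if word in hypernyms:
            trace[index] = word  # position -> original term (assumed bookkeeping)
            replaced.append(hypernyms[word])
        else:
            replaced.append(word)
    return replaced, trace


# Hypothetical segmented sentence: "appointing party / manages / public data environment".
words = ["委任方", "管理", "公共数据环境"]
replaced, trace = replace_with_hypernyms(words, HYPERNYMS)
```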
2. Semantic information extraction of ISO19650 standard text:
2.1 Obtaining the dependency syntax relations:
The structure of a sentence can be obtained through dependency syntactic analysis. This process tags the part of speech of each Chinese word, determines the dependency relationship between each word and the other components of the sentence, and finally parses the sentence into a syntactic dependency tree, which makes the sentence easier for a computer to process in later steps. The open-source Baidu dependency parser (DDParser) is used to analyze the dependencies between words.
Referring to FIG. 2, DDParser is a syntactic analysis model built on the deep biaffine attention model. The representation $e_i$ of the $i$-th word is the concatenation of its word embedding vector $e_i^{word}$ and the character-level LSTM vector $\mathrm{charLSTM}(w_i)$, as given by the following formula:
$$e_i = e_i^{word} \oplus \mathrm{charLSTM}(w_i)$$
where $\mathrm{charLSTM}(w_i)$ is the concatenated vector produced by sequentially feeding each character of the $i$-th word into a BiLSTM layer. Each $e_i$ is then input into three BiLSTM layers to generate a high-dimensional vector $r_i$. Subsequently, the dimension of each vector $r_i$ is reduced by a multi-layer perceptron (MLP). This dimensionality reduction excludes information that has little influence on the dependency relationships between words. Finally, the dependencies and their syntactic types are identified using biaffine attention.
Each word can be regarded as either a head or a dependent. After the word vectors are input into the deep biaffine attention model, the dependency-arc score $S_i^{arc}$ and the relation score $S_i^{rel}$ are calculated. From these two scores, the dependency between the corresponding two words can be deduced. For the example in FIG. 2, the syntactic relationship between "is" and "information" is evaluated as verb-object (VOB).
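For orientation, the arc-scoring step of a biaffine parser is commonly written as below. The notation is taken from the general biaffine formulation rather than from the patent, and the symbols $h_i^{dep}$, $h_j^{head}$, $U$ and $u$ are assumptions introduced here; DDParser's exact parameterization may differ. $S^{arc}_{ij}$ denotes the $j$-th entry of the arc score vector $S_i^{arc}$, i.e. the score for word $j$ being the head of word $i$.

```latex
h_i^{dep} = \mathrm{MLP}^{dep}(r_i), \qquad h_j^{head} = \mathrm{MLP}^{head}(r_j)

S^{arc}_{ij} = (h_j^{head})^{\top} U \, h_i^{dep} + (h_j^{head})^{\top} u

\hat{y}_i = \arg\max_j S^{arc}_{ij}
```

The relation score $S_i^{rel}$ is obtained analogously, with a separate set of MLPs and one biaffine term per dependency label, evaluated for the predicted head $\hat{y}_i$.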
DDParser has been trained on the manually labeled data set DuCTB, in which the dependencies between pairs of Chinese words are annotated. DuCTB has 24 part-of-speech tags and 14 dependency relationships. Table 1 below lists some of them.
TABLE 1
Dependency   Explanation
HED          Head (core) of the whole sentence
SBV          Subject-predicate: relationship between the subject and the predicate (head)
VOB          Verb-object: relationship between the predicate (head) and the object
ATT          Attributive: relationship between an attributive modifier and a noun (head)
ADV          Adverbial: relationship between an adverbial modifier and a verb (head)
POB          Preposition-object: relationship between a preposition and its object (head)
COO          Coordination between two words
FIG. 3 illustrates an example of a dependency syntax tree. A directed arc runs from each core (head) word to its modifier, and each directed arc represents a grammatical dependency. For example, there is an arc between the subject "public data environment" and the predicate "is" indicating the dependency "SBV", while the object "information source" is linked to the same predicate "is" by the dependency "VOB". The semantic relationship between the two words can then be inferred from these two dependencies.
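The multi-layer tree described above can be sketched in a few lines. The parse below is hand-written for the example sentence and is not real parser output; it is assumed to follow a word/head/deprel layout with 1-based head indices and 0 marking the root, and the function name `build_layers` is hypothetical.

```python
from collections import defaultdict

# Hand-written parse of "公共数据环境 / 是 / 信息源"
# ("public data environment" / "is" / "information source").
parse = {
    "word": ["公共数据环境", "是", "信息源"],
    "head": [2, 0, 2],  # 1-based index of each word's head, 0 for the sentence head
    "deprel": ["SBV", "HED", "VOB"],
}


def build_layers(parse):
    """Group words by depth in the dependency tree, starting from the sentence head."""
    children = defaultdict(list)
    for index, head in enumerate(parse["head"], start=1):
        children[head].append(index)
    layers, frontier = [], children[0]
    while frontier:
        layers.append([(parse["word"][i - 1], parse["deprel"][i - 1]) for i in frontier])
        # The dependents found in this round become the heads of the next round.
        frontier = [child for i in frontier for child in children[i]]
    return layers


layers = build_layers(parse)
```

The first layer holds the sentence head and each following layer holds the dependents of the previous one, matching the round-by-round search for non-head words.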
2.2 Semantic relation reasoning:
DDParser mainly handles the syntactic structure of sentences, not their semantic relationships. Syntactic relationships convey information about the grammatical structure of a sentence, whereas semantic information extraction focuses on the semantic relationships between entities (concepts). Semantic relationships are therefore more important for understanding text, and the associations between sentences can subsequently be inferred from them.
In order to deduce semantic information from the syntactic structure of a sentence, a set of mapping rules from one or more dependencies to semantic relationships is designed, and table 2 below lists typical mapping rules. Using these mapping rules, the dependency syntax relationship derived by the DDParser can be converted into a semantic relationship. The letters A, I, F, O in the table indicate entity type, information, function, and others, respectively.
TABLE 2
[Table 2 appears only as images in the original publication (Figures BDA0003582634090000071 and BDA0003582634090000081) and is not reproduced here.]
In addition, to better explain the semantic relationships, an ontology model of ISO19650 is proposed on the basis of a manual analysis of the ISO19650 series. The ontology model defines three core concepts, namely information, roles and behaviors, together with the semantic relationships among them, as shown in FIG. 4.
FIG. 5 illustrates the transformation of the dependencies in FIG. 3 into semantic relationships. For example, following the mapping rules described in Table 2, the two syntactic relations SBV (between "public data environment" and "is") and VOB (between "information source" and "is") are mapped to the semantic relation "hyponymy" between subject and object; that is, "public data environment" is a hyponym of "information source". Meanwhile, the ATT (attributive) relationship between "information source" and "endorse" is converted into the semantic relationship "negotiation" between the two entities: the core word "information source" is of the "information" type, while the modifier "endorse" is of the "function" type and is the trigger of the semantic relationship "negotiation". In addition, two other syntactic relations, ATT and ADV, are converted into the two semantic relations "attribute" and "constraint".
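The SBV plus VOB rule from the example above can be sketched as follows. Only this one rule is implemented, since the remaining mapping rules of Table 2 are not available in the text; the function name, the copula list and the hand-written parse (in an assumed word/head/deprel layout with 1-based heads) are illustrative assumptions.

```python
def infer_semantic_relations(parse, copulas=("是",)):
    """Apply one illustrative rule: SBV + VOB sharing a copular head yields hyponymy."""
    words, heads, deprels = parse["word"], parse["head"], parse["deprel"]
    relations = []
    for head_index, head_word in enumerate(words, start=1):
        if head_word not in copulas:
            continue
        arcs = list(zip(words, heads, deprels))
        subjects = [w for w, h, d in arcs if h == head_index and d == "SBV"]
        objects = [w for w, h, d in arcs if h == head_index and d == "VOB"]
        # Each (subject, object) pair becomes a binary semantic relation.
        relations.extend((s, "Hyponymy", o) for s in subjects for o in objects)
    return relations


# Hand-written parse: "公共数据环境 / 是 / 认可 / 信息源"
# ("public data environment" / "is" / "endorse" / "information source").
parse = {
    "word": ["公共数据环境", "是", "认可", "信息源"],
    "head": [2, 0, 4, 2],
    "deprel": ["SBV", "HED", "ATT", "VOB"],
}
triples = infer_semantic_relations(parse)
```

The output is a list of binary relations in (subject, relation, object) form, which is the shape needed for loading into a graph database.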
3. Graph-database-based reasoning about associations between clauses:
The ISO19650 standard text contains many interrelated sentences, which are linked either through shared concepts or through the ISO19650 ontology model. When consulting the ISO19650 standard series, it is often necessary to determine the references or dependencies between individual sentences. The graph database Neo4J is therefore used to describe the semantic relationships extracted from the standard sentences, the ISO19650 ontology model, and the links between the two. Compared with an SQL relational database, Neo4J allows rich attribute descriptions to be defined on edges, which greatly reduces the join operations that an SQL database would require and speeds up query and reasoning.
Specifically, a node represents a named entity and a directed edge represents a binary relationship between two nodes. Both nodes and edges may have multiple attributes. The association of sentences with the ontology model can also be represented in Neo4J. In Neo4J, the sequence of words in a sentence is represented by an "enter" relationship from each word to the next. As FIG. 6 shows, syntactic relationships are represented by a "Depend_On" relationship from the modifier to the core word, and the type of the syntactic relationship is stored as an attribute of the "Depend_On" edge.
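One way to load such binary relations into the graph is to generate parameterized Cypher statements, sketched below. The node label `Entity`, the property `name` and the function name are assumptions, not the schema used in the patent. Relationship types cannot be passed as Cypher parameters, so the sketch validates them before interpolating them into the query text.

```python
def triples_to_cypher(triples):
    """Turn (head, relation, tail) triples into parameterized Cypher MERGE statements."""
    statements = []
    for head, relation, tail in triples:
        # Relationship types are interpolated into the query, so restrict their characters.
        if not relation.replace("_", "").isalnum():
            raise ValueError(f"unsafe relationship type: {relation!r}")
        query = (
            "MERGE (a:Entity {name: $head}) "
            "MERGE (b:Entity {name: $tail}) "
            f"MERGE (a)-[:{relation}]->(b)"
        )
        statements.append((query, {"head": head, "tail": tail}))
    return statements


statements = triples_to_cypher([("公共数据环境", "Hyponymy", "信息源")])
```

Each (query, parameters) pair could then be handed to a Neo4j client, for example through a driver session's `run` method; `MERGE` keeps the import idempotent, so shared concepts from different sentences end up on the same node.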
The graph structure of the Neo4J database can further be used for knowledge reasoning. FIG. 7 illustrates a typical case of reasoning with Neo4J. In the sentence of section 3.3.15 of part 1 of ISO19650, "common" modifies the word "endorse"; the hypernym of "endorse" is "behavior", and the ontology model implies that a "behavior" must have a "performer". The "roles" are the performers (actors) defined in section 3.2.1 of part 1 of ISO19650. Meanwhile, the second sentence in FIG. 7 contains three hyponyms of "performer": "person", "organization" and "unit". This means that the persons, organizations and units participating in the project construction process should agree on the composition of the public data environment, since it is a shared information source. This reasoning process can be realized with the following Cypher query:
// Assumption: the garbled "[ + ]" in the source is read as a variable-length path [*].
MATCH (a:`public data environment`)-[*]-(b:common)
MATCH (b)-[:Trigger_Word]->(c:role)
MATCH (c)-[:Hyponymy]-(d)
RETURN d
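The same chain of reasoning can be followed on a small in-memory graph, which may help when reading the query. The edge list below is toy data loosely modeled on the description of FIG. 7, not the contents of the actual database, and the function names are hypothetical.

```python
from collections import deque

# (source, relationship type, target); toy data, not the patent's graph.
EDGES = [
    ("common", "Depend_On", "endorse"),
    ("endorse", "Depend_On", "information source"),
    ("public data environment", "Hyponymy", "information source"),
    ("common", "Trigger_Word", "role"),
    ("person", "Hyponymy", "role"),
    ("organization", "Hyponymy", "role"),
    ("unit", "Hyponymy", "role"),
]


def connected(start, goal, edges):
    """Breadth-first search that ignores edge direction and type."""
    neighbours = {}
    for source, _, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def infer_actors(term, trigger, edges):
    """Find hyponyms of the role triggered by a word that is linked to the term."""
    if not connected(term, trigger, edges):
        return []
    roles = {t for s, r, t in edges if s == trigger and r == "Trigger_Word"}
    actors = set()
    for source, relation, target in edges:
        if relation != "Hyponymy":
            continue
        if target in roles:
            actors.add(source)
        elif source in roles:
            actors.add(target)
    return sorted(actors)


actors = infer_actors("public data environment", "common", EDGES)
```

On this toy graph the result lists the three kinds of participants that are expected to agree on the shared information source.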
Engineering infrastructure project management requires that BIM information be shared among all project participants. Project participants require a widely recognized BIM information management collaboration framework. The ISO19650 family of standards provides concepts, principles and procedures for establishing a reliable source of shared information. However, the series of ISO19650 standards consisting of 5 parts constitutes a very complex text system, and infrastructure builders would like to have an efficient tool to capture semantic information in these standards to better discover and infer associations and references between standard articles.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A multi-sentence correlation analysis method for ISO19650 standard text, characterized by comprising the following steps:
step S1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;
step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;
step S3: reasoning over the dependency relationships among the words in each sentence, according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words within a single sentence;
step S4: importing the semantic relationships within single sentences into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among the sentences.
2. The multi-sentence correlation analysis method for ISO19650 standard text according to claim 1, wherein step S1 comprises:
step S1.1: acquiring a text file of the Chinese edition of the ISO19650 standard series;
step S1.2: extracting sentences from each standard entry and performing word segmentation on the sentences;
step S1.3: performing word replacement on the words obtained by word segmentation, replacing professional terms with their hypernyms.
3. The multi-sentence correlation analysis method for ISO19650 standard text according to claim 2, wherein the text file is a docx file; the docx file is decompressed into a set of XML files using the open-source ZLib library; the decompressed XML files are then parsed according to the entry coding rules of the ISO19650 standard series to extract the standard entry contents; all font and paragraph formatting is deleted; and finally a plain text file containing a sentence list is generated.
4. The multi-sentence correlation analysis method for ISO19650 standard text according to claim 1, wherein step S2 comprises: performing syntax tree analysis on the sentence with a dependency parser, tagging each word in the sentence with its part of speech, finding the central word of the sentence, determining the non-central words associated with the central word, then taking each such non-central word as a new central word and starting the next round of searching for its associated non-central words, and finally obtaining a multilayer dependency syntax tree.
5. The multi-sentence correlation analysis method for ISO19650 standard text according to claim 1, wherein step S3 comprises: semantic relationship reasoning, namely designing mapping rules from dependency relationships to semantic relationships, and converting the dependency syntax tree into binary semantic relationships according to the mapping rules.
6. A multi-sentence correlation analysis system for ISO19650 standard text, comprising:
module M1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;
module M2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;
module M3: reasoning over the dependency relationships among the words in each sentence, according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words within a single sentence;
module M4: importing the semantic relationships within single sentences into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among the sentences.
7. The multi-sentence correlation analysis system for ISO19650 standard text according to claim 6, wherein module M1 comprises:
module M1.1: acquiring a text file of the Chinese edition of the ISO19650 standard series;
module M1.2: extracting sentences from each standard entry and performing word segmentation on the sentences;
module M1.3: performing word replacement on the words obtained by word segmentation, replacing professional terms with their hypernyms.
8. The multi-sentence correlation analysis system for ISO19650 standard text according to claim 7, wherein the text file is a docx file; the docx file is decompressed into a set of XML files using the open-source ZLib library; the decompressed XML files are then parsed according to the entry coding rules of the ISO19650 standard series to extract the standard entry contents; all font and paragraph formatting is deleted; and finally a plain text file containing a sentence list is generated.
9. The multi-sentence correlation analysis system for ISO19650 standard text according to claim 6, wherein module M2 comprises: performing syntax tree analysis on the sentence with a dependency parser, tagging each word in the sentence with its part of speech, finding the central word of the sentence, determining the non-central words associated with the central word, then taking each such non-central word as a new central word and starting the next round of searching for its associated non-central words, and finally obtaining a multilayer dependency syntax tree.
10. The multi-sentence correlation analysis system for ISO19650 standard text according to claim 6, wherein module M3 comprises: semantic relationship reasoning, namely designing mapping rules from dependency relationships to semantic relationships, and converting the dependency syntax tree into binary semantic relationships according to the mapping rules.
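
By way of non-limiting illustration only, the conversion of dependency relationships into binary semantic relationships (step S3) and the linking of sentences through ontology terms (step S4) might be sketched as follows. The sketch rests on several assumptions that are not part of the claims: the dependency labels use Universal Dependencies style names, the rule table `DEP_TO_SEM` is hypothetical because the claims do not enumerate the mapping rules, an in-memory dictionary stands in for the graph database, and the sentence identifiers, words and ontology terms are invented sample data.

```python
from collections import defaultdict

# Hypothetical mapping rules (dependency label -> semantic relation).
# The claims do not list the actual rules; these entries are illustrative.
DEP_TO_SEM = {
    "nsubj": "agent",
    "obj": "patient",
    "nmod": "attribute",
}


def to_semantic_triples(dependencies):
    """Convert (head, dep_label, dependent) arcs into binary semantic relations.

    Arcs whose label has no mapping rule (e.g. punctuation) are dropped.
    """
    return [
        (head, DEP_TO_SEM[label], dependent)
        for head, label, dependent in dependencies
        if label in DEP_TO_SEM
    ]


def link_sentences(sentence_triples, ontology_terms):
    """Associate sentences whose words link to the same ontology term.

    A dictionary keyed by ontology term stands in for the graph database:
    each term collects the ids of the sentences that mention it.
    """
    term_to_sentences = defaultdict(set)
    for sent_id, triples in sentence_triples.items():
        for head, _relation, dependent in triples:
            for word in (head, dependent):
                if word in ontology_terms:
                    term_to_sentences[word].add(sent_id)

    associations = set()
    for term, sent_ids in term_to_sentences.items():
        ordered = sorted(sent_ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                associations.add((first, second, term))
    return sorted(associations)


# Invented sample data: hand-written dependency arcs for three sentences.
dependencies = {
    "s1": [("establish", "nsubj", "appointing party"),
           ("establish", "obj", "information requirements")],
    "s2": [("deliver", "nsubj", "appointed party"),
           ("deliver", "obj", "information model")],
    "s3": [("review", "nsubj", "appointing party"),
           ("review", "obj", "information model")],
}
ontology_terms = {"appointing party", "information model"}

triples = {sid: to_semantic_triples(arcs) for sid, arcs in dependencies.items()}
associations = link_sentences(triples, ontology_terms)
for first, second, term in associations:
    print(f"{first} <-> {second} via '{term}'")
```

In this sketch two sentences are treated as associated when at least one word in each links to the same ontology term; a graph database would express the same idea as a path query between sentence nodes through a shared ontology node.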
CN202210355791.4A 2022-04-06 2022-04-06 Multi-sentence correlation analysis method and system for ISO19650 standard text Pending CN114742068A (en)

Publications (1)

Publication Number Publication Date
CN114742068A true CN114742068A (en) 2022-07-12

Family

ID=82280132

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination