CN114742068A

CN114742068A - Multi-sentence correlation analysis method and system for ISO19650 standard text

Info

Publication number: CN114742068A
Application number: CN202210355791.4A
Authority: CN
Inventors: 吴冰; 刘伟军; 宋元斌; 胡锡燎; 诸言涵; 曹金浩; 张波; 陈科技; 王淑红; 王婷婷; 张琳琳; 杨嘉睿; 陈赛慧; 杨铁涵; 黄江倩; 林贺
Original assignee: Shanghai Jiaotong University; Economic and Technological Research Institute of State Grid Zhejiang Electric Power Co Ltd
Current assignee: Shanghai Jiaotong University; Economic and Technological Research Institute of State Grid Zhejiang Electric Power Co Ltd
Priority date: 2022-04-06
Filing date: 2022-04-06
Publication date: 2022-07-12

Abstract

The invention provides a multi-sentence correlation analysis method and system for ISO19650 standard texts, relating to the technical field of information processing. The method comprises: step S1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences; step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence; step S3: reasoning over the dependency relationships among the words according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence; step S4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among the sentences. The method can overcome the difficulty of extracting semantic information from the Chinese text of the ISO19650 standard caused by the shortage of corpus, and also helps to solve the difficulty of automatically analyzing associations and references among ISO19650 sentences.

Description

Multi-sentence correlation analysis method and system for ISO19650 standard text

Technical Field

The invention relates to the technical field of information processing, in particular to a method for analyzing association between multiple sentences in ISO19650 standard series texts based on NLP and an ontology model.

Background

Engineering project development requires all participants to convey explicit information in a timely manner. In addition to the IFC file format, they also require an information management framework to support their collaboration. The ISO19650 family of standards provides such a framework to establish a reliable source of information. Since the ISO19650 family of standards, consisting of 5 parts, constitutes a complex system, the construction industry desires to be able to capture the semantic information in these standards.

However, manually extracting semantic information in the ISO19650 standard series is not only time-consuming, but also costly. Therefore, a semantic information extraction method based on NLP needs to be specially developed, so as to automatically analyze the association and reference relationship between each standard article by means of the ontology model of ISO19650 standard.

The invention patent with publication number CN110096692B discloses a semantic information processing method and a device, the semantic information processing method comprises dividing a question stem into two parts of a known condition and a conclusion according to the obtained question stem; extracting the explicit semantic information in the known conditions and the conclusions according to the obtained known conditions and the conclusions; when implicit semantic information exists in the known conditions and/or conclusions, extracting the implicit semantic information in the known conditions and/or conclusions; and combining the extracted explicit semantic information and the extracted implicit semantic information to obtain the semantic information of the question stem.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a multi-sentence correlation analysis method and system of ISO19650 standard texts.

According to the multi-sentence correlation analysis method and system of the ISO19650 standard text, the scheme is as follows:

in a first aspect, a method for multi-sentence association analysis of ISO19650 standard text is provided, the method comprising:

step S1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;

step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;

step S3: reasoning over the dependency relationships among the words according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence;

step S4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among the sentences.

Preferably, the step S1 includes:

step S1.1: acquiring a text file of a Chinese edition ISO19650 standard series;

step S1.2: extracting sentences according to each standard entry, and performing word segmentation on the sentences;

step S1.3: performing word replacement on the words obtained by word segmentation, replacing professional terms with their hypernyms.

Preferably, the text file is a docx file, the docx file is decompressed into a group of XML files by using an open source ZLib library, then the XML files are analyzed from the decompressed files according to the entry coding rule of the ISO19650 standard series, standard entry contents are extracted from the XML files, all fonts and paragraph typesetting are deleted, and finally a plain text file containing a statement list is generated.

Preferably, the step S2 includes: performing syntax tree analysis on the sentence with a dependency parser, marking the part of speech of each word in the sentence, finding the central word of the sentence, determining the non-central words associated with the central word, then taking each such non-central word as a central word and starting the next round of searching for its associated non-central words, and finally obtaining a multilayer dependency syntax tree.

Preferably, the step S3 includes: semantic relation reasoning, namely designing mapping rules from dependency relationships to semantic relationships, and converting the dependency syntax tree into binary semantic relationships according to the mapping rules.

In a second aspect, there is provided a multi-sentence correlation analysis system of ISO19650 standard text, the system comprising:

module M1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;

module M2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;

module M3: reasoning over the dependency relationships among the words according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence;

module M4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among the sentences.

Preferably, said module M1 comprises:

module M1.1: acquiring a text file of a Chinese edition ISO19650 standard series;

module M1.2: extracting sentences according to each standard entry, and performing word segmentation on the sentences;

module M1.3: performing word replacement on the words obtained by word segmentation, replacing professional terms with their hypernyms.

Preferably, said module M2 comprises: performing syntax tree analysis on the sentence with a dependency parser, marking the part of speech of each word in the sentence, finding the central word of the sentence, determining the non-central words associated with the central word, then taking each such non-central word as a central word and starting the next round of searching for its associated non-central words, and finally obtaining a multilayer dependency syntax tree.

Preferably, said module M3 comprises: semantic relation reasoning, namely designing mapping rules from dependency relationships to semantic relationships, and converting the dependency syntax tree into binary semantic relationships according to the mapping rules.

Compared with the prior art, the invention has the following beneficial effects:

according to the method, the domain semantic relation is deduced from the sentence relation through the mapping rule, the difficulty caused by the shortage of the corpus is greatly overcome, and the feasibility and the practicability of the provided information extraction method are verified through experiments.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a schematic overall flow diagram of the present invention;

FIG. 2 is a block diagram of a dependency syntax tree;

FIG. 3 is an example of a dependency tree for a sentence;

FIG. 4 is a diagram of exemplary semantic relationships;

FIG. 5 is a diagram illustrating the conversion from syntactic to semantic relationships;

FIG. 6 is a knowledge-graph stored in a Neo4J graph database;

FIG. 7 is a case of inferring associations between standard statements.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the concept of the invention, all of which fall within the scope of the present invention.

An embodiment of the invention provides an NLP-based method for analyzing associations between multiple sentences of standard text, which specifically comprises the following steps:

Step S1: performing word segmentation and word replacement on the sentences in ISO19650 to obtain preprocessed sentences.

The step S1 specifically includes:

Step S1.1: obtaining the files of the Chinese translation of the ISO19650 standard series.

Step S1.2: extracting sentences according to the standard entries, and performing word segmentation on the sentences.

Step S1.3: performing word replacement on the words obtained by word segmentation (the named entities corresponding to the words in the sentence), replacing professional terms with their hypernyms, wherein the hypernyms used as replacements are words that DDParser recognizes and analyzes with high accuracy.

The text file is a docx file. The docx file is decompressed into a set of XML files using the open-source ZLib library, the XML files are parsed according to the entry coding rule of ISO19650, the standard entry contents are extracted from them, all font and paragraph formatting is deleted, and finally a plain text file containing a list of sentences is generated.

Step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence. Syntax tree analysis is performed on the sentence with a dependency parser: the part of speech of each word in the sentence is marked, the central word of the sentence is found, the non-central words associated with the central word are determined, each such non-central word is then taken as a central word to start the next round of searching for its associated non-central words, and a multilayer dependency syntax tree is finally obtained.

Step S3: reasoning according to the conversion rules from dependency relationships to semantic relationships to obtain the semantic relationships among the words in a single sentence. For semantic relation reasoning, mapping rules from dependency relationships to semantic relationships are designed, and the dependency syntax tree is converted into binary semantic relationships according to the mapping rules.

Step S4: the semantic relation in a single sentence is imported into a graph database, an ontology model of the ISO standard is imported into the graph database, the links between the words (named entities corresponding to the words in the sentence) in each sentence and the words in the ontology model are established, and the association relation among a plurality of sentences is deduced.

The invention also provides an NLP-based system for analyzing associations between multiple sentences of standard text, which specifically comprises the following modules:

Module M1: performing word segmentation and word replacement on the sentences in ISO19650 to obtain preprocessed sentences.

The module M1 specifically includes:

Module M1.1: obtaining the files of the Chinese translation of the ISO19650 standard series.

Module M1.2: extracting sentences according to the standard entries, and performing word segmentation on the sentences.

Module M1.3: performing word replacement on the words obtained by word segmentation (the named entities corresponding to the words in the sentence), replacing professional terms with their hypernyms, wherein the hypernyms used as replacements are words that DDParser recognizes and analyzes with high accuracy.

Module M2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence. Syntax tree analysis is performed on the sentence with a dependency parser: the part of speech of each word in the sentence is marked, the central word of the sentence is found, the non-central words associated with the central word are determined, each such non-central word is then taken as a central word to start the next round of searching for its associated non-central words, and a multilayer dependency syntax tree is finally obtained.

Module M3: reasoning according to the conversion rules from dependency relationships to semantic relationships to obtain the semantic relationships among the words in a single sentence. For semantic relation reasoning, mapping rules from dependency relationships to semantic relationships are designed, and the dependency syntax tree is converted into binary semantic relationships according to the mapping rules.

Next, the present invention will be described in more detail.

The invention provides a method for analyzing association between multiple sentences in an ISO19650 standard series text based on an NLP and an ontology model, in particular to a method for extracting semantic information and analyzing association between sentences of an ISO19650 standard series Chinese translation text based on the NLP and the ontology model, which comprises the following steps:

1. pretreatment of ISO19650 texts:

Referring to FIG. 1, the preprocessing flow for sentences in the ISO19650 standard series is shown: the sentences are first segmented into words, and the professional terms are then replaced by their hypernyms.

The original standard text, the English version, is first translated into Chinese and stored in Microsoft docx format. All sentences in each part of the ISO standard are then extracted using the pre-processing procedure shown in fig. 1.

The docx file is actually a compressed package of a collection of XML files, containing textual content and markup for formatting or layout definitions. The docx file is decompressed into a set of XML files using the open source ZLib library, and then the standard text content is extracted from these decompressed files, deleting all fonts and paragraph layouts. In this way, the program generates a plain text file containing a list of sentences.
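The extraction step described above can be sketched as follows. This is a minimal illustration, not the patented program: it assumes Python's standard `zipfile` module (which relies on zlib for decompression) in place of calling ZLib directly, it reads only `word/document.xml`, and it does not model the entry coding rule of ISO19650. The helper `make_demo_docx` is hypothetical and exists only so the sketch can run without a real standard document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a docx package.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_sentences(docx_bytes: bytes) -> list[str]:
    """Return one plain-text string per non-empty paragraph of a docx file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    sentences = []
    for paragraph in root.iter(W_NS + "p"):
        # Keeping only the <w:t> text nodes drops all font and layout markup.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        if text.strip():
            sentences.append(text.strip())
    return sentences


def make_demo_docx(paragraphs: list[str]) -> bytes:
    """Build a minimal docx-like package in memory (hypothetical demo helper)."""
    body = "".join(
        f"<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>{p}</w:t></w:r></w:p>"
        for p in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS[1:-1]}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(docx_to_sentences(make_demo_docx(["first entry", "second entry"])))
```

A real pipeline would additionally split each paragraph by entry number before writing the sentence list to a plain text file.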

Unlike English sentences, Chinese sentences have no spaces dividing characters into words, which makes it difficult to evaluate the relationships between words. Therefore, the Jieba library is used to divide each sentence in the ISO19650 standard text into an array of Chinese words, each containing one or more Chinese characters. Because Jieba has a limited ability to recognize new words, an ISO19650 professional vocabulary library is used to enhance it. For example, without the ISO19650 professional vocabulary library, the professional term "public data environment" is often segmented by Jieba into the three words "public", "data" and "environment", which is clearly not the intended result. The correct segmentation is obtained after the ISO19650 professional vocabulary, which contains the term "public data environment", is incorporated into Jieba's user dictionary.
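The effect of adding professional terms to the segmenter's dictionary can be illustrated with a small stand-in. The sketch below is not Jieba and does not use Jieba's algorithm: it is a greedy forward maximum matching segmenter written only to show why a dictionary entry for the whole term changes the result. The Chinese strings are assumed renderings of the terms quoted above. With Jieba itself, the corresponding step would typically be loading a user dictionary, for example through `jieba.load_userdict` or `jieba.add_word`.

```python
def segment(sentence: str, vocabulary: set[str]) -> list[str]:
    """Greedy forward maximum matching; unknown characters become one-character words."""
    longest = max((len(word) for word in vocabulary), default=1)
    words, start = [], 0
    while start < len(sentence):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(longest, len(sentence) - start), 0, -1):
            candidate = sentence[start:start + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                start += size
                break
    return words


# Assumed general-purpose vocabulary (illustrative only).
GENERAL = {"公共", "数据", "环境", "是", "信息源"}
# The same vocabulary extended with one professional term.
PROFESSIONAL = GENERAL | {"公共数据环境"}

if __name__ == "__main__":
    sentence = "公共数据环境是信息源"
    print(segment(sentence, GENERAL))
    print(segment(sentence, PROFESSIONAL))
```

With only the general vocabulary the term is split into three words; once the term is in the dictionary it is kept as a single word.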

After word segmentation is completed, the professional terms are replaced by their hypernyms, the hypernyms being words that DDParser recognizes and analyzes with high accuracy. To implement this replacement, each professional term is defined together with its hypernym in the ISO19650 lexicon. For example, the hypernym of "public data environment" is "information source". Because many of the professional terms in ISO19650 would make the next step, dependency parsing, difficult, they are replaced with hypernyms that are easier to recognize and analyze.
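A minimal sketch of the replacement step follows. The lexicon here holds only the one pair given in the text, and the decision to record the original terms alongside the rewritten sentence is an assumption made for illustration, so that the terms can later be linked back to the ontology model. The function name and the sample word list are hypothetical.

```python
# Lexicon excerpt: this single pair is taken from the text above.
HYPERNYMS = {"public data environment": "information source"}


def replace_with_hypernyms(words, hypernyms):
    """Return the rewritten word list and the (position, original term) pairs."""
    replaced, originals = [], []
    for position, word in enumerate(words):
        if word in hypernyms:
            originals.append((position, word))
            replaced.append(hypernyms[word])
        else:
            replaced.append(word)
    return replaced, originals


if __name__ == "__main__":
    # Hypothetical segmented sentence, used only to exercise the function.
    words = ["the", "public data environment", "stores", "files"]
    print(replace_with_hypernyms(words, HYPERNYMS))
```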

2. Semantic information extraction of ISO19650 standard text:

2.1, obtaining the dependency syntax relation:

The structure of a sentence can be obtained through dependency syntactic analysis. This process marks the part of speech of each Chinese word, determines the dependency relationship between each word and the other components of the sentence, and finally interprets the sentence as a syntactic dependency tree, which makes the sentence easier for a computer to process in subsequent steps. The open-source Baidu dependency parser (DDParser) is used to analyze the dependencies among the words.

Referring to FIG. 2, DDParser is a syntactic analysis model built on the deep biaffine attention model. The representation $e_i$ of the $i$-th word is the concatenation of its word embedding vector $e_i^{word}$ and the character-level LSTM vector $\mathrm{charLSTM}(w_i)$, as given by the following formula:

$$e_i = e_i^{word} \oplus \mathrm{charLSTM}(w_i)$$

where $\mathrm{charLSTM}(w_i)$ is the concatenated vector produced by sequentially feeding each character of the $i$-th word to the BiLSTM layer. Each $e_i$ is then input into a three-layer BiLSTM to generate a high-dimensional vector $r_i$. Subsequently, the dimension of each vector $r_i$ is reduced by a multilayer perceptron (MLP). This dimensionality reduction excludes information that has little influence on the dependency relationships between words. Finally, the dependencies and their syntactic types are identified using biaffine attention.

Each word can act as a head or as a dependent. After the word vectors are input into the deep biaffine attention model, the dependency arc score $S_i^{arc}$ and the relation score $S_i^{rel}$ are calculated. From these two scores, the dependency between the corresponding two words can be deduced. For the example in FIG. 2, the syntactic relationship between "is" and "information" is evaluated as verb-object (VOB).
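To make the arc scoring step concrete, the sketch below computes a biaffine score in plain Python. It uses a common formulation, score = dep^T U head + bias · head; whether DDParser uses exactly this parameterization is an assumption, and the vectors, the matrix `U` and the `bias` are toy values rather than trained weights. Picking the best head independently for each word, as done here, does not guarantee a well-formed tree, and the selection of the sentence core is omitted.

```python
def biaffine_score(dep, head, U, bias):
    """Scalar arc score: dep^T U head + bias . head."""
    bilinear = sum(
        dep[i] * U[i][j] * head[j]
        for i in range(len(dep))
        for j in range(len(head))
    )
    linear = sum(b * h for b, h in zip(bias, head))
    return bilinear + linear


def choose_heads(dep_vectors, head_vectors, U, bias):
    """For each word, pick the index of the other word with the highest arc score."""
    heads = []
    for i, dep in enumerate(dep_vectors):
        scores = [
            biaffine_score(dep, head, U, bias) if j != i else float("-inf")
            for j, head in enumerate(head_vectors)
        ]
        heads.append(max(range(len(scores)), key=scores.__getitem__))
    return heads


if __name__ == "__main__":
    U = [[1, 0], [0, 1]]   # toy weight matrix
    bias = [0, 0]          # toy bias vector
    dep_vectors = [[1, 0], [0, 1], [1, 0]]
    head_vectors = [[0, 2], [1, 0], [0, 1]]
    print(choose_heads(dep_vectors, head_vectors, U, bias))
```

The relation score would be computed in the same way, with one weight matrix per dependency label.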

DDParser has been trained on DuCTB, a manually labeled data set in which the dependencies between pairs of Chinese words are annotated. DuCTB has 24 part-of-speech tags and 14 dependency types. Table 1 below lists some of the dependency types.

TABLE 1

Dependency relationship	Explanation
HED	Core of the whole sentence
SBV	Relationship between subject and predicate (head)
VOB	Relationship between predicate (head) and object
ATT	Relationship between attributive and noun (head)
ADV	Relationship between adverbial and verb (head)
POB	Relationship between preposition and object (head)
COO	Coordination between two words

FIG. 3 illustrates an example of a dependency syntax tree. A directed arc is drawn from each core word to its modifiers, and each directed arc represents a grammatical dependency. For example, the arc between the subject "public data environment" and the predicate "is" indicates the dependency "SBV", while the object "information source" has the dependency "VOB" with the same predicate "is". The semantic relationship between the two words can be further inferred from these two dependencies.
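The layered structure of such a tree can be sketched as follows: starting from the sentence core, each round collects the words that depend on the words found in the previous round. The tuple layout (word, head index, label), with -1 marking the core, is an assumption made for illustration and is not DDParser's output format; the three-word parse mirrors the example described above, with the copula rendered as "is".

```python
from collections import defaultdict

# (word, index of its head or -1 for the sentence core, dependency label)
PARSE = [
    ("public data environment", 1, "SBV"),
    ("is", -1, "HED"),
    ("information source", 1, "VOB"),
]


def layers(parse):
    """Group words by depth in the dependency tree, starting at the core word."""
    children = defaultdict(list)
    root = None
    for index, (_, head, _) in enumerate(parse):
        if head == -1:
            root = index
        else:
            children[head].append(index)
    if root is None:
        return []
    result, frontier = [], [root]
    while frontier:
        result.append([parse[i][0] for i in frontier])
        # Next round: every dependent of the words in the current layer.
        frontier = [child for i in frontier for child in children[i]]
    return result


if __name__ == "__main__":
    print(layers(PARSE))
```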

2.2, semantic relation reasoning:

the DDParser mainly handles the syntactic structure of sentences, not the semantic relations. Syntactic relations focus on passing information on the syntactic structure of a sentence, while semantic information extraction focuses on semantic relationships between entities (concepts). Thus, semantic relationships are more important for understanding text, and associations between sentences can be subsequently inferred from these semantic relationships.

In order to deduce semantic information from the syntactic structure of a sentence, a set of mapping rules from one or more dependencies to semantic relationships is designed, and table 2 below lists typical mapping rules. Using these mapping rules, the dependency syntax relationship derived by the DDParser can be converted into a semantic relationship. The letters A, I, F, O in the table indicate entity type, information, function, and others, respectively.

TABLE 2

In addition, in order to better explain the semantic relationships, an ontology model of ISO19650 is proposed on the basis of a manual analysis of the ISO19650 series. The ontology model defines three core concepts, namely information, roles and behaviors, together with the semantic relationships among them, as shown in FIG. 4.

FIG. 5 illustrates the transformation of the dependencies in FIG. 3 into semantic relationships. For example, following the mapping rules described in Table 2, the two syntactic relations SBV (between "public data environment" and "is") and VOB (between "information source" and "is") are mapped to the semantic relation "hyponymy" between subject and object. That is, "public data environment" is a hyponym of "information source". Meanwhile, the ATT (attributive) relationship between "information source" and "endorsement" is converted into the semantic relationship "negotiation" between the two entities: the core word "information source" of the "negotiation" is of the "information" type, while the modifier "endorsement" is of the "function" type and is the trigger of the semantic relationship "negotiation". In addition, two other syntactic relations, ATT and ADV, are converted into the two semantic relations attribute and constraint.
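One of the mapping rules described above can be sketched in code. Only the rule stated in the text is implemented: an SBV and a VOB dependency that share the copula predicate (rendered here as "is") are turned into a hyponymy triple. The remaining rules of Table 2 are not reproduced, and the tuple layout (word, head index, label) is an assumption made for illustration.

```python
# (word, index of its head or -1 for the sentence core, dependency label)
PARSE = [
    ("public data environment", 1, "SBV"),
    ("is", -1, "HED"),
    ("information source", 1, "VOB"),
]


def to_semantic_relations(parse):
    """Convert dependency triples into (subject, relation, object) triples."""
    triples = []
    for head_index, (head_word, _, _) in enumerate(parse):
        # Dependents of this head, keyed by their dependency label.
        dependents = {
            label: word for word, head, label in parse if head == head_index
        }
        # Rule from the text: SBV + VOB on the predicate "is" -> hyponymy.
        if head_word == "is" and "SBV" in dependents and "VOB" in dependents:
            triples.append((dependents["SBV"], "hyponymy", dependents["VOB"]))
    return triples


if __name__ == "__main__":
    print(to_semantic_relations(PARSE))
```

Each resulting triple is a binary semantic relationship that can be stored as one edge in a graph database.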

3. Graph database-based clause association relationship reasoning:

The ISO19650 standard text contains many interrelated sentences, which are related to one another either through shared concepts or through the ontology model of ISO19650. When consulting the ISO19650 standard series, it is often necessary to determine the references or dependencies between individual sentences. Therefore, the graph database Neo4J is used to describe the semantic relationships analyzed from the aforementioned standard sentences, the ISO19650 ontology model, and the links between the two. Compared with an SQL relational database, Neo4J allows rich attribute descriptions to be defined on edges, which greatly reduces the join operations that an SQL database would require and improves the speed of query and reasoning.

Specifically, a node is used to represent a named entity and a directed edge is used to represent a binary relationship between two nodes. Both nodes and edges may have multiple attributes. The association of sentences with the ontology model can also be represented in the graph database Neo4J. In Neo4J, the sequence of words in the same sentence is represented by an "enter" relationship from each word to the next word. As shown in FIG. 6, syntactic relationships are represented by a "Depend_On" relationship from a modifier to its core word, and the type of the syntactic relationship is stored as an attribute of the "Depend_On" edge.
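The edge layout described above can be sketched with a small in-memory structure. This is an illustration only and does not use Neo4J: nodes are identified by their word strings (so repeated words in a sentence would collapse into one node), the `Edge` class and the `type` attribute name are hypothetical, and the parse uses an assumed (word, head index, label) layout.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    kind: str
    target: str
    attributes: dict = field(default_factory=dict)


def sentence_to_edges(parse):
    """Build word-order and dependency edges for one parsed sentence."""
    words = [word for word, _, _ in parse]
    edges = []
    # Word order: an "enter" edge from each word to the next one.
    for previous, following in zip(words, words[1:]):
        edges.append(Edge(previous, "enter", following))
    # Syntax: a Depend_On edge from modifier to core, typed by the label.
    for word, head, label in parse:
        if head != -1:
            edges.append(Edge(word, "Depend_On", words[head], {"type": label}))
    return edges


if __name__ == "__main__":
    parse = [
        ("public data environment", 1, "SBV"),
        ("is", -1, "HED"),
        ("information source", 1, "VOB"),
    ]
    for edge in sentence_to_edges(parse):
        print(edge)
```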

The graph structure of the Neo4J database can be further used for knowledge reasoning. FIG. 7 illustrates a typical case of reasoning with the graph database Neo4J. In the sentence of section 3.3.15 of Part 1 of ISO19650, the word "common" modifies "endorse", and the hypernym of "endorse" is "behavior"; from the ontology model it can be inferred that a "behavior" must have a "performer". The "role" is the actor as defined in section 3.2.1 of Part 1 of ISO19650. Meanwhile, in the second sentence of FIG. 7, there are three hyponyms of "performer": "person", "organization" and "unit". This means that the persons, organizations and units participating in the project construction process should agree on the composition of the public data environment, since it is a shared information source. The reasoning process can be realized with the following Cypher query:

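// Find the "common" modifier connected, over a path of any length, to the public data environment.
MATCH (a:`public data environment`)-[*]-(b:common)

// Follow the trigger word to the "role" concept of the ontology model.
MATCH (b)-[:Trigger_Word]->(c:role)

// Collect every node linked to "role" by a hyponymy relationship.
MATCH (c)-[:Hyponymy]-(d)

RETURN d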
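The same inference can be traced outside the database with a small in-memory sketch. The edge list below is a hypothetical toy graph assembled from the case described above, not the content of the actual knowledge graph, and the function names are illustrative. The three steps mirror the query: check that the term is connected to the modifier over any path, follow the Trigger_Word edge to the concept, then collect the nodes linked to that concept by Hyponymy.

```python
from collections import deque

# Hypothetical toy graph: (source, relationship, target).
EDGES = [
    ("public data environment", "Hyponymy", "information source"),
    ("endorse", "Depend_On", "information source"),
    ("common", "Depend_On", "endorse"),
    ("common", "Trigger_Word", "role"),
    ("person", "Hyponymy", "role"),
    ("organization", "Hyponymy", "role"),
    ("unit", "Hyponymy", "role"),
]


def neighbours(edges, node, kind=None, directed=False):
    """Nodes adjacent to `node`, optionally filtered by relationship type."""
    found = []
    for source, label, target in edges:
        if kind is not None and label != kind:
            continue
        if source == node:
            found.append(target)
        elif not directed and target == node:
            found.append(source)
    return found


def reachable(edges, start, goal):
    """Breadth-first search over undirected edges of any type."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for following in neighbours(edges, node):
            if following not in seen:
                seen.add(following)
                queue.append(following)
    return False


def infer_participants(edges, term="public data environment",
                       modifier="common", concept="role"):
    """Return the hyponyms of `concept` if the term is tied to it via the modifier."""
    if not reachable(edges, term, modifier):
        return []
    if concept not in neighbours(edges, modifier, "Trigger_Word", directed=True):
        return []
    return sorted(neighbours(edges, concept, "Hyponymy"))


if __name__ == "__main__":
    print(infer_participants(EDGES))
```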

Engineering infrastructure project management requires that BIM information be shared among all project participants. Project participants require a widely recognized BIM information management collaboration framework. The ISO19650 family of standards provides concepts, principles and procedures for establishing a reliable source of shared information. However, the series of ISO19650 standards consisting of 5 parts constitutes a very complex text system, and infrastructure builders would like to have an efficient tool to capture semantic information in these standards to better discover and infer associations and references between standard articles.

Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

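1. A multi-sentence correlation analysis method of ISO19650 standard texts is characterized by comprising the following steps:

step S1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;

step S2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;

step S3: reasoning over the dependency relationships among the words according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence;

step S4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among the sentences.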

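2. The multi-sentence correlation analysis method of ISO19650 standard text according to claim 1, wherein the step S1 comprises:

step S1.1: acquiring a text file of the Chinese edition of the ISO19650 standard series;

step S1.2: extracting sentences according to each standard entry, and performing word segmentation on the sentences;

step S1.3: performing word replacement on the words obtained by word segmentation, replacing professional terms with their hypernyms.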

3. The method for multi-sentence correlation analysis of ISO 19650-standard text according to claim 2, wherein the text file is a docx file, the docx file is decompressed into a set of XML files by using an open source ZLib library, and then the XML files are analyzed from the decompressed files according to the entry coding rules of ISO 19650-standard series, standard entry contents are extracted from the XML files, all font and paragraph layouts are deleted, and finally a plain text file containing a sentence list is generated.

4. The multi-sentence correlation analysis method of ISO19650 standard text according to claim 1, wherein the step S2 comprises: performing syntax tree analysis on the sentence with a dependency parser, marking the part of speech of each word in the sentence, finding the central word of the sentence, determining the non-central words associated with the central word, then taking each such non-central word as a central word and starting the next round of searching for its associated non-central words, and finally obtaining a multilayer dependency syntax tree.

5. The multi-sentence correlation analysis method of ISO19650 standard text according to claim 1, wherein the step S3 comprises: semantic relation reasoning, namely designing mapping rules from dependency relationships to semantic relationships, and converting the dependency syntax tree into binary semantic relationships according to the mapping rules.

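6. A multi-sentence correlation analysis system for ISO19650 standard text, comprising:

module M1: performing word segmentation and word replacement on sentences in the ISO19650 standard series to obtain preprocessed sentences;

module M2: performing dependency syntax analysis on the preprocessed sentences to obtain the dependency relationships among the words in each sentence;

module M3: reasoning over the dependency relationships among the words according to conversion rules from dependency relationships to semantic relationships, to obtain the semantic relationships among the words in a single sentence;

module M4: importing the semantic relationships of each single sentence into a graph database, importing an ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the association relationships among the sentences.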

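7. The multi-sentence correlation analysis system of ISO19650 standard text according to claim 6, wherein the module M1 comprises:

module M1.1: acquiring a text file of the Chinese edition of the ISO19650 standard series;

module M1.2: extracting sentences according to each standard entry, and performing word segmentation on the sentences;

module M1.3: performing word replacement on the words obtained by word segmentation, replacing professional terms with their hypernyms.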

8. The multi-sentence correlation analysis system of ISO19650 standard text according to claim 7, wherein the text file is a docx file, the docx file is decompressed into a set of XML files by using the open source ZLib library, the XML files are then parsed according to the entry coding rules of the ISO19650 standard series, the standard entry contents are extracted from the XML files, all font and paragraph formatting is deleted, and finally a plain text file containing a sentence list is generated.

9. The ISO19650 multi-sentence correlation analysis system of standard text according to claim 6, wherein the module M2 comprises: and performing syntax tree analysis on the sentence through a dependency relationship analyzer, marking a part of speech for each word in the sentence, finding out a central word in the sentence, determining a non-central word associated with the central word, starting the next round of searching for the relevant non-central word by taking the non-central word as the central word, and finally obtaining a multilayer dependency syntax tree.

10. The ISO19650 multi-sentence correlation analysis system of standard text according to claim 6, wherein the module M3 comprises: and semantic relation reasoning, namely designing a mapping rule from the dependency relation to the semantic relation, and converting the dependency syntax tree into a binary semantic relation according to the mapping rule.