CN112395854A - Standard element consistency inspection method

Info

Publication number: CN112395854A
Application number: CN202011386161.0A
Authority: CN
Inventors: 王双; 高昂; 程越; 朱虹; 万利; 李柏晨
Original assignee: China National Institute of Standardization
Current assignee: China National Institute of Standardization
Priority date: 2020-12-02
Filing date: 2020-12-02
Publication date: 2021-02-23
Anticipated expiration: 2040-12-02
Also published as: CN112395854B

Abstract

The application provides a consistency check method for standard elements. First, standard knowledge elements are extracted based on rules: unstructured standard files are converted into regular knowledge element storage models, and the terms, normative references and coding tables in them are extracted and vectorized. Next, the consistency of the terms and normative references in the standards is checked, ensuring that terms and normative references are normalized and coordinated among standards. In particular, for important basic general-purpose standards such as information classification and coding standards, code consistency checking reduces the manual intervention in database migration work caused by the implementation of a new version of such a standard, and improves work efficiency. Furthermore, the knowledge element model can serve as a neuron of a neural network input layer; training the neural network on a standard file data set allows massive file data to be processed more efficiently.

Description

Standard element consistency inspection method

Technical Field

The application relates to a method for consistency checking of elements in a standard.

Background

A standard is "a normative document, established by consensus and approved by a recognized body, for common and repeated use, aimed at achieving the optimum order within a given scope". A standard has a strict structure and consists of normative elements and informative elements. The normative elements include the name, scope, normative references, terms and definitions, codes and abbreviations, normative annexes and the like; the informative elements include the cover, table of contents, introduction, informative annexes, references, index and the like. The normative elements are the core of a standard. Although a standard need not contain every normative element, those it does contain must be written correctly and consistently with other standards. The consistency of terms and of normative references, for example, determines the consistency between different standards and the coordination between different versions of a standard. In particular, for important basic general-purpose standards such as information classification and coding standards, consistency checking between codes is of great significance for the migration of information system databases and for information exchange between databases.

At present, the management of standard files remains mainly at the level of data management. Many standard file management platforms (for example, the international standard retrieval platforms of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), and domestic platforms such as the national standard full-text open system, the national standard information public service platform, the national standard file sharing service platform and standard search) mainly provide retrieval and distribution services for standard data and lack fine-grained analysis of standard files. As a result, there is no effective means of checking for the overlap, repetition or inconsistency that may occur between standards, which affects the formulation and implementation of standards to some extent.

To solve the above technical problems, the application provides a method for checking the consistency of standard elements. First, standard knowledge elements are extracted based on rules: unstructured standard files are converted into regular knowledge element storage models, and the terms, normative references and coding tables in them are extracted and vectorized. Next, the consistency of the terms and normative references in the standards is checked, ensuring that terms and normative references are normalized and coordinated among standards. In particular, for important basic general-purpose standards such as information classification and coding standards, code consistency checking reduces the manual intervention in database migration work caused by the implementation of a new version of such a standard, and improves work efficiency. Furthermore, the knowledge element model can serve as a neuron of a neural network input layer; training the neural network on a standard file data set allows massive file data to be processed more efficiently.

Disclosure of Invention

The invention aims to ensure that terms and normative references are normalized and coordinated among standards by checking the consistency of the terms, normative references and classification codes in the standards. It also aims to map one-to-one, one-to-many, many-to-one and fuzzy correspondences between codes automatically, thereby reducing the manual intervention in database migration work caused by the implementation of a new version of an information classification and coding standard and improving work efficiency.

To achieve the above object, the application provides a method for checking the consistency of standard elements, comprising: (1) extracting standard knowledge elements based on rules and converting standard files into knowledge element storage models; (2) establishing vector storage models of the terms, normative references and coding tables, and storing them in normalized form; (3) a term consistency checking step, in which the terms are first retrieved to obtain a set of identical or similar terms, and then, based on the term vector model, same-name terms are checked for consistency and the similarity of similar terms is calculated; (4) a normative reference consistency checking step, in which the normative reference is first retrieved to obtain the set of all standard clauses that cite the referenced document, and then, depending on whether the reference is dated or undated and whether a specific clause is cited, the content of the citing standard is compared for consistency with the corresponding clause or with the whole text of the normatively referenced document.

The knowledge elements are mutually independent units capable of characterizing knowledge.

In particular, the application also aims at carrying out coding comparison analysis between new and old versions of the information classification coding standard.

The code comparison analysis comprises: (1) extracting the coding tables of the two standards, namely the new and old versions to be compared; (2) constructing a knowledge element storage model of the coding tables; (3) determining the mapping relations starting from the same-name classes; (4) further determining the mapping relations between the non-same-name classes.

Drawings

FIG. 1 is a flow of a consistency check of the present application;

FIG. 2 is a system search portal according to the present application;

FIG. 3 is an example (partial) of a coding table as used herein;

FIG. 4 shows an example from the national standard GB/T 14885-2010, Fixed asset classification and codes.

Detailed Description

The application describes a method for checking the consistency of standard elements, which comprises a consistency check of the terms in a standard, a consistency check of the normative references, and a code consistency check for information classification and coding standards. The method helps to resolve problems such as the overlap and repetition of normative elements between standards. In addition, by constructing a semantic similarity model and a structural similarity model, it automatically identifies one-to-one, one-to-many, many-to-one and fuzzy-match mapping relations between coding tables, which reduces manual intervention and improves the efficiency of code comparison work.

Figure 1 shows the flow of the consistency check.

First, the rule-based extraction and normalized storage of standard knowledge elements specifically comprises:

1. Converting the unstructured standard file into a regular knowledge element storage model, with the following definitions:

P = { p | p is a standard file }

The knowledge element storage model of a standard file p is T = { t | t ∈ p, t is a node of p, i.e., a knowledge element }

2. Extracting the term, normative reference and coding table knowledge elements and establishing their vector storage models, specifically as follows:

If t is a term, the term knowledge element is extracted and stored as the following five-tuple, where CName is the Chinese name of the term, EName is the English name of the term, Des is the definition of the term, Note is the footnote of the term, and Qut is the cited-document information of the term.

T = <CName, EName, Des, Note, Qut>

If t is a normative reference, the normative reference knowledge element is extracted and stored as the following four-tuple, where SName is the Chinese name of the referenced standard, SNum is the sequence number of the referenced standard, SYear is the year code (date) of the referenced standard, and Clause is a specific clause of the referenced standard.

T = <SName, SNum, SYear, Clause>

If t is a coding table, the coding table knowledge element is extracted and stored as the following five-tuple, where ParentItem is the parent class of a class, ChildItem is the subclass of a class, ItemCode is the code of the class, ItemName is the name of the class, and Description is its description information.

T = <ParentItem, ChildItem, ItemCode, ItemName, Description>
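The three tuples above can be pictured as simple record types. The following is a minimal sketch only: the class names `Term`, `Citation` and `CodeItem`, the helper `parse_citation`, and the designation pattern (for example "GB/T 14885-2010") are hypothetical illustrations and are not specified in the application. The sketch also assumes that SNum holds the designation without the year.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Term:
    """Term knowledge element <CName, EName, Des, Note, Qut>."""
    CName: str
    EName: str
    Des: str
    Note: str = ""
    Qut: str = ""


@dataclass(frozen=True)
class Citation:
    """Normative reference <SName, SNum, SYear, Clause>."""
    SName: str
    SNum: str
    SYear: Optional[str] = None   # None models an undated reference
    Clause: Optional[str] = None  # None means no specific clause is cited


@dataclass(frozen=True)
class CodeItem:
    """Coding table entry <ParentItem, ChildItem, ItemCode, ItemName, Description>."""
    ParentItem: Optional[str]
    ChildItem: Tuple[str, ...]
    ItemCode: str
    ItemName: str
    Description: str = ""


# Hypothetical extraction rule: "<prefix> <number>" optionally followed by "-<year>".
_DESIGNATION = re.compile(
    r"^\s*(?P<num>[A-Z]+(?:/[A-Z]+)?\s*\d+(?:\.\d+)*)(?:-(?P<year>\d{4}))?\s*$"
)


def parse_citation(designation: str, name: str, clause: Optional[str] = None) -> Citation:
    """Split a designation string into SNum and SYear (assumed rule, see above)."""
    match = _DESIGNATION.match(designation)
    if match is None:
        raise ValueError(f"unrecognized designation: {designation!r}")
    s_num = re.sub(r"\s+", " ", match.group("num"))
    return Citation(SName=name, SNum=s_num, SYear=match.group("year"), Clause=clause)
```

Leaving `SYear` empty for an undated reference matches the later check that decides between dated and undated references by testing whether SYear is empty.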

Second, the consistency checking steps based on the normalized storage model are as follows:

1. The term consistency check specifically comprises:

1.1 performing a fuzzy search for a given term in the standard database to obtain a term set;

1.2 determining, based on the term five-tuple vector storage model, whether the terms are same-name terms or similar terms according to CName;

1.3 if the terms are same-name terms, the four elements EName, Des, Note and Qut in the term vectors must be completely consistent;

1.4 if the terms are not same-name terms, the similarity of the terms is calculated. Let the two terms be A and B. The definitions of the two terms are first vectorized to obtain two text vectors DesA and DesB, and the angle between the two text vectors is denoted θ. The similarity between the definitions of the two terms is the cosine of the angle between the two text vectors:

$$\mathrm{Sim}(A, B) = \cos\theta = \frac{\mathrm{Des}_A \cdot \mathrm{Des}_B}{\lVert \mathrm{Des}_A \rVert \, \lVert \mathrm{Des}_B \rVert}$$

The closer the resulting similarity is to 1, the more similar the two terms are.
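Steps 1.2 to 1.4 can be sketched as follows. The application does not state how definitions are vectorized, so the bag-of-words term-frequency vector used here is an assumption, and the function names are hypothetical. For Chinese definitions a word segmenter would be needed in place of the simple `\w+` tokenizer.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (an assumed vectorization)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_terms(term_a, term_b):
    """Compare two terms given as dicts with keys CName, EName, Des, Note, Qut."""
    if term_a["CName"] == term_b["CName"]:
        # Step 1.3: same-name terms must agree on all four remaining elements.
        mismatched = [k for k in ("EName", "Des", "Note", "Qut") if term_a[k] != term_b[k]]
        return {"kind": "same-name", "consistent": not mismatched, "mismatched": mismatched}
    # Step 1.4: otherwise report the cosine similarity of the two definitions.
    sim = cosine_similarity(vectorize(term_a["Des"]), vectorize(term_b["Des"]))
    return {"kind": "similar", "similarity": sim}
```

For same-name terms the sketch reports which elements differ, so that an inconsistency can be traced to a specific field.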

2. The normative reference consistency check specifically comprises:

2.1 searching for a given standard in the standard database to obtain the set of all standards that cite it;

2.2 traversing the normative reference four-tuple vector storage models in the set and determining whether each reference is dated according to whether SYear is empty;

2.3 if the reference is dated, determining whether a specific clause is cited in the text;

2.4 for the set of standards that cite the same specific clause, comparing them for consistency with that clause in the referenced document;

2.5 for the set of standards that do not cite a specific clause, or whose references are undated (and therefore cite no specific clause), comparing them for consistency with the whole text of the referenced document.
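The branching in steps 2.2 to 2.5 can be sketched as a grouping pass that decides which comparison each citing standard needs. This is an illustrative sketch only: the function name and the input shape (triples of citing standard, SYear and Clause) are assumptions, and the content comparison itself is left out.

```python
from collections import defaultdict


def plan_citation_checks(citations):
    """Group citing standards by the comparison each one needs.

    `citations` is an iterable of (citing_standard, SYear, Clause) triples that
    all refer to the same cited standard; SYear and Clause may be None or empty.
    """
    by_clause = defaultdict(list)  # clause -> citing standards (step 2.4)
    whole_text = []                # citing standards compared with the full text (step 2.5)
    for citing, s_year, clause in citations:
        dated = bool(s_year)       # step 2.2: an empty SYear means an undated reference
        if dated and clause:       # step 2.3: dated reference that cites a specific clause
            by_clause[clause].append(citing)
        else:
            whole_text.append(citing)
    return dict(by_clause), whole_text
```

Grouping by clause first means each cited clause needs to be fetched once and can then be compared against every standard that cites it.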

3. The code consistency check for information classification and coding standards specifically comprises:

3.1 constructing a coding table structure tree according to the parent-child relations in the coding table knowledge element storage model, thereby determining whether a class is a non-leaf class (can be subdivided) or a leaf class (cannot be subdivided);

3.2 determining the mapping relations starting from the same-name classes:

3.2.1 if exactly one class of a given name exists in each of the new and old standards, a one-to-one mapping relation is established directly;

3.2.2 if either or both of the new and old standards contain multiple classes of the same name, the mapping relations between the same-name classes are determined by disambiguation based on the structural similarity between class pairs;

3.2.3 based on the results of 3.2.1 and 3.2.2, if one class of a mapped same-name pair is a non-leaf class and the other is a leaf class, all descendant classes of the non-leaf class are mapped to the leaf class of the pair, giving a one-to-many or many-to-one mapping.
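Steps 3.1, 3.2.1 and 3.2.3 can be sketched as follows. The input format (triples of code, name and parent code) and all function names are assumptions made for illustration. Names that occur more than once in either standard (step 3.2.2) are only collected here, because resolving them requires the structural similarity.

```python
from collections import defaultdict


def build_tree(items):
    """items: iterable of (ItemCode, ItemName, parent ItemCode or None)."""
    children = defaultdict(list)
    names = {}
    for code, name, parent in items:
        names[code] = name
        if parent is not None:
            children[parent].append(code)
    return names, dict(children)


def is_leaf(code, children):
    """A class is a leaf class when it has no child classes (step 3.1)."""
    return not children.get(code)


def descendants(code, children):
    """All classes below `code` in the structure tree."""
    found, stack = [], list(children.get(code, []))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children.get(current, []))
    return found


def map_same_name(old_items, new_items):
    """Return (old_code, new_code) pairs and the names left for step 3.2.2."""
    old_names, old_children = build_tree(old_items)
    new_names, new_children = build_tree(new_items)
    old_by_name, new_by_name = defaultdict(list), defaultdict(list)
    for code, name in old_names.items():
        old_by_name[name].append(code)
    for code, name in new_names.items():
        new_by_name[name].append(code)

    mapping, ambiguous = [], []
    for name in old_by_name.keys() & new_by_name.keys():
        olds, news = old_by_name[name], new_by_name[name]
        if len(olds) != 1 or len(news) != 1:
            ambiguous.append(name)                  # needs step 3.2.2
            continue
        old_code, new_code = olds[0], news[0]
        mapping.append((old_code, new_code))        # step 3.2.1
        old_leaf = is_leaf(old_code, old_children)
        new_leaf = is_leaf(new_code, new_children)
        if old_leaf and not new_leaf:               # step 3.2.3, one-to-many
            mapping += [(old_code, d) for d in descendants(new_code, new_children)]
        elif new_leaf and not old_leaf:             # step 3.2.3, many-to-one
            mapping += [(d, new_code) for d in descendants(old_code, old_children)]
    return sorted(mapping), sorted(ambiguous)
```

In this sketch an old leaf class that was subdivided in the new version is mapped to the new class and to every class beneath it, which is one reading of the one-to-many case described above.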

3.3 determining the mapping relations between the non-same-name classes in the new and old standards:

3.3.1 calculating the structural and semantic similarity between the non-same-name classes in the new and old standards;

3.3.2 letting N and O be the sets of classes not yet mapped in the new standard and the old standard, respectively;

3.3.3 repeating the following until set N or O is empty or a preset maximum number of iterations is reached:

taking the class pair $C_n$ and $C_o$ with the greatest structural and semantic similarity in N and O and, if the similarity is greater than the threshold α, establishing a mapping relation between $C_n$ and $C_o$;

if one of $C_n$ and $C_o$ is a leaf class and the other is a non-leaf class, establishing a one-to-many or many-to-one mapping relation according to the method in step 3.2.3;

updating the sets N and O by removing the classes for which a mapping relation has been established.
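The loop in step 3.3.3 can be sketched as a greedy matcher. The function name, the caller-supplied `similarity` function and the default `max_rounds` are assumptions. The text does not say what happens when the best remaining pair does not exceed α; this sketch stops at that point, since no other remaining pair can exceed it either. The leaf and non-leaf expansion of step 3.2.3 is omitted for brevity.

```python
def greedy_match(unmapped_new, unmapped_old, similarity, alpha, max_rounds=1000):
    """Greedily pair unmapped classes in descending order of similarity.

    `similarity(c_n, c_o)` returns the combined structural and semantic
    similarity of a new class and an old class.
    """
    n_set, o_set = set(unmapped_new), set(unmapped_old)
    mapping = []
    for _ in range(max_rounds):
        if not n_set or not o_set:
            break
        # Pick the most similar remaining pair; sorting keeps tie-breaking deterministic.
        score, c_n, c_o = max(
            ((similarity(n, o), n, o) for n in sorted(n_set) for o in sorted(o_set)),
            key=lambda triple: triple[0],
        )
        if score <= alpha:
            break  # assumed stopping rule: nothing left exceeds the threshold
        mapping.append((c_n, c_o, score))
        n_set.discard(c_n)
        o_set.discard(c_o)
    return mapping
```

Each round recomputes the best pair from scratch, which is quadratic in the set sizes per round. Caching the pairwise scores from step 3.3.1 would avoid this and is a natural refinement.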

The similarity between classes is composed of semantic similarity and structural similarity: $\mathrm{Sim} = S_{\text{semantic}} + T_{\text{structure}}$.

The semantic similarity between the new and old classes is calculated mainly from the class name and the name of the parent class. Let the new and old class names be $C_n$ and $C_o$ respectively; after word segmentation and removal of stop words, the word sets $W_n$ and $W_o$ are obtained, from which the semantic similarity $S_{\text{semantic}}$ is then computed.

The structural similarity between the new and old classes consists of the class hierarchy similarity, ancestor class set similarity, sibling class set similarity and child class set similarity: $T_{\text{structure}} = T_{\text{class hierarchy}} + T_{\text{ancestor set}} + T_{\text{sibling set}} + T_{\text{child set}}$.

The class hierarchy similarity is calculated from the hierarchy levels of the new and old classes (denoted $L_n$ and $L_o$): $T_{\text{class hierarchy}} = 1.0 / (|L_o - L_n| + 1.0)$. $T_{\text{ancestor set}}$, $T_{\text{sibling set}}$ and $T_{\text{child set}}$ are obtained by applying the semantic similarity formula to the name strings of all classes in the respective set.
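The way the similarity terms combine can be sketched as follows. The class hierarchy term follows the formula given in the text. The semantic similarity formula is not reproduced in this text, so a Jaccard overlap of word sets is used purely as an assumed stand-in. With that stand-in every term lies in [0, 1], so the totals are not on the same scale as the thresholds reported in Table 2. All function names and the dictionary keys are hypothetical.

```python
def name_similarity(words_a, words_b):
    """Assumed stand-in for the semantic similarity: Jaccard overlap of word sets."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hierarchy_similarity(level_old, level_new):
    """Class hierarchy similarity: 1.0 / (|L_o - L_n| + 1.0), as given in the text."""
    return 1.0 / (abs(level_old - level_new) + 1.0)


def structural_similarity(old, new):
    """old/new: dicts with keys level, ancestors, siblings, children (word sets)."""
    return (
        hierarchy_similarity(old["level"], new["level"])
        + name_similarity(old["ancestors"], new["ancestors"])
        + name_similarity(old["siblings"], new["siblings"])
        + name_similarity(old["children"], new["children"])
    )


def total_similarity(old, new):
    """Sim = semantic similarity + structural similarity."""
    return name_similarity(old["words"], new["words"]) + structural_similarity(old, new)
```

Two classes on the same level score 1.0 on the hierarchy term, and the score falls to 1/2, 1/3 and so on as the level difference grows.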

The code comparison is illustrated by taking GB/T 14885-1994, Fixed asset classification and codes (old version), and GB/T 14885-2010, Fixed asset classification and codes (new version), as examples. By analyzing the extracted coding table, 8147 classes are shared by GB/T14885-. As shown in table 1:

TABLE 1 Standard New and old edition analysis table

Compared against the results of expert comparison, the method provided by the invention automatically completes more than 70% of the work, greatly improving the efficiency of code comparison analysis. The accuracy and recall rates obtained under different threshold settings in the similarity model are shown in Table 2.

TABLE 2 Accuracy and recall rates at different thresholds

Threshold	Accuracy rate	Recall rate
2	78.98%	71.27%
2.5	81.73%	71.29%
3	86.14%	71.12%
3.5	88.95%	70.09%
4	92.04%	68.92%
4.5	93.21%	66.90%
5	94.48%	65.97%
5.5	95.87%	64.64%
6	96.38%	63.89%
6.5	97.62%	62.83%
7	98.05%	62.14%
7.5	98.08%	61.82%

Partial results of the code comparison analysis (threshold = 7.5) are shown in the table below:

TABLE 3 partial results table of code alignment analysis at threshold 7.5

Many embodiments have been described above. Nevertheless, various modifications may be made without departing from the spirit and scope of the disclosure. For example, various forms of the flows shown above may be used with steps reordered, added, or removed.

Claims

1. A method for checking the consistency of standard elements comprises the following steps:

converting the standard file into a knowledge element storage model;

acquiring a knowledge element from a knowledge element storage model, and establishing a vector storage model;

extracting and storing the knowledge elements by using a vector storage model;

and realizing consistency check based on the vector storage model.

2. The method of claim 1, wherein the knowledge element is a term, a normative reference and/or a coding table.

3. The method of claim 2, wherein when the knowledge element is a term, the vector storage model is the following five-tuple: T = <CName, EName, Des, Note, Qut>, wherein CName is the Chinese name of the term, EName is the English name of the term, Des is the definition of the term, Note is the footnote of the term, and Qut is the cited-document information of the term.

4. The method of claim 3, comprising the steps of:

performing a fuzzy search for a given term in a standard database to obtain a term set;

determining, based on the term five-tuple vector storage model, whether the terms are same-name terms or similar terms according to CName;

if the terms are same-name terms, the four elements EName, Des, Note and Qut in the term vectors must be completely consistent;

if the terms are not same-name terms, calculating the similarity of the terms: letting the two terms be A and B, the definitions of the two terms are first vectorized to obtain two text vectors DesA and DesB, the angle between the two text vectors is denoted θ, and the similarity between the definitions of the two terms is the cosine of the angle between the two text vectors:

$$\mathrm{Sim}(A, B) = \cos\theta = \frac{\mathrm{Des}_A \cdot \mathrm{Des}_B}{\lVert \mathrm{Des}_A \rVert \, \lVert \mathrm{Des}_B \rVert}$$

5. The method of claim 1, wherein when the knowledge element is a normative reference, the vector storage model is the following four-tuple: T = <SName, SNum, SYear, Clause>,

wherein SName is the Chinese name of the referenced standard, SNum is the sequence number of the referenced standard, SYear is the year code (date) of the referenced standard, and Clause is a specific clause of the referenced standard.

6. The method of claim 5, comprising the steps of:

searching for a given standard in a standard database to obtain the set of all standards that cite it;

traversing the normative reference four-tuple vector storage models in the set, and determining whether each reference is dated according to whether SYear is empty;

if the reference is dated, determining whether a specific clause is cited in the text;

for the set of standards that cite the same specific clause, comparing them for consistency with that clause in the referenced document;

and for the set of standards that do not cite a specific clause or whose references are undated, comparing them for consistency with the whole text of the referenced document.

7. The method of claim 1, wherein when the knowledge element is a coding table, the vector storage model is the following five-tuple: T = <ParentItem, ChildItem, ItemCode, ItemName, Description>, wherein ParentItem is the parent class of a class, ChildItem is the subclass of a class, ItemCode is the code of the class, ItemName is the name of the class, and Description is its description information.

8. The method of claim 7, wherein a coding table structure tree is constructed based on the parent-child structural relations between coding classes to determine whether a class is a non-leaf class or a leaf class.

9. The method of claim 8, wherein the mapping relations are determined starting from the same-name classes by the steps of: (1) if exactly one class of a given name exists in each of the new and old standards, a one-to-one mapping relation is established directly;

(2) if either or both of the new and old standards contain multiple classes of the same name, the mapping relations between the same-name classes are determined by disambiguation based on the structural similarity between class pairs;

(3) based on the results of (1) and (2), if one class of a mapped same-name pair is a non-leaf class and the other is a leaf class, a one-to-many or many-to-one mapping relation is established between the descendant classes of the non-leaf class and the leaf class of the pair.

10. The method of claim 8, wherein the step of determining the mapping relations between the non-same-name classes in the new and old standards comprises:

(1) calculating the structural and semantic similarity between the non-same-name classes in the new and old standards;

(2) letting N and O be the sets of classes not yet mapped in the new standard and the old standard, respectively;

(3) repeating the following until set N or O is empty or a preset maximum number of iterations is reached:

taking the class pair Cn and Co with the greatest structural and semantic similarity in N and O and, if the similarity is greater than a threshold α, establishing a mapping relation between Cn and Co;

if one of Cn and Co is a leaf class and the other is a non-leaf class, establishing a one-to-many or many-to-one mapping relation according to the method in the step 3;

and updating the sets N and O by removing the classes for which a mapping relation has been established.

The semantic similarity between the new and old classes is calculated mainly from the class name and the name of the parent class. Let the new and old class names be Cn and Co respectively; after word segmentation and removal of stop words, the word sets Wn and Wo are obtained, from which the semantic similarity is then computed.

The class hierarchy similarity is calculated from the hierarchy levels of the new and old classes (denoted Ln and Lo): $T_{\text{class hierarchy}} = 1.0 / (|L_o - L_n| + 1.0)$. $T_{\text{ancestor set}}$, $T_{\text{sibling set}}$ and $T_{\text{child set}}$ are obtained by applying the semantic similarity formula to the name strings of all classes in the respective set.