CN113158654B

CN113158654B - Domain model extraction method and device and readable storage medium

Info

Publication number: CN113158654B
Application number: CN202011301741.5A
Authority: CN
Inventors: 杜佳诺; 连小利; 张莉; 赵子岩; 张航; 樊志强; 李华莹; 刘必欣; 张捷
Original assignee: Beihang University; CETC 15 Research Institute; Research Institute of War of PLA Academy of Military Science
Current assignee: Beihang University; CETC 15 Research Institute; Research Institute of War of PLA Academy of Military Science
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2022-04-29
Anticipated expiration: 2040-11-19
Also published as: CN113158654A

Abstract

The invention discloses a domain model extraction method, a domain model extraction device, and a readable storage medium. The method comprises: performing syntactic analysis on a requirement document to determine the dependency relationships among the segmented words; determining semantic relationships among concepts according to the dependency relationships among the segmented words; and determining a corresponding domain model according to the semantic relationships among the concepts. Deriving the domain model from the dependency relationships in the requirement document in this way improves the accuracy of domain model extraction.

Description

Domain model extraction method and device and readable storage medium

Technical Field

The invention relates to the technical field of natural language identification, in particular to a method and a device for extracting a domain model and a readable storage medium.

Background

A domain model is a visual representation of the important concepts in a domain and the relationships among them. It is used during the analysis phase of software development to analyze how the functional requirements of the system can be met. Depending on need, a domain model may be represented using UML class diagrams, use case diagrams, ontologies, and so on. A domain model consists mainly of concepts, attributes, and relationships. Concepts represent entities or events in the real world; the attributes of a concept are the logical data contained in the entity it represents; and the relationships among concepts represent semantic connections or interactions between the entities they represent. Common relationships include association, aggregation, and inheritance.

The domain model provides structured knowledge about the underlying terms that make up the domain. In addition, system design, particularly in model-based development environments, is often built around domain models. Correctly identifying the concepts and the relationships among them helps in analyzing the system architecture during software development, reduces development difficulty and code redundancy, and mitigates problems such as inconsistent or incomplete requirement analysis by developers. When building a domain model, a developer needs to check the requirement document repeatedly to ensure that the model is consistent with the requirements and that it contains all concepts and relationships the requirements involve. For large applications, building a domain model manually is a very difficult task.

Disclosure of Invention

The embodiment of the invention provides a method and a device for extracting a domain model and a readable storage medium, which are used for improving the accuracy of extracting the domain model.

In a first aspect, an embodiment of the present invention provides a domain model extraction method, including:

performing syntactic analysis on the requirement document to determine the dependency relationships among the segmented words;

determining semantic relationships among concepts according to the dependency relationships among the segmented words;

and determining a corresponding domain model according to the semantic relationships among the concepts.

Optionally, performing syntactic analysis on the requirement document includes:

decomposing the requirement document to obtain the corresponding segmented words;

performing part-of-speech tagging on the segmented words, and determining the corresponding word types according to the part-of-speech tagging results;

determining the dependency relationships among the segmented words based on the word types.

Optionally, after determining the corresponding word types according to the part-of-speech tagging results, the method further includes:

cleaning the segmented words;

extracting the stems of the segmented words in the cleaning result;

and restoring the words to their base forms.

Optionally, determining semantic relationships among concepts according to the dependency relationships among the segmented words includes:

traversing the noun phrases among the segmented words, and determining the dependency relationships between phrases and words and between phrases;

extracting semantic relationships among concepts from the dependency relationships between phrases and words and between phrases.

Optionally, traversing the noun phrases among the segmented words and deriving the dependency relationships between phrases and words and between phrases includes:

if the target node of a dependency relationship whose source node is a word in the current noun phrase falls within the current noun phrase, not deriving a dependency for the current noun phrase;

if the target node of a dependency relationship whose source node is a word in the current noun phrase falls outside the current noun phrase, deriving a dependency for the current noun phrase.

Optionally, deriving a dependency for the current noun phrase includes: if the word reached by the derivation is a source node word in a noun phrase other than the current noun phrase, deriving a dependency relationship between phrases; otherwise, deriving a dependency relationship between a phrase and a word.

Optionally, extracting semantic relationships among concepts according to the dependency relationships between phrases and words and between phrases includes:

extracting association relationships among concepts from the dependency relationships between phrases and words and between phrases, according to the source nodes corresponding to different syntactic structures; and

matching the dependency relationships between phrases and words and between phrases against preset word structures, and identifying aggregation relationships, cardinality relationships, and attribute relationships among concepts.

Optionally, determining a corresponding domain model according to the association relationships among the concepts includes:

traversing the boundary concepts in the association relationships among the concepts;

correcting the association relationships of those boundary concepts that match a preset field;

wherein a boundary concept is a concept with which exactly one other concept has a semantic relationship.

In a second aspect, an embodiment of the present invention provides a domain model extraction apparatus, including:

the analysis unit is used for performing syntactic analysis on the requirement document and determining the dependency relationships among the segmented words;

the relation determining unit is used for determining semantic relationships among concepts according to the dependency relationships among the segmented words;

and the domain model determining unit is used for determining a corresponding domain model according to the semantic relation between the concepts.

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the foregoing domain model extraction method.

By determining the dependency relationships among the segmented words in the requirement document, determining semantic relationships among concepts from those dependency relationships, and determining the corresponding domain model from the semantic relationships, the embodiment of the invention improves the accuracy of domain model extraction.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flow chart of a first embodiment of the present invention;

FIG. 2 is a flowchart of a syntax analysis according to a first embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an apparatus according to a second embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Example one

A first embodiment of the present invention provides a domain model extraction method which, as shown in FIG. 1, includes the following steps:

S101, performing syntactic analysis on the requirement document to determine the dependency relationships among the segmented words;

S102, determining semantic relationships among concepts according to the dependency relationships among the segmented words;

S103, determining a corresponding domain model according to the semantic relationships among the concepts.

By determining the dependency relationships among the segmented words in the requirement document, determining the semantic relationships among concepts from those dependency relationships, and determining the corresponding domain model from the semantic relationships, the embodiment of the invention improves the accuracy of domain model extraction.

Optionally, performing syntactic analysis on the requirement document includes:

decomposing the requirement document to obtain the corresponding segmented words;

Specifically, in this embodiment, the syntactic analysis preprocesses the requirement statements through sentence splitting, word segmentation, part-of-speech tagging, phrase structure analysis, and dependency parsing. The main flow of parsing the input requirement document is shown in FIG. 2 and includes the following steps:

Sentence splitting: the input text is divided into individual sentences.

Word segmentation: each input sentence is divided into individual tokens. A token may be a word, a number, a punctuation mark, or a space.

Part-of-speech tagging: each token produced by the word segmenter is labeled with its part of speech, such as noun (NN), verb (VB), adjective (JJ), preposition (IN), article (DT), or coordinating conjunction (CC).

Phrase structure analysis: the type of each structural unit in the sentence is inferred, such as noun phrase (NP), verb phrase (VP), prepositional phrase (PP), verb (VB), article (DT), or preposition (IN).

Dependency parsing: the grammatical relations among the individual words in the sentence are obtained and represented as dependency relationships. The input of dependency parsing is a sentence and the output is a directed acyclic graph consisting of relational triples of the form <word, dependency category, word>. Following the Universal Dependencies framework, the dependency categories used in this embodiment mainly include: nominal subject (nsubj), passive nominal subject (nsubjpass), direct object (dobj), adjectival modifier (amod), nominal modifier (nmod), clausal modifier of noun (acl), relative clause modifier (acl:relcl), and the like.

Here, the nominal modifier (nmod) represents a prepositional phrase structure in the sentence, the clausal modifier of noun (acl) represents a complement structure in infinitive or participle form, and the relative clause modifier (acl:relcl) represents a relative clause modification structure.
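As an illustration of the data these steps produce, the following minimal Python sketch splits a text into sentences and tokens with regular expressions and stores dependency triples in a small record type. The function names, the `Dependency` type, and the hand-written triples for the sample sentence are assumptions made for illustration; the embodiment does not name a specific parser, and this sketch drops space tokens for brevity.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    """One relational triple <word, dependency category, word>."""
    source: str    # source node (governing word)
    relation: str  # dependency category, e.g. "nsubj"
    target: str    # target node (dependent word)


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Naive tokenizer: numbers, words, and single punctuation marks."""
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", sentence)


text = "The system stores orders. Each order has 2 items."
sentences = split_sentences(text)
tokens = tokenize(sentences[0])

# Hand-written dependency triples for the first sentence (not parser output).
parse = [
    Dependency("stores", "nsubj", "system"),
    Dependency("stores", "dobj", "orders"),
    Dependency("system", "det", "The"),
]
```

In practice the triples would come from a dependency parser; the record type only fixes the shape that the later derivation and extraction steps consume.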

cleaning the segmented words;

extracting the stems of the segmented words in the cleaning result;

and restoring the words to their base forms.

In this embodiment, after the word segmentation result is obtained, the syntactic analysis further includes the following steps:

Stop-word removal: stop words are words that occur frequently in text but carry no specific meaning, such as "a", "the", and "any".

Stemming and lemmatization: plural forms of nouns, participle forms of verbs, inflected forms of adjectives and adverbs, and the like are converted into the base forms of these words.

Atomic noun phrases and verbs are then extracted in preparation for extracting the concepts and relationships of the domain model.
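The cleaning and base-form steps can be sketched as follows. This is a toy illustration only: the stop-word set extends the three examples given in the text with "an", and the `LEMMAS` lookup table is a hypothetical stand-in for a real stemmer or lemmatizer, which the embodiment does not specify.

```python
# Stop words from the text ("a", "the", "any") plus "an" as an assumption.
STOP_WORDS = {"a", "an", "the", "any"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {
    "orders": "order",    # plural noun -> singular
    "stores": "store",    # inflected verb -> base form
    "stored": "store",    # participle -> base form
    "larger": "large",    # comparative adjective -> base form
}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that are stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def lemmatize(tokens: list[str]) -> list[str]:
    """Map each token to its base form; unknown tokens are lower-cased."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


cleaned = lemmatize(remove_stop_words(["The", "system", "stores", "orders"]))
```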

Specifically, in this embodiment, based on the word segmentation obtained by the foregoing syntax analysis, the dependency relationship between the words obtained by the syntax analysis is further derived to obtain the phrase-level dependency relationship. Phrase-level dependencies may be represented as a relational triple < phrase, dependency type, phrase > or < phrase, dependency type, word >.

Pseudo code of the dependency derivation algorithm employed in the present embodiment is shown in table 1.

Table 1 dependency derivation algorithm

In this embodiment, the dependency derivation algorithm inputs all words, noun phrases, and dependencies among words obtained by parsing, and outputs dependencies among phrases and between phrases and words, and the specific process includes:

All noun phrases NP in the requirement document are examined.

For each word token₁ in a noun phrase NP₁: if the target node of a dependency dep(token₁, token₂) that has this word as its source node still falls within the noun phrase, the dependency is not derived.

If the target node of the dependency falls outside the noun phrase, the dependency dep is derived as follows:

if token₂ belongs to another noun phrase NP₂, the derived dependency is dep(NP₁, NP₂); otherwise, the derived dependency is dep(NP₁, token₂).

Thereby determining the dependency relationships between phrases and words and between phrases.
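The derivation procedure described above can be sketched in Python as follows. Table 1 itself is not reproduced in this text, so this is a reading of the prose rather than the patent's pseudocode. The `Dep` type and function name are illustrative, and the sketch assumes each word occurs in at most one noun phrase and identifies words by their text rather than by position.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    source: str    # source node: a word or a phrase
    relation: str  # dependency category
    target: str    # target node: a word or a phrase


def derive_phrase_dependencies(noun_phrases, word_deps):
    """Lift word-level dependencies to phrase-level dependencies.

    noun_phrases: list of tuples of words, e.g. [("customer", "account")]
    word_deps:    list of Dep between single words
    """
    # Map every word to the text of the noun phrase that contains it.
    phrase_of = {w: " ".join(np) for np in noun_phrases for w in np}
    derived = []
    for np in noun_phrases:
        np_text = " ".join(np)
        for token1 in np:
            for dep in word_deps:
                if dep.source != token1:
                    continue
                token2 = dep.target
                if token2 in np:
                    continue  # target stays inside the phrase: not derived
                # Target in another phrase -> dep(NP1, NP2), else dep(NP1, token2).
                derived.append(Dep(np_text, dep.relation, phrase_of.get(token2, token2)))
    return derived


# Hand-written parse of "customer account of online shop".
phrases = [("customer", "account"), ("online", "shop")]
deps = [
    Dep("account", "compound", "customer"),
    Dep("account", "nmod", "shop"),
    Dep("shop", "amod", "online"),
    Dep("shop", "case", "of"),
]
result = derive_phrase_dependencies(phrases, deps)
```

Here the `compound` and `amod` dependencies stay inside their phrases and are dropped, while `nmod` becomes a phrase-to-phrase dependency and `case` a phrase-to-word dependency.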

Specifically, in this embodiment, the semantic relationships between concepts include association relationships, aggregation relationships, cardinality relationships, and attribute relationships. Association relationships comprise direct relationships and indirect relationships. A direct relationship connects two concepts directly and is expressed by a verb or verb phrase (including its participle or infinitive form) or by a preposition. An indirect relationship is obtained by transitivity from direct relationships: if there is a direct relationship between concept A and concept B and a direct relationship between concept B and concept C, there is an indirect relationship between concept A and concept C.

In this embodiment, extracting the association relationships among concepts from the dependency relationships between phrases and words and between phrases, according to the source nodes corresponding to different syntactic structures, includes: first identifying the direct relationships among the concepts, and then deriving the indirect relationships from the direct relationships, thereby obtaining all association relationships.

Specifically, extracting association relations between concepts according to source nodes corresponding to different syntactic structures according to dependency relations between phrases and words and between phrases, including:

For a direct relationship expressed by a subject-predicate-object structure, the subject is used as the source concept of the relationship, the object as the target concept, and the predicate as the content of the relationship.

For a subject-predicate-object relationship inside a relative clause, the acl:relcl dependency is used to find the noun phrase referred to by the clause's subject "that" or "which"; this noun phrase is used as the source concept of the relationship, and the object and the predicate of the clause are used as the target concept and the content of the relationship, respectively.

For direct relationships expressed by prepositional phrase structures, extraction uses the nominal modifier (nmod) dependency; the pseudocode of the extraction algorithm is shown in Table 2.

Table 2 Prepositional phrase extraction algorithm pseudocode

Taking a set of all atomic noun phrases and verbs as input, checking whether each noun phrase or verb is a source node of an nmod dependency.

If the source node of the nmod dependency is a noun phrase, the noun phrase of the nmod dependency source node is used as the source concept of the relation, the noun phrase of the target node is used as the target concept of the relation, and a preposition is used as the content of the relation.

If the source node of the nmod dependency is a verb, the direct object of the verb is used as the source concept of the relation, the noun phrase of the nmod dependency target node is used as the target concept of the relation, and the preposition is used as the content of the relation.
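The two nmod cases above can be sketched as follows. Table 2 is not reproduced in this text, so this is an interpretation of the prose. The `Dep` and `Relation` types are illustrative, and the sketch assumes the preposition is attached to the nmod target by a `case` dependency and that a verb's direct object is given by a `dobj` dependency.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    source: str
    relation: str
    target: str


class Relation(NamedTuple):
    source_concept: str
    content: str
    target_concept: str


def extract_nmod_relations(noun_phrases, verbs, deps):
    """Extract direct relationships expressed by prepositional phrases."""
    # Assumption: the preposition hangs off the nmod target via "case".
    case_of = {d.source: d.target for d in deps if d.relation == "case"}
    dobj_of = {d.source: d.target for d in deps if d.relation == "dobj"}
    relations = []
    for d in deps:
        if d.relation != "nmod" or d.target not in case_of:
            continue
        prep = case_of[d.target]
        if d.source in noun_phrases:
            # Noun phrase source: the noun phrase itself is the source concept.
            relations.append(Relation(d.source, prep, d.target))
        elif d.source in verbs and d.source in dobj_of:
            # Verb source: the verb's direct object is the source concept.
            relations.append(Relation(dobj_of[d.source], prep, d.target))
    return relations


# Hand-written dependencies for "the system stores the order in the database"
# and "account of customer" (base forms, stop words removed).
deps = [
    Dep("store", "nsubj", "system"),
    Dep("store", "dobj", "order"),
    Dep("store", "nmod", "database"),
    Dep("database", "case", "in"),
    Dep("account", "nmod", "customer"),
    Dep("customer", "case", "of"),
]
nps = {"system", "order", "database", "account", "customer"}
nmod_relations = extract_nmod_relations(nps, {"store"}, deps)
```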

For direct relationships expressed by verbal complement structures, extraction uses the clausal modifier of noun (acl) dependency; the pseudocode of the extraction algorithm is shown in Table 3.

Table 3 Verbal complement extraction algorithm pseudocode

Using the collection of all atomic noun phrases as input, check if each noun phrase is the source node of an acl dependency.

If a noun phrase is the source node of an acl dependency and the target node of that dependency is a transitive verb or verb phrase, the noun phrase is taken as the source concept of the relationship, the object following the verb or verb phrase is taken as the target concept, and the verb or verb phrase is taken as the content of the relationship.
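A minimal sketch of the acl rule follows. Table 3 is not reproduced in this text, so this follows the prose only. The types and names are illustrative, and a verb is treated as transitive here simply when it has a `dobj` dependent, which is an assumption of this sketch.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    source: str
    relation: str
    target: str


class Relation(NamedTuple):
    source_concept: str
    content: str
    target_concept: str


def extract_acl_relations(noun_phrases, deps):
    """Extract direct relationships expressed by verbal complements (acl)."""
    dobj_of = {d.source: d.target for d in deps if d.relation == "dobj"}
    relations = []
    for d in deps:
        # The acl source must be a noun phrase and the target verb must
        # take a direct object (treated here as "transitive").
        if d.relation == "acl" and d.source in noun_phrases and d.target in dobj_of:
            relations.append(Relation(d.source, d.target, dobj_of[d.target]))
    return relations


# Hand-written dependencies for "a sensor monitoring the temperature".
deps = [
    Dep("sensor", "acl", "monitoring"),
    Dep("monitoring", "dobj", "temperature"),
]
acl_relations = extract_acl_relations({"sensor", "temperature"}, deps)
```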

Indirect relationships among the concepts are then deduced from the extracted direct relationships, thereby obtaining all association relationships among the concepts.
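Deriving indirect relationships amounts to a transitive closure over the direct relationships. The sketch below is one way to compute it, assuming relationships are reduced to directed (source, target) pairs and that transitivity is applied repeatedly along chains; the text states only the single A, B, C case, so the repeated application is an assumption.

```python
def indirect_relations(direct):
    """Return the pairs implied by transitivity that are not already direct.

    direct: iterable of (source_concept, target_concept) pairs
    """
    direct = set(direct)
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # A -> B and B -> D imply A -> D (self-relations are skipped).
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure - direct


indirect = indirect_relations({("A", "B"), ("B", "C")})
```

The union of the direct pairs and the returned indirect pairs gives all association relationships.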

Embodiments of identifying aggregation relationships may include:

Word structures such as "contain", "include", and "type of", as well as possessive forms of nouns, express aggregation relationships. Taking the statement "A contains B" or "A's B" as an example, an aggregation relationship whose source concept is B and whose target concept is A can be extracted.

Embodiments of identifying cardinality relationships may include:

Indefinite articles, ordinals, and the singular and plural forms of nouns in a word structure express cardinality relationships.

If both the source concept and the target concept of an associative relationship are singular, the relationship is a one-to-one relationship.

If both the source concept and the target concept of an associative relationship are plural, the relationship is a many-to-many relationship.

If the source concept of an associative relationship is singular and the target concept is plural, the relationship is a one-to-many relationship.

If the source concept of an associative relationship is plural and the target concept is singular, the relationship is a many-to-one relationship.

If a source concept or a target concept of an associative relationship is preceded by an explicit numerical modification, the number represents a cardinality relationship.
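The cardinality rules above reduce to a small decision function. The sketch below is illustrative: the function name, the boolean plural flags, and the optional explicit counts are assumptions about how the inputs might be encoded.

```python
from typing import Optional


def cardinality(source_plural: bool, target_plural: bool,
                source_count: Optional[int] = None,
                target_count: Optional[int] = None) -> str:
    """Return a cardinality label such as 'one-to-many'.

    An explicit numeric modifier, when present, overrides the
    singular/plural form of that side.
    """
    def side(plural: bool, count: Optional[int]) -> str:
        if count is not None:
            return str(count)
        return "many" if plural else "one"

    return f"{side(source_plural, source_count)}-to-{side(target_plural, target_count)}"


# "A car has four wheels": singular source, explicit number on the target.
label = cardinality(False, True, target_count=4)
```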

The specific implementation of the identification of the attribute relationship may include:

Word structures of the form "identified by", "retrieved by", and the like may express attributes. Taking the statement "A is identified by B" as an example, B can be extracted as an attribute of the concept A.

Adjectives that modify a concept express attributes of that concept; in natural language they appear as attributive modifiers or in subject-linking verb-predicative structures. An attributive modifier expresses an attribute of the noun phrase it modifies, and a subject-linking verb-predicative structure expresses an attribute of the subject.

An intransitive verb with an adverbial or complement modifier expresses an attribute. Taking the statement "The train arrives in the morning at 10 am." as an example, from the intransitive verb "arrives" together with the complement that follows it, it can be inferred that the concept "train" should have an attribute "arrival time".

After the semantic relation among the concepts is obtained, the aggregation relation, the cardinal number relation and the attribute relation can be further distinguished, so that the identification accuracy of the three is improved.

In this embodiment, a concept is called a boundary concept if exactly one other concept has a relationship with it. For the semantic relationships obtained among the concepts, all association relationships containing a boundary concept are checked, and if the content of an association relationship matches a structure such as "include in", "including", or "containing", the association relationship is corrected into an aggregation relationship or an attribute. For example, based on an existing domain model extraction result, all boundary concepts of the domain model and the association relationships connecting them may be checked. If the content of an association relationship matches a synonym of a pattern expressing aggregation, such as "contain" or "include", the association relationship is corrected into an aggregation relationship or an attribute.
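The boundary-concept check can be sketched as follows. The tuple layout, the pattern list, and the exact string matching are assumptions of this sketch; it also always rewrites a matching association to an aggregation, whereas the text allows either an aggregation or an attribute.

```python
from collections import defaultdict

# Patterns taken from the examples in the text; matching is by exact string.
AGGREGATION_PATTERNS = {"include in", "including", "containing", "contain", "include"}


def correct_boundary_relations(relations):
    """Rewrite associations that touch a boundary concept and match a pattern.

    relations: list of (source, content, target, kind) tuples
    """
    neighbours = defaultdict(set)
    for source, _, target, _ in relations:
        if source != target:
            neighbours[source].add(target)
            neighbours[target].add(source)
    # A boundary concept is related to exactly one other concept.
    boundary = {c for c, others in neighbours.items() if len(others) == 1}

    corrected = []
    for source, content, target, kind in relations:
        touches_boundary = source in boundary or target in boundary
        if kind == "association" and touches_boundary and content in AGGREGATION_PATTERNS:
            kind = "aggregation"
        corrected.append((source, content, target, kind))
    return corrected


model = [
    ("order", "include", "item", "association"),
    ("customer", "place", "order", "association"),
    ("order", "store in", "database", "association"),
]
corrected_model = correct_boundary_relations(model)
```

In this example "item", "customer", and "database" are boundary concepts, but only the "include" association matches a pattern and is rewritten.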

Measured against the extraction results of domain modeling experts, the method extracts 95% of the relationships in the requirement document.

In summary, the method extends the extraction rules for domain models and introduces several new dependency relationships and grammatical structures for domain model extraction, so that information expressed by prepositional phrase structures and complement structures can be extracted more comprehensively and accurately. The method also introduces the notion of a boundary concept in the domain model and provides a way to check the association relationships that contain boundary concepts, which can improve the accuracy of identifying association relationships, aggregation relationships, and attributes.

Example two

A second embodiment of the present invention provides a domain model extraction apparatus, as shown in fig. 3, including:

Example three

A third embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the domain model extraction method of the first embodiment.

In an alternative embodiment, the computer program when executed by a processor implements:

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A domain model extraction method is characterized by comprising the following steps:

determining a corresponding domain model according to the semantic relation between the concepts;

determining semantic relations between concepts according to the dependency relations between the segmented words, including:

extracting semantic relations between concepts according to the dependency relations between phrases and words and between phrases;

traversing the noun phrases among the segmented words and deriving the dependency relationships between phrases and words and between phrases, including:

if the target node corresponding to the dependency relationship taking the word in the current noun phrase as the source node falls outside the current noun phrase, deriving the current noun phrase;

deriving current noun phrases, including: if the derived words are source node words in noun phrases except the current noun phrase, deriving to obtain the dependency relationship between the phrases, otherwise deriving to obtain the dependency relationship between the phrases and the words;

extracting semantic relations between concepts according to the dependency relations between the phrases and the words and between the phrases, comprising:

matching the phrases and the dependency relationships among the words and among the phrases according to a preset word structure, and identifying an aggregation relationship, a cardinal number relationship and an attribute relationship among concepts;

determining a corresponding domain model according to the association relationship, comprising:

2. The domain model extraction method of claim 1, wherein parsing the requirements document comprises:

decomposing the requirement document to obtain corresponding word segmentation;

3. The method of extracting a domain model according to claim 2, wherein after determining the corresponding segmentation type according to the part-of-speech tagging result, further comprising:

cleaning the word segmentation;

extracting word segmentation word stems in the cleaning result;

and restoring the word stem.

4. A domain model extraction device, comprising:

the domain model determining unit is used for determining a corresponding domain model according to the semantic relation between the concepts;

deriving the current noun phrase, including: if the derived words are source node words in noun phrases except the current noun phrase, deriving to obtain the dependency relationship between the phrases, otherwise deriving to obtain the dependency relationship between the phrases and the words;

5. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.