CN110717014A

CN110717014A - Ontology knowledge base dynamic construction method

Info

Publication number: CN110717014A
Application number: CN201910866024.8A
Authority: CN
Inventors: 郭新龙
Original assignee: Beijing Sihai Xintong Technology Co Ltd
Current assignee: Beijing Sihai Xintong Technology Co Ltd
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2020-01-21
Anticipated expiration: 2039-09-12
Also published as: CN110717014B

Abstract

The invention provides a method for dynamically constructing an ontology knowledge base, comprising the following steps: converting the word-segmented natural language to be processed into a grammar structure; converting the resulting grammar structure into a predicate-argument structure; and, based on a pre-constructed background knowledge ontology, establishing a mapping between the predicate-argument structure and the final semantic structure through a preset ontology mapping algorithm, thereby completing the extraction of semantic information. Building on conventional semantic information extraction, the method introduces the predicate-argument structure (PA structure) as an intermediate form between the grammar structure and the semantic structure, which both greatly reduces program complexity and makes a variety of mapping modes possible for the system. The mapping algorithm is designed to keep knowledge base queries and word similarity comparisons to a minimum, which reduces the time complexity of the algorithm.

Description

Ontology knowledge base dynamic construction method

Technical Field

The invention relates to the technical field of semantic extraction, in particular to a dynamic construction method of an ontology knowledge base.

Background

In recent years, with the spread of computers and the development of the internet, more and more information is beginning to appear in the form of electronic documents. The massive data and information have very important research value, and how to make the computer effectively utilize the resources and data is a very urgent task.

However, the information on the network is fragmented, non-standardized, and sometimes inaccurate. Before a computer can use it, the information usually has to be processed manually: inaccurate information is removed, and the remainder is summarized, normalized, and presented in a computer-usable format. This manual process is time-consuming and labor-intensive and cannot keep pace with the growth of internet information, so research on new knowledge construction techniques meets a real and current need.

Based on this need, information extraction and dynamic knowledge base construction techniques have been proposed. A knowledge base for a specific field is dynamically constructed on the basis of a general, high-quality knowledge base (background knowledge) and semantic information extraction, so that information can be extracted more efficiently and people are freed from laborious manual construction. However, existing information extraction and dynamic knowledge base construction techniques still suffer from excessively complex programs.

Disclosure of Invention

The invention provides a dynamic construction method of an ontology knowledge base aiming at the problem that manual construction of the knowledge base in the natural language processing field is time-consuming and labor-consuming, and a predicate-argument structure (PA structure) is introduced as an intermediate form of a grammar structure and a semantic structure on the basis of original semantic information extraction, so that a complex task is divided into two relatively simple tasks, the complexity of a program is greatly reduced, and meanwhile, the possibility is provided for various mapping modes of a system.

The semantic structure is based on ontology theory and is represented by a Resource Description Framework (RDF).

First, the present invention constructs some necessary background knowledge including event class, argument class and semantic role class for the characteristics of the PA structure. In order to match all PA structures as much as possible, two event classes, namely a general event class and a special event class, are designed, and two mapping ideas are provided for mapping the PA structures and the background knowledge ontology.

Meanwhile, the invention designs and implements a word similarity algorithm based on the synonym dictionary Synonym Forest (Tongyici Cilin) Extended Edition, and, on the basis of semantic similarity theory, provides an algorithm for mapping the PA structure to the background knowledge ontology. The algorithm keeps knowledge base queries and word similarity comparisons to a minimum, which reduces its time complexity.

Then, the invention implements storage management of the ontology knowledge using the open-source Eclipse RDF4J framework (formerly named Sesame), and encapsulates the access operations of the knowledge base on top of the API provided by the RDF4J development kit, thereby simplifying development of the knowledge base.

Finally, a prototype system is designed and realized, the effectiveness of the mapping algorithm is verified through a simulation experiment, and meanwhile, the influence of the semantic similarity threshold value in the mapping algorithm on the mapping result is explored.

Specifically, the ontology knowledge base dynamic construction method comprises the following steps:

converting the word-segmented natural language to be processed into a grammar structure;

converting the resulting grammar structure into a predicate-argument structure;

and establishing a mapping relation between the converted predicate-argument structure and the final semantic structure through a preset ontology mapping algorithm based on a pre-constructed background knowledge ontology, and finishing the extraction of semantic information.

Further, the background knowledge ontology comprises an event class ontology, a semantic role class ontology and an argument class ontology; the event class ontology corresponds to the predicate in the predicate-argument structure; the argument class ontology corresponds to the arguments in the predicate-argument structure.

Further, the event class ontology includes two event classes, a general event class and a special event class.

Further, the semantic structure is based on ontology theory and is represented by a Resource Description Framework (RDF).

Further, after the extraction of the semantic information is completed, the method further includes: using the Eclipse RDF4J framework to implement storage management of the semantic information and construct the ontology knowledge base; and encapsulating the access operations of the ontology knowledge base on the basis of the API provided by the RDF4J development kit.

Furthermore, each class in the background knowledge ontology has a text attribute, and the attribute value of each class is a description character string corresponding to the class; the ontology mapping algorithm comprises a predicate matching algorithm, a semantic role matching algorithm and an argument matching algorithm.

Further, the predicate matching algorithm comprises the following steps:

step one, searching the event classes in the background knowledge ontology for an event class whose text attribute value is the predicate; if one is found, matching the predicate to that event class; if not, continuing to step two;

step two, searching the event classes in the background knowledge ontology for an event class whose text attribute value is a synonym of the predicate; if one is found, matching the predicate to that event class; if not, continuing to step three;

step three, calculating the similarity between the predicate and each event class in the background knowledge ontology using a preset word similarity algorithm, and, for each event class, adding the event class and its similarity to a set to be matched if the similarity is greater than a preset threshold; finally, taking the event class with the maximum similarity from the set to be matched; if that similarity is greater than the preset threshold, the matching succeeds; otherwise the matching fails and the construction mode based on the general event class is adopted.
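The three steps above can be sketched as a single lookup function. This is a minimal sketch under stated assumptions, not the patented implementation: the names `match_predicate`, `synonyms` and `similarity`, the dictionary layout of `event_classes`, and the default threshold of 0.8 are all hypothetical, since the source only speaks of a "preset" algorithm and threshold.

```python
from typing import Callable, Dict, Iterable, Optional


def match_predicate(
    predicate: str,
    event_classes: Dict[str, str],              # class id -> text attribute (description string)
    synonyms: Callable[[str], Iterable[str]],   # hypothetical synonym lookup
    similarity: Callable[[str, str], float],    # hypothetical word similarity in [0, 1]
    threshold: float = 0.8,                     # assumed value; the source only says "preset"
) -> Optional[str]:
    """Return the matched event class id, or None to signal the general-event fallback."""
    by_text = {text: cls for cls, text in event_classes.items()}

    # Step one: exact match on the text attribute.
    if predicate in by_text:
        return by_text[predicate]

    # Step two: match through a synonym of the predicate.
    for word in synonyms(predicate):
        if word in by_text:
            return by_text[word]

    # Step three: collect every class above the threshold, then keep the best one.
    candidates = []
    for cls, text in event_classes.items():
        score = similarity(predicate, text)
        if score > threshold:
            candidates.append((score, cls))
    if candidates:
        return max(candidates, key=lambda item: item[0])[1]
    return None  # matching failed: build the event from the general event class
```

Ordering the steps from cheapest to most expensive means the similarity function is only called when both dictionary lookups fail, which is one way to read the source's goal of reducing knowledge base queries and similarity comparisons.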

Further, the semantic role matching algorithm comprises:

if the predicate matching fails, outputting the value of the semantic role according to the one-to-one correspondence between the semantic roles of the predicate-argument structure and the attributes of the background knowledge ontology;

and if the predicate matching succeeds, searching the attributes of the matched predicate for the attribute whose parent attribute is the semantic role, and outputting that attribute.
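The two branches above can be illustrated with a small lookup. This is a sketch only: the role labels (`A0`, `A1`, `LOC`), the `bg:` and `ev:` identifiers, and the two dictionaries are hypothetical stand-ins for the background knowledge ontology, and the final fallback (returning the general attribute when no sub-attribute is found) is an assumption that the source does not state.

```python
from typing import Dict, Optional


def match_role(
    role: str,
    event_class: Optional[str],
    general_props: Dict[str, str],          # semantic role -> attribute of the general event class
    sub_props: Dict[str, Dict[str, str]],   # event class -> {attribute: parent attribute}
) -> str:
    """Return the ontology attribute that should carry the value of `role`."""
    general = general_props[role]

    # Predicate matching failed: use the one-to-one role/attribute correspondence.
    if event_class is None:
        return general

    # Predicate matching succeeded: find the attribute whose parent is the role's attribute.
    for prop, parent in sub_props.get(event_class, {}).items():
        if parent == general:
            return prop

    # Assumption (not in the source): fall back to the general attribute.
    return general


general = {"A0": "bg:A0", "A1": "bg:A1", "LOC": "bg:LOC"}
subs = {"ev:MeetingEvent": {"ev:meetingParty": "bg:A0", "ev:metParty": "bg:A1"}}
print(match_role("A1", "ev:MeetingEvent", general, subs))
```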

Further, the argument matching algorithm comprises:

step one, searching the argument classes in the background knowledge ontology for an argument class whose text attribute value is the argument; if one is found, matching the argument to that argument class; if not, continuing to step two;

step two, searching the argument classes in the background knowledge ontology for an argument class whose text attribute value is a synonym of the argument; if one is found, matching the argument to that argument class; if not, continuing to step three;

step three, calculating the similarity between the argument and each argument class in the background knowledge ontology using a preset word similarity algorithm, and, for each argument class, adding the argument class and its similarity to a set to be matched if the similarity is greater than a preset threshold; finally, taking the argument class with the maximum similarity from the set to be matched; if that similarity is greater than the preset threshold, the matching succeeds; otherwise the matching fails and the argument is represented by its argument character string.

Further, the preset word similarity algorithm is a similarity algorithm based on the Synonym Forest Extended Edition, and comprises the following steps:

judging whether the two words A and B to be compared are in the same tree; if A and B are not in the same tree, the similarity Sim(A, B) is given by:

Sim(A, B) = f

wherein f is a preset constant;

if A and B are in the same tree, further determining the layer at which A and B branch; when the branch layer is not greater than 5, determining the value of the preset coefficient λ according to the branch layer and substituting it into the similarity formula to calculate Sim(A, B),

wherein n is the total number of nodes in the branch layer, and k is the distance between the two branches;

when the branch layer is greater than 5, the two words A and B are in the same row of the synonym dictionary, and their similarity is determined by the symbol in the eighth position of the code: when the symbol is "=", Sim(A, B) = 1; when the symbol is "#", Sim(A, B) = e, where e is a preset constant.
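The branch-layer formula itself does not appear in this text. The sketch below therefore assumes a form that is commonly used in the literature on Synonym Forest similarity and that uses the same symbols, Sim(A, B) = λ · cos(n · π / 180) · (n − k + 1) / n. The constants `F`, `E` and `LAMBDA`, the helper names, and the toy code list are likewise assumptions for illustration, not values taken from the source.

```python
import math
from typing import Iterable, List

# Assumed constants; the source only calls them "preset".
F = 0.1                                          # words in different trees
E = 0.5                                          # same row, marked "#"
LAMBDA = {2: 0.65, 3: 0.80, 4: 0.90, 5: 0.96}    # coefficient per branch layer


def levels(code: str) -> List[str]:
    """Split an 8-character code such as 'Ad02D01=' into its 5 level fields."""
    return [code[0], code[1], code[2:4], code[4], code[5:7]]


def _index(part: str) -> int:
    """Position of a branch among its siblings (digits as int, letters by code point)."""
    return int(part) if part.isdigit() else ord(part)


def sim_codes(a: str, b: str, all_codes: Iterable[str]) -> float:
    """Similarity of two dictionary codes; `all_codes` must contain both a and b."""
    la, lb = levels(a), levels(b)
    if la[0] != lb[0]:
        return F                                  # not in the same tree
    if la == lb:
        # Same row: decided by the 8th symbol. "@" rows hold a single word,
        # so treating them as 1.0 is an assumption.
        return E if a[7] == "#" else 1.0
    depth = next(i for i in range(1, 5) if la[i] != lb[i])   # 0-based branch position
    siblings = {levels(c)[depth] for c in all_codes if levels(c)[:depth] == la[:depth]}
    n = len(siblings)                             # total nodes in the branch layer
    k = abs(_index(la[depth]) - _index(lb[depth]))            # distance between branches
    return LAMBDA[depth + 1] * math.cos(n * math.pi / 180) * (n - k + 1) / n


codes = ["Ad02D01=", "Ad02E01=", "Ad02F01@", "Ae07A05#", "Bn01A01="]  # toy data
print(sim_codes("Ad02D01=", "Ad02F01@", codes))
```

A word can carry several codes in the real dictionary; a caller would typically take the maximum of `sim_codes` over all code pairs of the two words, which this sketch leaves out.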

The technical scheme of the invention has the following beneficial effects:

The method converts the word-segmented natural language to be processed into a grammar structure; converts the resulting grammar structure into a predicate-argument structure; and, based on a pre-constructed background knowledge ontology, establishes a mapping between the predicate-argument structure and the final semantic structure through a preset ontology mapping algorithm, thereby completing the extraction of semantic information. Building on conventional semantic information extraction, the predicate-argument structure (PA structure) is introduced as an intermediate form between the grammar structure and the semantic structure, which greatly reduces program complexity and makes a variety of mapping modes possible for the system. The mapping algorithm designed by the invention keeps knowledge base queries and word similarity comparisons to a minimum, which reduces the time complexity of the algorithm.

Drawings

FIG. 1 is a flow chart of the semantic information extraction process of the present invention;

FIG. 2 is a flow chart of the similarity calculation algorithm of the present invention;

FIG. 3 is a schematic diagram of a PA structure;

FIG. 4 is a flow chart of a predicate matching algorithm of the present invention;

FIG. 5 is a flow chart of the semantic role matching algorithm of the present invention;

FIG. 6 is a class diagram of the prototype system of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The embodiment of the invention provides a dynamic ontology knowledge base construction method aiming at the problems that the manual construction of a knowledge base in the field of natural language processing is time-consuming and labor-consuming in the prior art.

The key problems and innovation points solved by the embodiment of the invention comprise the following three aspects:

The conversion of natural language into a semantic structure is very complex because natural language is flexible and ambiguous. To address this, the invention uses semantic role labeling and takes its product, the predicate-argument structure (PA structure), as an intermediate medium between the grammar structure and the semantic structure, which simplifies the whole process and improves the flexibility of the system. Semantic role labeling is a shallow semantic analysis method that simplifies deep semantic analysis, and in many cases its results are more accurate than those of deep semantic analysis.

The conversion of the PA structure into the semantic structure is both the main difficulty and the core of the overall scheme. To solve it, the invention first establishes a set of background knowledge, and the conversion from the PA structure to the semantic structure is carried out on that basis. The problem thus becomes one of mapping the PA structure to the background knowledge. Because the PA structure is the product of labeling natural language, its predicates and arguments are essentially character strings; a description string is therefore added to each item of the background knowledge, so that the whole process reduces to string matching, for which the concept of semantic similarity is studied. Finally, a string matching algorithm is implemented based on semantic similarity; it keeps knowledge base queries and word similarity comparisons to a minimum, which optimizes its time complexity.

The establishment of the ontology knowledge base is also a key problem of the scheme. Because the ontology knowledge base needs to be shared within the laboratory, laboratory users must be able to query and modify it within the scope of their permissions. This places high demands on the knowledge base: it must be efficient and reliable, and it must also provide a network-based access mechanism and user permission management. Since the ontology is represented in RDF, the ontology repository is essentially an RDF management system. The invention implements these functions with the Eclipse RDF4J framework, and the resulting system has been used in the laboratory with good results.

In order to solve the above technical problem, the method for dynamically constructing an ontology knowledge base of the embodiment includes:

The background knowledge ontology comprises an event class ontology, a semantic role class ontology and an argument class ontology; the event class ontology corresponds to the predicate in the predicate-argument structure; the argument class ontology corresponds to the arguments in the predicate-argument structure. The event class ontology comprises two event classes, namely a general event class and a special event class. The semantic structure is based on ontology theory and is represented with the Resource Description Framework (RDF). Furthermore, each class in the background knowledge ontology has a text attribute, whose value is the description character string corresponding to the class; the ontology mapping algorithm comprises a predicate matching algorithm, a semantic role matching algorithm and an argument matching algorithm.

The following describes the specific implementation process of this embodiment:

1. design of ontology knowledge base for geographic field

1.1 background knowledge

The invention adopts semantic information extraction based on background knowledge. The advantage is that the background knowledge guides the conversion from the PA structure to the semantic structure: the background knowledge contains the semantic definitions corresponding to the predicates, semantic roles and arguments of the PA structure, and during conversion the items of the PA structure are mapped to the items of the background knowledge, which completes the conversion from the PA structure to the semantic structure.

Because the invention adopts the ontology to describe the semantic structure, the background knowledge is also expressed by the ontology, which can be referred to as the background knowledge ontology for short.

1.2 design of background ontology

The invention adds a predicate-argument structure (PA structure) as an intermediate structure in the conversion process from natural language to semantic structure. Namely, the knowledge construction of the present invention is a dynamic knowledge construction based on the PA structure. Therefore, a set of background ontology needs to be designed according to the characteristics of the PA structure, and the set of background ontology must satisfy the following three conditions.

1) Facilitating mapping of PA structure to knowledge structure;

2) can cover almost all PA structures;

3) the optimization of knowledge representation in the future can be facilitated.

The PA structure consists of predicates, semantic roles and arguments, and according to the characteristics of the PA structure, the ontology is divided into an event ontology, a semantic role ontology and an argument ontology.

The event class ontology corresponds to the predicates in the PA structure and describes verb concepts, such as "purchase", "go to …" and "sing". The event class ontology has a general event class, and all other events are subclasses of the general event class.

The argument class ontology corresponds to the argument in the PA structure, and the argument class ontology describes a part-of-speech concept, such as concepts of association, apple and today.

According to the semantic features of the PA structure, the argument class ontology can be divided into subclasses such as time, place, personal pronouns, direction, frequency and degree, which may be designed according to the domain that the background ontology covers.

A generic event class is first defined.

The predicate P is then defined along with its 6 core semantic roles. These semantics are all attributes of the event class.

The following are 15 additional semantics.

Semantic constraint relationships are defined, and LOC semantics is taken as an example. LOC represents the "location" semantics of predicates, and thus the value range of an LOC attribute is specified herein as a "location" class.

In this way, the above location events can be constructed based on the background knowledge defined above.

This approach is straightforward and can cover all PA structures. Of course, this is only a rudimentary solution when no corresponding specific event class is found.

For the event mentioned above, if the knowledge base contains a specific "meeting event" class, that class is a subclass of the general event class; relative to the general event defined above it is referred to herein as a special event, and it is defined according to the particular event it describes.

For example, the "meeting event" has two attributes, "meeting party" and "met party", whose domain is the "meeting event" class and whose range is the "name" class. The relevant definitions are as follows.

Thus, the "meeting" event mentioned above can be expressed as:

Compared with the representation based on the general event class, this representation better reflects the characteristics of the specific event, and it is the final semantic structure that the method is expected to generate.
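The difference between the two construction modes can be made concrete with plain subject-predicate-object triples. This is only an illustrative sketch: the original listings are not reproduced in this text, so the identifiers (`bg:Event`, `bg:P`, `ev:MeetingEvent`, `ev:meetingParty`, `ev:metParty`), the role labels and the sample names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def build_event(
    event_id: str,
    predicate: str,
    roles: Dict[str, str],                      # semantic role -> argument string
    event_class: Optional[str] = None,          # matched special event class, if any
    role_props: Optional[Dict[str, str]] = None,  # role -> attribute of the special class
) -> List[Triple]:
    """Build triples for one event in either the general or the special mode."""
    triples: List[Triple] = []
    if event_class is None:
        # General mode: the predicate is kept as a string and roles map one-to-one.
        triples.append((event_id, "rdf:type", "bg:Event"))
        triples.append((event_id, "bg:P", predicate))
        for role, value in roles.items():
            triples.append((event_id, f"bg:{role}", value))
    else:
        # Special mode: the class encodes the predicate, roles use class-specific attributes.
        props = role_props or {}
        triples.append((event_id, "rdf:type", event_class))
        for role, value in roles.items():
            triples.append((event_id, props.get(role, f"bg:{role}"), value))
    return triples


roles = {"A0": "Zhang San", "A1": "Li Si"}
print(build_event("ex:e1", "meet", roles))
print(build_event("ex:e1", "meet", roles, "ev:MeetingEvent",
                  {"A0": "ev:meetingParty", "A1": "ev:metParty"}))
```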

1.3 design of ontology repository

The concept of the knowledge base comes from the field of artificial intelligence. A knowledge base is a database that stores and processes knowledge. Its defining characteristic is that knowledge is processed for a given purpose and then stored in a structured form, which makes it easier to operate on and use. It is an important way of organizing knowledge.

Since the ontology proposed by the present invention is expressed by RDF/RDFS, the ontology repository proposed by the present invention is actually a database capable of storing and retrieving RDF/RDFS.

Many storage management systems for relatively mature RDF data sets have appeared, and they can be classified into a storage manner based on memory, a storage manner based on files, and a storage manner based on relational databases according to their implementation principles.

1) Memory-based storage mode

This approach first loads all data from the file system into memory at once; operations on the RDF data are then operations on an in-memory data structure, which is finally persisted back to the file system by writing it to a file. The approach is fast and easy to implement, but its scale is limited and it cannot store very large RDF data sets.

Eclipse RDF4J (formerly Sesame) currently implements this storage. The method usually adopts a data compression mode to save memory space so as to realize larger RDF data storage.

2) File-based storage mode

This approach is essentially a database designed specifically for the data characteristics of RDF. Such a database treats an RDF document as the basic logical storage unit, equivalent to a table in a relational database. Each statement in the RDF document corresponds to a row in a relational database table. Multiple RDF documents can form a collection, equivalent to a database in a relational database. Both RDF4J and Kowari implement this approach.

A statement is queried in a collection as follows.

Step 1: determine, from the index, the document in which the queried statement is located.

Step 2: query that document to find the statement and output the result, in a way similar to memory-based RDF storage.

If an RDF document is stored as a file, the query on the file becomes slow and inefficient when the document is very large, so this approach generally divides an RDF document into a plurality of files to be stored on the hard disk, and indexes the files to speed up the query.

3) Storage mode based on relational database

This approach uses a mature relational database to store RDF data. Many RDF storage systems are implemented in this manner, such as Jena, RStar, and 3store. It can rely on the relational database's organization and management facilities, transaction control, and SQL statements to hide complex underlying operations when implementing RDF queries and updates, so relational storage is a good approach to RDF storage; its implementation is described below.

Definition 1. An RDF data set can be represented as $R = (R_N, R_S, R_T)$, where $R_N$ denotes the defined namespaces; $R_S$ denotes the resource values, comprising the class resource set $C$, the instance resource set $O$, the attribute resource set $P$ and the literal constant set $L$; and $R_T$ denotes the statement values, a set of triples defined on the basis of the resource values $R_S$.

From Definition 1, an RDF data set contains namespaces, resources, literal constants, and triples defined on the basis of $R_S$. Before the relational storage method for RDF is described, these data are each assigned a table:

The RDFNameSpace table stores all namespaces; its columns are described in Table 1.

Table 1 Column description of the RDFNameSpace table

The RDFResource table stores all resources; its columns are described in Table 2.

Table 2 Column description of the RDFResource table

The RDFLiteral table stores all literal constants (literals); its columns are described in Table 3.

Table 3 Column description of the RDFLiteral table

The RDFStatement table stores all statements; its columns are described in Table 4.

Table 4 Column description of the RDFStatement table

Given RDF data $R = (R_N, R_S, R_T)$, the following rules describe how it is stored.

Rule 1 (storing namespaces): for given RDF data $R = (R_N, R_S, R_T)$, each namespace in $R_N$ is written to the RDFNameSpace table.

Rule 2 (storing class resources): for given RDF data $R = (R_N, R_S, R_T)$, each class element $c \in C$ of its class resource set $C$ is written to the RDFResource table.

Rule 3 (storing instance resources): for given RDF data $R = (R_N, R_S, R_T)$, each instance $o \in O$ of its instance resource set $O$ is written to the RDFResource table.

Rule 4 (storing predicates, that is, attributes): for given RDF data $R = (R_N, R_S, R_T)$, each attribute $p \in P$ of its attribute resource set $P$ is written to the RDFResource table.

Rule 5 (storing literal constants): for given RDF data $R = (R_N, R_S, R_T)$, each literal $l \in L$ of its literal constant set $L$ is written to the RDFLiteral table.

Rule 6 (storing statements): for given RDF data $R = (R_N, R_S, R_T)$, each statement in $R_T$ is written to the RDFStatement table.
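The six storage rules can be sketched with an in-memory SQLite database. The four table names follow the text, but the column layouts of Tables 1 to 4 are not reproduced here, so every column (`prefix`, `uri`, `name`, `kind`, `value`, `subject`, `predicate`, `object`) and the function name `store_rdf` are assumptions made for illustration.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def store_rdf(
    conn: sqlite3.Connection,
    namespaces: Dict[str, str],                     # R_N: prefix -> URI
    classes: Iterable[str],                         # C
    instances: Iterable[str],                       # O
    properties: Iterable[str],                      # P
    literals: Iterable[str],                        # L
    statements: Iterable[Tuple[str, str, str]],     # R_T
) -> None:
    cur = conn.cursor()
    # Hypothetical schemas; the source's column descriptions are not available.
    cur.executescript("""
        CREATE TABLE IF NOT EXISTS RDFNameSpace (id INTEGER PRIMARY KEY, prefix TEXT, uri TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS RDFResource  (id INTEGER PRIMARY KEY, name TEXT UNIQUE, kind TEXT);
        CREATE TABLE IF NOT EXISTS RDFLiteral   (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS RDFStatement (id INTEGER PRIMARY KEY, subject TEXT, predicate TEXT, object TEXT);
    """)
    # Rule 1: namespaces.
    cur.executemany("INSERT OR IGNORE INTO RDFNameSpace (prefix, uri) VALUES (?, ?)",
                    list(namespaces.items()))
    # Rules 2-4: class, instance and attribute resources share the resource table.
    for kind, names in (("class", classes), ("instance", instances), ("property", properties)):
        cur.executemany("INSERT OR IGNORE INTO RDFResource (name, kind) VALUES (?, ?)",
                        [(name, kind) for name in names])
    # Rule 5: literal constants.
    cur.executemany("INSERT OR IGNORE INTO RDFLiteral (value) VALUES (?)",
                    [(value,) for value in literals])
    # Rule 6: statements (triples).
    cur.executemany("INSERT INTO RDFStatement (subject, predicate, object) VALUES (?, ?, ?)",
                    list(statements))
    conn.commit()


conn = sqlite3.connect(":memory:")
store_rdf(conn, {"ex": "http://example.org/"}, ["ex:MeetingEvent"], ["ex:e1"],
          ["ex:meetingParty"], ["Zhang San"],
          [("ex:e1", "rdf:type", "ex:MeetingEvent"), ("ex:e1", "ex:meetingParty", "Zhang San")])
print(conn.execute("SELECT name, kind FROM RDFResource").fetchall())
```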

2. Ontology knowledge base dynamic construction technology

Because natural language is flexible and ambiguous, it is very difficult for a computer to understand. The essence of natural language processing by computer is to convert natural language into a formal semantic structure that the computer can read.

Most current natural language processing methods are based on grammar structures. Thanks to the painstaking research of generations of linguists, the grammatical phenomena of natural language have been systematically summarized, which is of great help to language learners. The grammar structure is a relatively regular structure and is also the basis for understanding natural language. Practitioners of natural language processing have therefore used the grammar structure as an intermediate structure between natural language and the semantic structure, which simplifies processing considerably, and grammar-based processing has achieved good results in practical applications. However, the grammar structure only describes the form of natural language; the gap between it and the semantic structure is large, converting to a semantic structure through the grammar structure alone is very complicated, and the results are not ideal.

To address the shortcomings of natural language processing based on the syntactic structure, the invention adds an intermediate medium, the predicate-argument structure (hereinafter the PA structure), between the syntactic structure and the semantic structure. The PA structure is a simplified form of semantic analysis: based on the syntactic structure, it labels certain constituents of a sentence as the semantic roles of a given verb, thereby building a shallow, formalized semantic structure. Compared with the grammar structure, it is closer to the semantic structure and relatively simple to construct, which makes it well suited to serve as the basis of the final semantic structure.

2.1 transformation of Natural language into PA Structure

The PA structure is a shallow semantic structure that lies between the syntactic structure and the semantic structure, so a mapping to the semantic structure is easier to establish from the PA structure than from the syntactic structure; this simplicity is the greatest advantage of the shallow semantic analysis mentioned above. Semantic role labeling is now relatively mature, so extracting semantic information through the PA structure is quite feasible. The method first converts the word-segmented natural language into a grammar structure, then converts the grammar structure into a PA structure, and then establishes a mapping between the PA structure and the final semantic structure through the ontology mapping algorithm introduced later, thereby completing the extraction of semantic information. The extraction process is shown in FIG. 1.
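The three-stage flow described above can be sketched as a small pipeline in which each stage is a pluggable function. The `PAStructure` data class, the function `extract` and its three callable parameters are hypothetical names; the sketch does not assume any particular parser or semantic role labeler.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class PAStructure:
    """One predicate with its labeled arguments (semantic role -> argument string)."""
    predicate: str
    arguments: Dict[str, str] = field(default_factory=dict)


def extract(
    sentence: str,
    parse: Callable[[str], Any],                            # stage 1: text -> grammar structure
    label_roles: Callable[[Any], List[PAStructure]],        # stage 2: grammar -> PA structures
    map_to_ontology: Callable[[PAStructure], List[Triple]], # stage 3: PA -> semantic triples
) -> List[Triple]:
    """Run the three stages in order and collect the resulting triples."""
    grammar = parse(sentence)
    triples: List[Triple] = []
    for pa in label_roles(grammar):
        triples.extend(map_to_ontology(pa))
    return triples
```

Keeping the PA structure as an explicit intermediate type means the third stage can be replaced (for example, general-event mapping versus special-event mapping) without touching the first two.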

2.2 term similarity calculation based on synonym forest expansion edition

There are two main types of semantic similarity calculation: tree-based algorithms and corpus-based algorithms. The corpus-based method is relatively objective, but it depends on the corpus used for training and is strongly affected by data sparseness and data noise. The tree-based method is simpler than the corpus-based method, but it relies on a well-constructed tree structure; such structures are usually defined manually and are easily influenced by the subjectivity of their creators.

The key point of the knowledge ontology mapping lies in synonym query, synonym matching directly influences the correctness of the knowledge ontology mapping, and related words can be well found in a word similarity calculation mode based on a corpus, but the effect of searching and judging synonyms is not particularly ideal. Therefore, in order to better match synonyms, the method adopts a semantic similarity calculation mode based on the tree, and judges the synonyms more accurately.

Tree-based word similarity is generally calculated from a manually constructed, tree-structured semantic dictionary, and the result depends on the accuracy of that dictionary. For Chinese, the available dictionaries include HowNet and the Synonym Forest Extended Edition. Because the Synonym Forest Extended Edition is a pure synonym dictionary, it exactly meets the needs of this work and is relatively simple, so it is chosen as the semantic dictionary for calculating semantic similarity.

1) Dictionary introduction of synonym forest expansion edition

The synonym forest is compiled in 1983 by the Meretrix foals and the Zhu-Yiming, and is the first Chinese synonym dictionary in China. The dictionary includes not only a synonym set of words, but also a set of a certain number of similar words, i.e., related words. In recent years, it has been held by universities and institutions to build synonym libraries for chinese, which have been widely used as a "synonym forest extension" dictionary redacted by the information retrieval laboratory of the university of harbin industries. The board contains nearly 7 ten thousand words, all arranged according to meanings, and is a synonymy dictionary.

The synonym forest expansion edition organizes words in a tree hierarchy. The vocabulary is divided into large, medium, and small classes: there are 12 large classes, 97 medium classes, and 1400 small classes. Each small class contains many words, which are divided into several word groups (paragraphs) according to the closeness and relevance of their senses. The words in each word group are further divided into rows, and the words in the same row either have the same sense or are strongly related in sense.

The synonym forest provides 5 levels of coding: level 1 is represented by an uppercase English letter; level 2 by a lowercase English letter; level 3 by a two-digit decimal integer; level 4 by an uppercase English letter; and level 5 by a two-digit decimal integer. For example:

Ad02D01= Chinese people, Yanhuang offspring, Tang

Ae07A05# vegetable farmer Cotton farmer tea farmer tobacco farmer sugarcane farmer flower farmer pesticide forest farmer ginger farmer fisher mushroom farmer jujube farmer wheat farmer silkworm farmer fruit farmer melon farmer

Ad02F01@ extraterrestrial

The hierarchy and coding scheme are shown in Table 5.

Table 5. Coding scheme of the synonym forest

At level 5, some rows contain synonyms, some contain related words, and some contain only one word, so the classification result needs to be qualified by distinguishing these 3 cases. The cases are distinguished by special symbols, so the 8th position carries one of 3 marks: "=" represents equal, i.e. synonymous; "#" represents unequal but of the same category, i.e. related words; "@" represents self-contained and independent, meaning the word has neither synonyms nor related words in the dictionary.

In the examples above, "Chinese people", "Yanhuang offspring", and "Tang" are synonyms, so the mark is "="; vegetable farmers, cotton farmers, tea farmers, tobacco farmers, sugarcane farmers, and flower farmers are subcategories of farmers, so the mark is "#", representing the same category.
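The coding scheme described above can be illustrated with a small parser. This is a minimal Python sketch, not part of the original prototype; the names `CilinCode`, `MARKER_MEANING`, and `parse_cilin_code` are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class CilinCode(NamedTuple):
    level1: str  # large class: one uppercase letter
    level2: str  # medium class: one lowercase letter
    level3: str  # small class: two decimal digits
    level4: str  # word group: one uppercase letter
    level5: str  # row within the word group: two decimal digits
    marker: str  # eighth character: "=", "#" or "@"


# Meaning of the eighth character, as described in the text.
MARKER_MEANING = {"=": "synonyms", "#": "related words", "@": "isolated word"}


def parse_cilin_code(code: str) -> CilinCode:
    """Split an 8-character code such as 'Ad02D01=' into five levels and a marker."""
    if len(code) != 8:
        raise ValueError(f"expected 8 characters, got {len(code)}: {code!r}")
    parsed = CilinCode(code[0], code[1], code[2:4], code[4], code[5:7], code[7])
    well_formed = (
        parsed.level1.isupper() and parsed.level2.islower()
        and parsed.level3.isdigit() and parsed.level4.isupper()
        and parsed.level5.isdigit() and parsed.marker in MARKER_MEANING
    )
    if not well_formed:
        raise ValueError(f"malformed code: {code!r}")
    return parsed
```

For instance, `parse_cilin_code("Ae07A05#")` separates the small class `07` from the row `05` and exposes the marker `#`, which `MARKER_MEANING` reads as related words.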

2) Algorithm design for calculating word similarity based on synonym forest expansion version synonym dictionary

As described above, the synonym forest is a tree structure five levels deep, and an 8-character code represents the position of each sense in the structure. Word similarity calculation based on the synonym forest expansion edition is therefore a tree-based semantic similarity calculation.

Tree-based semantic similarity depends on the length of the path between two nodes and on the depth at which the nodes are located. The invention designs an algorithm based on this idea.

First, find the level of the synonym forest at which the two word nodes branch, i.e. the level at which the codes of the two words first differ. For example, Ad02D01= and Ae07A05# branch at level 2, while Ad02D01= and Ad02D02# branch at level 5. Then multiply the coefficient corresponding to the branch level by a control parameter (n-k+1)/n, where n is the total number of nodes in the branch layer and k is the distance between the two branches. This refines a result that would otherwise take only a few discrete values and makes the calculation more precise. The mathematical description of the algorithm is as follows.

Determine whether the two words A and B to be compared are on the same tree. If they are not, the similarity is Sim(A, B) = f, where f is a preset constant. If they are, further determine the level at which A and B branch; when the branch level is not greater than 5, determine the value of the preset coefficient λ according to the branch level and substitute it into the following formula to calculate the similarity Sim(A, B):

Further, the algorithm is illustrated in tabular form as follows:

For two words A and B, the similarity is denoted Sim(A, B), and a, b, c, d, e, f are constants.

If A, B are not on the same tree:

Sim(A,B)＝f (1)

If A and B are on the same tree:

for a branch at the second level, the coefficient is a;

for a branch at the third level, the coefficient is b;

for a branch at the fourth level, the coefficient is c;

for a branch at the fifth level, the coefficient is d.

For example, to compare "acquisition" and "purchase", first look up the positions of the two words in the synonym forest:

He03A02= acquisition purchase-disconnection acquisition and purchase recovery

He03A01= purchase order purchase request sale market-in redemption wharf purchase order device purchase order purchase order

The code of "acquisition" is He03A02 and the code of "purchase" is He03A01. The two codes branch at the fifth level, so the similarity of the two words is calculated with the fifth-level coefficient d,

where n is 5 and k is 1.

This algorithm considers only branches. If the two words A and B are in the same row of the synonym dictionary, their similarity is determined by the symbol in the eighth position of the code: when the symbol is "=", Sim(A, B) = 1; when the symbol is "#", Sim(A, B) = e.

For example, for two words in the row coded He03A02=, the eighth position is "=", so their similarity is 1. Likewise, "African" and "Asian" are both coded Ad02B04#, whose eighth position is "#", so Sim(African, Asian) = e.

After repeated testing and manual evaluation, the initial values were set to a = 0.64, b = 0.8, c = 0.9, d = 0.95, e = 0.5, and f = 0.1.
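The branch-level rule and the parameter values above can be put together in a short Python sketch. It is illustrative only and rests on two labeled assumptions: the similarity formula is not reproduced in this text, so the sketch follows the prose literally (branch-level coefficient multiplied by the control parameter (n-k+1)/n), and the distance k is taken as the difference between the ordinals of the two differing code segments. The layer size n has to come from the dictionary, so the caller supplies it. All function and constant names are hypothetical.

```python
from typing import List

# Initial parameter values reported in the text.
LEVEL_COEFF = {2: 0.64, 3: 0.8, 4: 0.9, 5: 0.95}  # a, b, c, d by branch level
E_SAME_ROW_RELATED = 0.5   # e: same row, marker "#"
F_DIFFERENT_TREE = 0.1     # f: codes already differ at level 1


def split_levels(code: str) -> List[str]:
    """Split an 8-character code into its five level segments (marker dropped)."""
    return [code[0], code[1], code[2:4], code[4], code[5:7]]


def ordinal(segment: str) -> int:
    """Position of a segment within its layer.

    Assumption: sibling nodes are numbered or lettered consecutively, so the
    distance k between two branches is the difference of their ordinals.
    """
    return int(segment) if segment.isdigit() else ord(segment.upper()) - ord("A") + 1


def cilin_similarity(code_a: str, code_b: str, n: int) -> float:
    """Similarity of two codes; n is the number of nodes in the branch layer."""
    pairs = zip(split_levels(code_a), split_levels(code_b))
    for level, (seg_a, seg_b) in enumerate(pairs, start=1):
        if seg_a != seg_b:
            if level == 1:
                return F_DIFFERENT_TREE  # not on the same tree
            k = abs(ordinal(seg_a) - ordinal(seg_b))
            # Assumed formula: coefficient times the control parameter (n-k+1)/n.
            return LEVEL_COEFF[level] * (n - k + 1) / n
    # All five levels agree: the words share a row, so the eighth character decides.
    marker = code_a[7]
    if marker == "=":
        return 1.0
    if marker == "#":
        return E_SAME_ROW_RELATED
    raise ValueError("'@' rows hold a single word; this case is not covered by the text")
```

Under this assumed formula, the acquisition/purchase example (He03A02 and He03A01, n = 5, k = 1) gives 0.95 × (5 − 1 + 1)/5 = 0.95. If the source formula contains further factors, the value will differ.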

Based on the design of the algorithm and its idea, a flow chart is shown in FIG. 2.

2.3 matching of words to ontologies

As introduced above, the invention first converts the text into a predicate-argument structure (PA structure) and then maps the PA structure to the structure of the ontology knowledge. The PA structure results from labeling natural language, so its predicate and arguments are essentially individual words. The essence of dynamically constructing ontology knowledge is therefore the mapping between the PA structure and the background knowledge ontology, and the key to that is the mapping between words and the ontology.

The design of the ontology is explained above. To make it convenient to map character strings to the background knowledge ontology, the invention extends the background knowledge ontology by first defining a text attribute.

The invention specifies that each class has a text attribute whose value is a description string for the class. The string can contain several synonyms separated by commas, as in the definitions of the classes for person A and person B.

Therefore, matching a class only requires matching its text attribute value, which converts the mapping between character strings and the ontology into a mapping between words. To map words, the word similarity is calculated first and compared with a preset threshold; if the similarity is greater than the threshold, the two words are considered to match. The value of the threshold is therefore very important to the correctness of the ontology mapping.
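The text-attribute idea can be sketched as follows. This is a hedged Python illustration, not the prototype's code: the similarity function is passed in as a parameter, ASCII commas are assumed as the separator, and the names `text_labels` and `word_matches_class` are hypothetical.

```python
from typing import Callable, List


def text_labels(text_attribute: str) -> List[str]:
    """Split a comma-separated text attribute value into individual words."""
    return [w.strip() for w in text_attribute.split(",") if w.strip()]


def word_matches_class(
    word: str,
    text_attribute: str,
    sim: Callable[[str, str], float],
    threshold: float,
) -> bool:
    """True if some word in the class's text attribute is similar enough to `word`.

    Following the text, a match requires the similarity to exceed the threshold.
    """
    return any(sim(word, label) > threshold for label in text_labels(text_attribute))
```

Because the threshold is the only tunable value here, raising it trades missed matches for fewer false matches, which is why the text stresses its importance.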

2.4 mapping of PA Structure to background knowledge ontology

As introduced above, the PA structure is composed of a predicate and several semantic roles and can be represented as PA = <p, Sp>, where p denotes the predicate, which is usually a verb, and Sp denotes the semantic roles of the predicate p. Each semantic role corresponds to an argument, denoted As. For example, in the sentence "person A meets person B in the people's hall", the predicate is "meet" and the semantic roles are A0, A1, and LOC: A0 corresponds to the argument "person A", A1 to the argument "person B", and LOC to the argument "people's hall". The PA structure is a tree with the predicate p as the root node and the semantic roles as child nodes, where each branch of the root node represents a different semantic role, as shown in FIG. 3.

As described above, the invention designs the background knowledge ontology, comprising event classes and argument classes, according to the characteristics of the PA structure. The semantic structure converted from the PA structure is represented in RDFS format. An RDFS axiom is a triple, and RDF data composed of axioms can be represented as a labeled graph.

If each branch of the root node of the PA tree is regarded as a triple <s, p, o>, the semantic structure converted from the PA structure is a set of triples, in which s of each triple corresponds to the predicate of the PA structure, p to a semantic role, and o to the argument under that semantic role.
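The branch-to-triple correspondence can be shown with the example sentence used earlier. This Python sketch works on plain strings before any ontology matching takes place; `pa_to_triples` is a hypothetical helper, and a dictionary stands in for the PA structure's key-value storage.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def pa_to_triples(predicate: str, roles: Dict[str, str]) -> List[Triple]:
    """One <s, p, o> triple per branch: s = predicate, p = semantic role, o = argument."""
    return [(predicate, role, argument) for role, argument in roles.items()]


# "person A meets person B in the people's hall"
triples = pa_to_triples(
    "meet",
    {"A0": "person A", "A1": "person B", "LOC": "people's hall"},
)
```

Each of the three resulting triples shares the predicate as its subject, which mirrors the PA tree's single root with one branch per semantic role.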

From this analysis, mapping the PA structure to the background knowledge ontology mainly comprises 3 tasks: matching predicates, matching semantic roles, and matching arguments.

1) Matching of predicates

Predicates correspond to event classes in the background knowledge ontology. Unlike traditional word-to-ontology matching, and to make the whole matching as efficient as possible, queries to the knowledge base and word similarity comparisons must be reduced as much as possible. To this end, the invention designs the following algorithm.

First, search the event classes in the background knowledge ontology for an event class whose text attribute value is the predicate. If one exists, the predicate is matched to that event class; if not, continue to the next step.

Next, search the event classes in the background knowledge ontology for an event class whose text attribute value is a synonym of the predicate. If one exists, the predicate is matched to that event class; if not, continue to the next step.

Finally, calculate the similarity between the predicate and each event class in the background knowledge ontology, and for each event class whose similarity is greater than a threshold T, add the event class and its similarity to the set to be matched. Then take the event class with the maximum similarity from the set to be matched. If its similarity is greater than the threshold T, the matching succeeds; otherwise the matching fails and the construction mode based on the general event class is adopted. The algorithm flow is shown in FIG. 4.
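The three stages above (exact lookup, synonym lookup, thresholded similarity) can be sketched in Python as shown below. The sketch is illustrative and not the prototype's code: the ontology is reduced to a dictionary from event class name to text attribute, the synonym lookup and similarity function are injected as parameters, and every name is hypothetical. Returning `None` signals that the caller should fall back to the general event class.

```python
from typing import Callable, Dict, List, Optional, Set


def match_event_class(
    predicate: str,
    event_text: Dict[str, str],           # event class name -> comma-separated text attribute
    synonyms: Callable[[str], Set[str]],  # hypothetical synonym lookup
    sim: Callable[[str, str], float],     # word similarity function
    threshold: float,
) -> Optional[str]:
    labels: Dict[str, List[str]] = {
        name: [w.strip() for w in text.split(",") if w.strip()]
        for name, text in event_text.items()
    }
    # Stage 1: the predicate itself appears in a class's text attribute.
    for name, words in labels.items():
        if predicate in words:
            return name
    # Stage 2: a synonym of the predicate appears in a class's text attribute.
    predicate_synonyms = synonyms(predicate)
    for name, words in labels.items():
        if predicate_synonyms.intersection(words):
            return name
    # Stage 3: collect classes whose similarity exceeds the threshold, take the best.
    candidates: Dict[str, float] = {}
    for name, words in labels.items():
        score = max((sim(predicate, w) for w in words), default=0.0)
        if score > threshold:
            candidates[name] = score
    if candidates:
        return max(candidates, key=lambda name: candidates[name])
    return None  # no match: caller uses the general event class
```

Ordering the stages from cheapest to most expensive means the similarity computation, which touches every event class, runs only when the two direct lookups fail.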

2) Matching of semantic roles

Semantic roles correspond to attributes in the background ontology.

If the construction mode based on the general event class is adopted, the semantic roles of the PA structure correspond one-to-one to the attributes of the background knowledge ontology.

If the construction mode based on a specific event class is adopted, the semantic roles of the PA structure correspond one-to-one to the parent attributes of the background knowledge ontology. The algorithm flow chart is shown in FIG. 5.

3) Matching of arguments

The argument corresponds to the argument class in the background ontology.

Argument matching is similar to predicate matching: it is performed according to semantic similarity and a threshold. The matching algorithm is as follows.

First, search the argument classes in the background knowledge ontology for an argument class whose text attribute value is the argument. If one exists, the argument is matched to that argument class; if not, continue to the next step.

Next, search the argument classes in the background knowledge ontology for an argument class whose text attribute value is a synonym of the argument. If one exists, the argument is matched to that argument class; if not, continue to the next step.

Finally, calculate the similarity between the argument and each argument class in the background knowledge ontology, and for each argument class whose similarity is greater than a threshold T, add the argument class and its similarity to the set to be matched. Then take the argument class with the maximum similarity from the set to be matched. If its similarity is greater than the threshold T, the matching succeeds; otherwise the matching fails and the argument is represented by its argument character string.

3. Design and implementation of prototype systems

3.1 mapping of PA Structure to background knowledge ontology

As described above, the overall system is divided into 2 main modules: the PA structure and background knowledge ontology mapping module, and the knowledge base module.

The PA structure and background knowledge ontology mapping module converts the PA structure into a semantic structure and stores it in the knowledge base; this module generates a large number of knowledge base access operations.

The knowledge base module stores, queries, and modifies RDF data and provides an access interface for convenient use. The overall class diagram of the prototype is shown in FIG. 6.

The PA class encapsulates the PA structure and stores it as key-value pairs using a Java Map. The TextPreprocessing class is the text preprocessing module. KBUtils and kbprox implement the KBUtilsDao interface, and DaoFactory is a factory class that produces KBUtilsDao objects; together they encapsulate access to and operations on the knowledge base.

3.2 implementation of PA Structure and background knowledge ontology mapping Module

1) Implementation of word similarity calculation algorithm

Word similarity calculation is implemented first. It is based on the synonym forest expansion edition thesaurus and underlies predicate matching, semantic role matching, and argument matching. The idea and flow chart are given above, and the algorithm is implemented as follows:

2) Implementation of predicate and argument matching algorithm

Matching predicates, semantic roles, and arguments mainly involves matching words against the background ontology. As described above, to achieve this matching, each resource of the background knowledge ontology is given a text attribute that serves as its description string; the attribute value consists of several words separated by commas. Matching a word against the background ontology is thereby converted into matching between words, which is implemented on the basis of the word similarity calculation algorithm implemented above.

Let the threshold be T, the word being matched be word1, and the candidate word be word2. If the similarity between word1 and word2 is greater than or equal to T, word2 is added to the set W of candidate matches for word1. The algorithm is implemented as follows:

The idea behind predicate matching is introduced above, and predicate matching and argument matching are essentially the same, so the two are abstracted into a single function:

3) Implementation of semantic role matching algorithm

The idea behind semantic role matching is introduced above, and the algorithm is implemented accordingly:

This embodiment converts the segmented natural language text to be processed into a grammar structure; converts the grammar structure into a predicate-argument structure; and, based on a pre-constructed background knowledge ontology, establishes a mapping between the predicate-argument structure and the final semantic structure through a preset ontology mapping algorithm, completing the extraction of semantic information. On top of the original semantic information extraction, the predicate-argument structure (PA structure) is introduced as an intermediate form between the grammar structure and the semantic structure, which greatly reduces program complexity and makes multiple mapping modes possible for the system. The mapping algorithm designed by the invention reduces knowledge base queries and word similarity comparisons as much as possible, which reduces the time complexity of the algorithm.

Furthermore, it should be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A dynamic ontology knowledge base construction method is characterized by comprising the following steps:

2. The method for dynamically building ontology knowledge base according to claim 1, wherein the background knowledge ontology comprises an event class ontology, a semantic role class ontology and an argument class ontology; the event class ontology corresponds to a predicate in a predicate-argument structure; the argument class ontology corresponds to arguments in the predicate-argument structure.

3. The method for dynamically building ontology knowledge base according to claim 2, wherein the event class ontology comprises two event classes, namely a generic event class and a special event class.

4. The method for dynamically building ontology knowledge base according to claim 1, wherein the semantic structure is represented by Resource Description Framework (RDF) based on ontology theory.

5. The method for dynamically building ontology knowledge base according to claim 1, wherein after the extraction of semantic information is completed, the method for dynamically building ontology knowledge base further comprises:

using the Eclipse RDF4J framework to implement storage management of the semantic information and construct the ontology knowledge base; and encapsulating the access operations of the ontology knowledge base based on the API provided by the RDF4J development kit.

6. The method for dynamically building ontology knowledge base according to claim 1, wherein each class in the background knowledge ontology has a text attribute, and the attribute value is a description string corresponding to the class; the ontology mapping algorithm comprises a predicate matching algorithm, a semantic role matching algorithm and an argument matching algorithm.

7. The method for dynamically building an ontology knowledge base according to claim 6, wherein the predicate matching algorithm comprises the following steps:

8. The method for dynamically building ontology knowledge base according to claim 7, wherein the semantic role matching algorithm comprises:

9. The method for dynamically building ontology knowledge base according to claim 8, wherein the argument matching algorithm comprises:

10. The method for dynamically constructing ontology knowledge base according to any one of claims 7 to 9, wherein the preset word similarity algorithm is a similarity algorithm based on synonym forest expansion, and comprises the following steps:

Sim(A,B)＝f

wherein f is a preset constant;