CN110717034A

CN110717034A - Ontology construction method and device

Info

Publication number: CN110717034A
Application number: CN201810670149.9A
Authority: CN
Inventors: 展丽霞; 邵勇; 王圣
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2018-06-26
Filing date: 2018-06-26
Publication date: 2020-01-21
Also published as: WO2020001373A1

Abstract

An embodiment of the invention provides an ontology construction method comprising: acquiring text data to be processed; extracting entity data and event data from the text data to be processed; predicting entity relationships among the entity data based on the text data to be processed; and performing semantic analysis on the event data and, based on the analysis result, generating an event system network composed of the event relationships among the event data, thereby obtaining an ontology comprising the entity data, the event data, the entity relationships, and the event system network. The ontology constructed in this scheme therefore comprises entity data, event data, entity relationships, and an event system network, and is more complete than an ontology constructed by existing schemes.

Description

Ontology construction method and device

Technical Field

The invention relates to the technical field of computer applications, and in particular to an ontology construction method and device.

Background

In the field of data processing, organizing concrete real-world objects into a data model supported by a particular database generally requires first abstracting those objects into an information structure. This information structure does not depend on any specific computer system and is not itself a data model supported by a particular database; it is a concept-level model, called a conceptual model. A conceptual model that is commonly recognized among users is called a shared conceptual model.

An ontology is an explicit formal specification of a shared conceptual model. In short, an ontology is a precise mathematical description of a conceptual model that users can treat as a consensus, so that a more intelligent knowledge graph can be provided to them.

The ontology may include entities, events and relationships, wherein the entities refer to some concepts with recognized meanings, such as names of people and places; the event generally comprises event participation objects, occurrence time, occurrence means, occurrence place and the like; relationships refer to associations between entities, such as person-to-person employment relationships.

The ontology constructed by the existing scheme includes entities, events, and relationships between the entities.

Disclosure of Invention

The embodiment of the invention aims to provide an ontology construction method, so that an ontology obtained by construction is more complete. The specific technical scheme is as follows:

the embodiment of the invention provides an ontology construction method, which comprises the following steps:

acquiring text data to be processed;

extracting entity data and event data from the text data to be processed;

predicting entity relationships among the entity data based on the text data to be processed;

and performing semantic analysis on the event data, and generating an event system network consisting of event relations among the event data based on an analysis result to obtain an ontology comprising the entity data, the event data, the entity relations and the event system network.

Optionally, the extracting entity data from the text data to be processed includes:

for each word in the text data to be processed, determining the part of speech of the word by performing corpus tagging on the word;

screening out the words whose part of speech is noun and which carry semantic information, as words to be processed;

screening out words which do not exist in a preset dictionary from the words to be processed to serve as candidate entity data;

and extracting the candidate entity data according to a preset entity extraction rule to obtain entity data.

Optionally, for each word, determining a part of speech of the word by performing corpus tagging on the word, includes:

for each word, acquiring the transition probability, the state probability, and the feature weight of the word from a pre-acquired feature template library; calculating, from the transition probability, the state probability, and the feature weight, the probability that the word is each of the different parts of speech; and taking the part of speech that meets a preset first probability condition as the part of speech of the word.

Optionally, the extracting event data from the text data to be processed includes:

identifying candidate event data from the text data to be processed; wherein the candidate event data comprises one or more of: the occurrence time, the participants, the evolution state of the event, the occurrence environment of the event and the occurrence conditions of the event;

and screening the identified candidate event data according to a preset event extraction rule, and taking the screened candidate event data as event data.

Optionally, the predicting the entity relationship between the entity data based on the text data to be processed includes:

marking the syntactic components of each word in the text data to be processed by using a syntactic structure model obtained by pre-training;

predicting the semantic role of each word by utilizing a semantic role labeling model obtained by pre-training according to the syntactic component of each labeled word;

and determining the semantic roles of the extracted entity data according to the predicted semantic roles of each word, and analyzing the entity relationship among the entity data.

Optionally, the syntactic structure model is obtained by training through the following steps:

acquiring first sample data;

inputting the first sample data into a preset first training model, and obtaining an output result comprising an initial probability vector, a transition matrix and a state matrix of syntactic components of each word in the first sample data;

judging whether the output result meets a preset condition, if not, performing iterative adjustment on the preset first training model until the output result meets the preset condition to obtain the syntactic structure model;

the method for marking the syntactic component of each word in the text data to be processed by using the syntactic structure model obtained by pre-training comprises the following steps:

inputting the text data to be processed into the syntactic structure model to obtain an initial probability vector, a transition matrix and a state matrix of each word;

and marking the syntactic components of each word in the text data to be processed according to the initial probability vector, the transition matrix and the state matrix of each word.

Optionally, before predicting the semantic role of each word by using a pre-trained semantic role labeling model according to the syntactic component of each labeled word, the method further includes:

eliminating ambiguity of the marked syntactic component of each word in the text data to be processed to obtain a corrected syntactic component of each word;

the method for predicting the semantic role of each word by utilizing a semantic role labeling model obtained by pre-training according to the syntactic component of each labeled word comprises the following steps:

and predicting the semantic role of each word by utilizing a semantic role labeling model obtained by pre-training according to the corrected syntactic component of each entity data.

Optionally, the predicting the semantic role of each word by using a pre-trained semantic role labeling model according to the syntactic component of each labeled word includes:

inputting the text data to be processed, with its syntactic components labeled, into the semantic role labeling model obtained by pre-training; calculating, for each word whose syntactic component is labeled as a predicate, the probabilities of the various semantic roles that may exist between that word and the other words; and taking the semantic role that meets a preset second probability condition as the semantic role between the predicate word and the other word.

Optionally, the semantic role labeling model is obtained by training through the following steps:

acquiring second sample data;

analyzing the acquired second sample data; wherein the analytical process comprises one or more of: word segmentation processing, part of speech tagging processing and syntactic analysis processing;

deleting data which cannot serve as semantic roles in the analyzed second sample data according to a preset deleting rule to obtain training data;

and training a preset second training model by using the training data to obtain a semantic role labeling model.

Optionally, the performing semantic analysis on the event data, and generating an event system network composed of event relationships among the event data based on an analysis result includes:

performing semantic analysis on the event data, and constructing an event occurrence sequence based on an analysis result;

determining event relationships among the event data based on the event occurrence sequence, and generating an event system network composed of the event relationships among the event data; wherein the event relationships comprise one or more of: causal relationships, concomitant relationships, and succession (follow-on) relationships.

Optionally, the performing semantic analysis on the event data, and constructing an event occurrence sequence based on an analysis result includes:

for each piece of event data, performing semantic parsing on the event data to determine its semantics;

and determining the occurrence sequence of each event data according to the semantics of each event data, and constructing an event occurrence sequence according to the occurrence sequence.

Optionally, the determining event relationships among the event data based on the event occurrence sequence, and generating an event system network composed of the event relationships among the event data, includes:

constructing a directed acyclic graph according to the event occurrence sequence;

calculating event transition probability among event data based on the directed acyclic graph;

and determining the event relation among the event data according to the event transition probability among the event data, and generating an event system network consisting of the event relation among the event data.
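As a non-limiting illustration of the three steps above, the following Python sketch builds a directed graph from event occurrence sequences, checks that it is acyclic, and estimates event transition probabilities by counting. The function name `build_event_network`, the sample event labels, and the counting-based estimate are assumptions made for illustration; the embodiment does not prescribe a particular estimator or a rule for mapping probabilities to relationship types.

```python
from collections import Counter
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+


def build_event_network(sequences):
    """Return (topological order, transition probabilities) for event sequences.

    `sequences` is a list of event occurrence sequences, each a list of event
    labels in the order in which the events occurred.
    """
    edge_counts, out_counts = Counter(), Counter()
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            edge_counts[(src, dst)] += 1
            out_counts[src] += 1

    # graphlib expects a mapping: node -> set of predecessor nodes.
    preds = {}
    for src, dst in edge_counts:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())

    try:
        order = list(TopologicalSorter(preds).static_order())
    except CycleError as exc:
        raise ValueError("event sequences do not form a directed acyclic graph") from exc

    # Assumed estimator: P(dst | src) = count(src -> dst) / count(src -> anything).
    probs = {edge: n / out_counts[edge[0]] for edge, n in edge_counts.items()}
    return order, probs


sequences = [
    ["quarrel", "attack", "arrest"],
    ["quarrel", "attack", "hospitalization"],
    ["quarrel", "reconciliation"],
]
order, probs = build_event_network(sequences)
```

Under these assumptions, an edge with a high transition probability could be treated as a candidate for a strong event relationship, with the actual decision rule left to the preset configuration.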

Optionally, the method further includes:

evaluating the ontology by using a preset evaluation rule to obtain an evaluation result;

judging, according to the evaluation result, whether the ontology meets a preset expected condition;

and if the expected condition is met, displaying the ontology.

Optionally, the obtaining an ontology including the entity data, the event data, the entity relationship, and the event system network includes:

obtaining an ontology template;

and mapping the entity data, the event data, the entity relationship and the event system network to the ontology template to obtain an ontology.

Optionally, after the entity data, the event data, the entity relationship, and the event system network are mapped to an original ontology template to obtain an ontology, the method further includes:

updating the ontology template to obtain a new ontology template;

and mapping the entity data, the event data, the entity relation and the event system network to the new ontology template to obtain a new ontology.

An embodiment of the present invention further provides an ontology constructing device, where the ontology constructing device includes:

the data acquisition module is used for acquiring text data to be processed;

the information extraction module is used for extracting entity data and event data from the text data to be processed;

the entity relationship extraction module is used for predicting the entity relationship among the entity data based on the text data to be processed;

and the event relation extraction module is used for performing semantic analysis on the event data, generating an event system network consisting of event relations among the event data based on an analysis result, and obtaining an ontology comprising the entity data, the event data, the entity relations and the event system network.

Optionally, the information extraction module is specifically configured to:

Optionally, the entity relationship extracting module is specifically configured to:

acquiring first sample data;

the entity relationship extraction module is specifically configured to:

inputting the text data to be processed into the syntactic structure model to obtain an initial probability vector, a transition matrix and a state matrix of each word; and marking the syntactic components of each word in the text data to be processed according to the initial probability vector, the transition matrix and the state matrix of each word.

Optionally, the entity relationship extracting module is further configured to:

the entity relationship extraction module is specifically configured to:

acquiring second sample data;

Optionally, the event relationship extraction module is specifically configured to:

Optionally, the apparatus further comprises:

the ontology evaluation module is used for evaluating the ontology by using a preset evaluation rule to obtain an evaluation result; judging, according to the evaluation result, whether the ontology meets a preset expected condition; and if the expected condition is met, displaying the ontology.

Optionally, the apparatus further comprises:

the template mapping module is used for acquiring an ontology template; and mapping the entity data, the event data, the entity relationship and the event system network to the ontology template to obtain an ontology.

Optionally, the template mapping module is further configured to:

updating the ontology template to obtain a new ontology template;

The embodiment of the invention also provides electronic equipment which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;

a memory for storing a computer program;

and a processor for implementing any one of the above-described ontology construction methods when executing the program stored in the memory.

Embodiments of the present invention further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute any one of the above ontology construction methods.

The ontology construction method and device provided by the embodiments of the invention extract entity data and event data from the text data to be processed, predict the entity relationships among the extracted entity data, perform semantic analysis on the extracted event data, and generate, based on the analysis result, an event system network composed of the event relationships among the event data, thereby obtaining an ontology comprising the entity data, the event data, the entity relationships, and the event system network. The ontology constructed in this scheme therefore comprises entity data, event data, entity relationships, and an event system network, and is more complete than an ontology constructed by existing schemes. Of course, it is not necessary for any product or method practicing the invention to achieve all of the above advantages at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of an ontology construction method according to an embodiment of the present invention;

Fig. 2 is another schematic flow chart of an ontology construction method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an ontology constructing apparatus according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the prior art, a constructed ontology includes entities, events, and relationships between entities. An entity refers to a concept with a recognized meaning, such as a person name or a place name; an event generally comprises participating objects, an occurrence time, an occurrence means, an occurrence place, and the like; an entity relationship refers to an association between entities, such as an employment relationship between two people.

Compared with the prior art, the embodiment of the invention provides the ontology construction method, and a computer, a server or other electronic equipment can construct the ontology by using the method.

In addition to entity data, event data, and relationships between entities, the ontology constructed by the method includes an event system network, which can embody the relationships between events. For example, a causal relationship may exist between event A and event B, or event B may be caused only by the occurrence of event A.

The ontology construction method provided by the embodiments of the present invention is generally described below.

Acquiring text data to be processed;

extracting entity data and event data from the text data to be processed;

Therefore, the ontology constructed in this scheme comprises the entity data, the event data, the entity relationships, and the event system network, and is more complete than an ontology constructed by existing schemes.

The ontology construction method provided by the embodiment of the invention will be described in detail through specific embodiments.

As shown in fig. 1, a schematic flow chart of an ontology construction method provided in the embodiment of the present invention includes the following steps:

S101: Acquire text data to be processed.

Sometimes a user needs to organize and summarize information in a certain field, or to query specific information in that field. For example, in the field of interpersonal communication, the user may need to know the relationship network between people, or the group of people who have been in contact with a certain person. The information in a field is generally derived from a large amount of raw text data, and processing that data manually takes a great deal of time and effort.

In this case, an ontology in the field can be constructed, and through the ontology in the field, organization and analysis of various information in the field can be conveniently realized, and functions such as information query can be provided for a user.

When the ontology is constructed, text data to be processed can be obtained first, wherein the text data to be processed is some text data subjected to word segmentation, and the text data to be processed includes a large number of words. In the embodiment of the present invention, the text data to be processed may be directly obtained, or the original text data may be obtained first, and the text data to be processed is obtained by performing natural language processing methods such as preprocessing and word segmentation on the obtained original text data, which is not limited in the embodiment of the present invention.

In one implementation, the text data to be processed may be obtained by:

First, original text data is obtained; it includes data found in various flat files, network data collected with web crawler technology, data provided by users, and the like. The original text data can then be cleaned and fused: the large amount of garbage data it contains is removed, the useful data that remains is integrated, and the original text data obtained from each channel is standardized to eliminate differences among heterogeneous data files in different formats. The original text data is thereby converted into processable structured data or unstructured text data, yielding a data asset pool.

And then, word segmentation processing can be carried out on the text data in the data asset pool, words in the text data are identified, and the text data to be processed is obtained. The process of performing word segmentation processing on the text data in the data asset pool and identifying words in the text data can adopt a shortest path algorithm:

The acquired original text data is segmented into multiple word strings, and an association graph between the word strings is constructed according to the association relationships between them. The association graph is then evaluated with a preset word-frequency probability algorithm to obtain the word-frequency probability of each word associated with each word string. For each word string, these word-frequency probabilities are used to eliminate the ambiguity introduced when the original text data was segmented. For example, if the original text data is "my exact address is here", it may be segmented as "my \ exactly \ address \ at \ here"; such a segmentation is ambiguous, so disambiguation is needed in order to identify the words in the original text data more accurately.
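As a non-limiting illustration of shortest-path segmentation, the following Python sketch treats every dictionary word found in the text as an edge of a word graph, weights each edge by the negative log of the word's relative frequency, and keeps the cheapest path from the start of the text to its end. The frequency table `WORD_FREQ`, the function name `segment`, and the unigram cost are assumptions chosen for illustration; the embodiment only requires a preset word-frequency probability algorithm.

```python
import math

# Hypothetical word-frequency table; a real system would derive it from a corpus.
WORD_FREQ = {"new": 50, "york": 20, "newyork": 5, "times": 30, "yorktimes": 1}


def segment(text, word_freq, max_len=10):
    """Split `text` into dictionary words along the cheapest path of the word graph."""
    total = sum(word_freq.values())
    n = len(text)
    # best[i] = (cost of the cheapest segmentation of text[:i], start of its last word)
    best = [(math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_freq and best[start][0] < math.inf:
                # Edge weight: negative log of the word's relative frequency.
                cost = best[start][0] - math.log(word_freq[word] / total)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[n][0] == math.inf:
        raise ValueError("text cannot be segmented with this dictionary")
    # Walk the stored start indices backwards to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


result = segment("newyorktimes", WORD_FREQ)
```

With this hypothetical table, the competing segmentations "newyork / times" and "new / yorktimes" cost more than "new / york / times", which is how the word-frequency probabilities resolve the ambiguity in this sketch.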

Alternatively, an n-gram model method, a maximum matching algorithm, a cross ambiguity algorithm, and the like may also be adopted, which is not limited in the embodiment of the present invention.

S102: Extract entity data and event data from the text data to be processed.

After the text data to be processed is obtained, further, entity extraction and event extraction can be performed on the text data to be processed, and entity data and event data can be obtained from the text data to be processed. The entity extraction and the event extraction of the text data to be processed can be performed simultaneously or sequentially according to a certain sequence, which is not limited in the embodiment of the present invention.

In the embodiment of the present invention, the entity data refers to concepts having recognized meanings, such as a person name, a place name, and the like, and the event data refers to a thing having a plurality of event elements.

For example, if an ontology of the interpersonal relationship field is to be constructed, the entity data in the ontology may be a person, such as "Zhang San" or "Li Si", or a place, such as "Beijing" or "a certain hotel". Each piece of entity data also has corresponding attributes, such as the sex and age of "Zhang San", or the area and time zone of "Beijing". The event data may be something that happens between people, for example "Zhang San attacked Li Si on September 13", which includes the following event elements: the subject "Zhang San", the object "Li Si", the event means "attack", and the event time "September 13". The event data may also be something that happens between a person and a place, for example "Zhang San stayed in a certain hotel on September 13", which includes the following event elements: the subject "Zhang San", the object "a certain hotel", the event means "stay in", and the event time "September 13".

Specifically, in one implementation, the entity data may be extracted from the text data to be processed by:

First, corpus tagging is performed on each word in the text data to be processed to determine its part of speech, which may be a noun, a verb, an adjective, and so on. Specifically, during corpus tagging, a conditional random field model is used to obtain, for each word, its transition probability, state probability, and feature weight from a feature template library obtained in advance; the probabilities that the word is each of the different parts of speech are then calculated from the transition probability, the state probability, and the feature weight; and the part of speech that meets a preset first probability condition, which generally means having the maximum probability, is taken as the part of speech of the word.

The transition probability of a word refers to the probability that the word following it in the text data to be processed has each of the different parts of speech; for example, assuming that the current word is a verb, the probability that the next word is a noun can be calculated as x1, the probability that the next word is a verb as x2, and so on. The state probability is the probability that the i-th position is marked as a certain part of speech; for example, the probability that the first word of each sentence is a noun is y1, the probability that the second word of each sentence is a verb is y2, and so on. The feature weight mainly represents the probability that the word itself corresponds to each part of speech; for example, the probability that the current word is a noun is m1, the probability that the current word is a verb is m2, and so on. Using the Viterbi algorithm, the probability of each part of speech for each word can be calculated from the transition probability, the state probability, and the feature weight of each word.
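As a non-limiting illustration of the Viterbi decoding mentioned above, the following Python sketch combines a transition score, a first-position state score, and a per-word score to choose the most probable part-of-speech sequence. The tables `START`, `TRANS`, and `EMIT` hold hypothetical values, and the sketch simplifies the state probability to the first position only; in the described method these quantities would come from the feature template library of a trained conditional random field.

```python
import math

TAGS = ["noun", "verb"]

# Hypothetical scores for illustration only.
START = {"noun": 0.7, "verb": 0.3}                      # state probability, first position
TRANS = {("noun", "noun"): 0.3, ("noun", "verb"): 0.7,
         ("verb", "noun"): 0.8, ("verb", "verb"): 0.2}   # transition probability
EMIT = {"dogs": {"noun": 0.9, "verb": 0.1},
        "chase": {"noun": 0.2, "verb": 0.8},
        "cats": {"noun": 0.9, "verb": 0.1}}              # per-word feature weight


def viterbi(words):
    """Return the highest-scoring part-of-speech sequence for `words`."""
    # score[t][tag] = best log-score of any tag sequence ending in `tag` at position t
    score = [{tag: math.log(START[tag]) + math.log(EMIT[words[0]][tag]) for tag in TAGS}]
    back = []
    for word in words[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda p: prev[p] + math.log(TRANS[(p, tag)]))
            cur[tag] = (prev[best_prev] + math.log(TRANS[(best_prev, tag)])
                        + math.log(EMIT[word][tag]))
            ptr[tag] = best_prev
        score.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: score[-1][t])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


tags = viterbi(["dogs", "chase", "cats"])
```

Selecting the maximum-scoring sequence corresponds to the case in which the first probability condition is the maximum probability.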

Then, the words whose part of speech is noun are screened out and their semantic information is identified; for example, "Zhang San" is a person name, "Beijing" is a place name, and "very" has no semantic information. The words identified as having semantic information can be used as the words to be processed.

Further, words which do not exist in a preset dictionary can be screened out from the words to be processed to serve as candidate entity data, wherein the preset dictionary refers to a default dictionary preset in the natural language processing technology, and the words included in the preset dictionary can be considered as known words and cannot serve as the candidate entity data.

Finally, the candidate entity data is extracted according to a preset entity extraction rule to obtain entity data, that is, a correspondence is established between each entity and its attributes. The entity extraction rule can be set according to the requirements of the user. For example, if an ontology about interpersonal relationships is to be constructed, the ontology may need only "person" entities and no "place" entities. Alternatively, an entity template may be set: for each "person", only the age and gender attributes may be needed, and not other attributes such as native place or constellation; if a person has an age attribute but no gender attribute, the gender attribute of that person may be marked as null. In this way, entity data that meets the user's needs and has a uniform format is obtained, which reduces the amount of computation in the ontology construction process and also facilitates the storage and querying of entity data.
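As a non-limiting illustration of such an entity extraction rule, the following Python sketch keeps only the entity types listed in a template, retains only the attributes the template names, and marks missing attributes as null (`None`). The template `ENTITY_TEMPLATE`, the function name `apply_entity_template`, and the dictionary layout of the candidates are assumptions made for illustration.

```python
# Hypothetical template: only "person" entities are wanted, with age and gender.
ENTITY_TEMPLATE = {"person": ["age", "gender"]}


def apply_entity_template(candidates, template):
    """Filter candidate entities by type and project them onto the template attributes."""
    entities = []
    for cand in candidates:
        fields = template.get(cand["type"])
        if fields is None:
            continue  # this entity type is not required by the rule
        attrs = cand.get("attributes", {})
        entity = {"name": cand["name"], "type": cand["type"]}
        # Attributes absent from the candidate are stored as None (null).
        entity.update({field: attrs.get(field) for field in fields})
        entities.append(entity)
    return entities


candidates = [
    {"name": "Zhang San", "type": "person", "attributes": {"age": 30, "hometown": "Beijing"}},
    {"name": "Beijing", "type": "place", "attributes": {"time_zone": "UTC+8"}},
]
entities = apply_entity_template(candidates, ENTITY_TEMPLATE)
```

Because every retained entity carries the same keys, the result can be stored and queried in a uniform format.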

In the embodiment of the invention, the extraction of the event data from the text data to be processed can be directly realized by performing text extraction on the text data to be processed.

First, candidate event data may be identified directly from the text data to be processed, wherein each candidate event data is composed of one or more event elements, such as: the occurrence time of the event, the participants, the evolution state of the event, the occurrence environment of the event, the occurrence conditions of the event, and the like.

Then, the identified candidate event data may be screened according to a preset event extraction rule, and the candidate event data that passes the screening is used as event data. Similarly, the event extraction rule may be set according to the requirements of the user. For example, if an ontology about interpersonal relationships is to be constructed, the ontology may need only events whose participant is a "human", and not events whose participant is a "machine". Alternatively, an event data template may be set: for each event, only its participants and its evolution state may be needed, and not other elements such as the occurrence time, the occurrence environment, or the occurrence conditions. In this way, event data that meets the user's needs and has a uniform format is obtained, and the amount of computation in the ontology construction process is further reduced. Moreover, the event data can be classified and stored according to its elements, which further facilitates subsequent querying of the event data.
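As a non-limiting illustration of screening and classifying event data, the following Python sketch keeps candidate events that have at least one participant of a required type, projects each onto the elements named by a template, and then groups the results by which elements are actually present. The template `EVENT_TEMPLATE`, the "at least one participant" reading of the rule, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical template: keep only the participants and the evolution state.
EVENT_TEMPLATE = ("participants", "evolution_state")


def filter_events(candidates, required_type="human", template=EVENT_TEMPLATE):
    """Keep events with a participant of `required_type`, projected onto `template`."""
    kept = []
    for cand in candidates:
        types = {p["type"] for p in cand.get("participants", [])}
        if required_type not in types:
            continue
        kept.append({key: cand.get(key) for key in template})
    return kept


def group_by_elements(events):
    """Classify events by which template elements are filled in."""
    groups = defaultdict(list)
    for event in events:
        present = tuple(sorted(k for k, v in event.items() if v is not None))
        groups[present].append(event)
    return dict(groups)


candidates = [
    {"participants": [{"name": "Zhang San", "type": "human"},
                      {"name": "Li Si", "type": "human"}],
     "evolution_state": "attack", "time": "September 13"},
    {"participants": [{"name": "robot-1", "type": "machine"}],
     "evolution_state": "restart"},
    {"participants": [{"name": "Wang Wu", "type": "human"}]},
]
events = filter_events(candidates)
groups = group_by_elements(events)
```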

S103: Predict entity relationships among the entity data based on the text data to be processed.

After the entity data are obtained, the relation between the entity data can be predicted by combining the context of the entity data in the text data to be processed, and the entity relation is obtained.

Continuing with the above example, assuming that the constructed ontology concerns the field of interpersonal relationships, the entity relationships between the entity data can be interpersonal relationships; for example, the entity relationship between "Zhang San" and "Li Si" is an "employment relationship", the entity relationship between "Zhang San" and "Wang Wu" is a "co-worker relationship", and so on. Alternatively, the entity relationship between the entity data may be a relationship between a person and a place; for example, if the entity relationship between "Zhang San" and "School A" is "alma mater and student", and the entity relationship between "Li Si" and "School A" is also "alma mater and student", then the entity relationship between "Zhang San" and "Li Si" may be inferred to be "schoolmates", and so on.
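As a non-limiting illustration of the inference in the last example, the following Python sketch groups people by the place to which they share the same relationship and emits an inferred relationship for every pair in a group. The relationship labels "alma mater" and "schoolmate", the triple layout, and the function name `infer_shared_relations` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def infer_shared_relations(relations, via="alma mater", inferred="schoolmate"):
    """Infer person-to-person relations from shared person-to-place relations.

    `relations` is a list of (person, relation, target) triples.
    """
    members = defaultdict(list)
    for person, relation, target in relations:
        if relation == via and person not in members[target]:
            members[target].append(person)
    inferred_relations = []
    for people in members.values():
        # Every pair of people attached to the same target gets the inferred relation.
        for a, b in combinations(sorted(people), 2):
            inferred_relations.append((a, inferred, b))
    return inferred_relations


relations = [
    ("Zhang San", "alma mater", "School A"),
    ("Li Si", "alma mater", "School A"),
    ("Wang Wu", "alma mater", "School B"),
]
inferred = infer_shared_relations(relations)
```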

In one implementation, the relationship between entity data may be predicted as follows:

First, the syntactic component of each word in the text data to be processed is marked by using a syntactic structure model obtained by pre-training. The syntactic component of a word refers to the role the word plays in the sentence to which it belongs, such as subject, predicate, object, and the like. The syntactic structure model is obtained by training a preset first training model with first sample data, and the first training model may be a Markov model, a neural network model, or another machine learning model, which is not limited in the embodiment of the present invention.

Second, the semantic role of each word is predicted by using a pre-trained semantic role labeling model according to the labeled syntactic component of each word. Semantic roles mainly refer to the semantic relationships between a word whose syntactic component is labeled as a predicate and the other words, and mainly include actors, victims, objects, experiencers, beneficiaries, tools, places, targets, sources, and the like.

For example, assume that the text data to be processed is "Zhang San / attack / Li Si". Through syntactic component labeling, "Zhang San" is the subject, "attack" is the predicate, and "Li Si" is the object. Then, according to the relationship between "Zhang San", "Li Si" and the predicate "attack", the semantic role of "Zhang San" can be labeled as the actor, that is, the active side of an action, and the semantic role of "Li Si" can be labeled as the victim, that is, the passive side of an action.

The semantic role labeling model is obtained by training a preset second training model with second sample data, and the second training model may be a support vector machine model, a KNN (K-Nearest Neighbor) classification model, or another machine learning model.

In the embodiment of the present invention, the first sample data and the second sample data are generally different data. For convenience of description, the data used for training the syntactic structure model is referred to as first sample data, and the data used for training the semantic role labeling model is referred to as second sample data.

Third, the semantic roles of the extracted entity data are determined according to the predicted semantic role of each word, and the entity relationships among the entity data are analyzed. In the previous step, the semantic role of each word is obtained; the words in the text data to be processed can then be matched against the extracted entity data, so as to determine the semantic role of each entity data. Furthermore, the semantic roles of the entity data can be analyzed to obtain the entity relationships among the entity data.

For example, continuing with the above example, after the semantic roles of the two words "Zhang San" and "Li Si" are determined, the two words may be matched with the entity data extracted in the previous step, and the semantic roles of the two words are carried over to the two entity data "Zhang San" and "Li Si". Then, the semantic roles of the two entity data may be analyzed in combination with a preset entity relationship extraction rule. For example, according to the semantic roles of "Zhang San" and "Li Si" and the predicate itself, that is, the actor, the victim, and the predicate "attack", it may be determined that the entity relationship between "Zhang San" and "Li Si" is that of actor and victim.
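The three steps above can be traced end to end on the "Zhang San / attack / Li Si" example. In the sketch below, a fixed lookup table stands in for the trained semantic role labeling model, and the syntactic components are supplied by hand instead of by the syntactic structure model; both simplifications, as well as all function names, are assumptions made for illustration.

```python
# Hypothetical rule table standing in for the trained semantic role labeling model.
ROLE_BY_COMPONENT = {"subject": "actor", "object": "victim"}


def label_roles(tagged_words):
    """tagged_words: list of (word, syntactic_component) pairs for one sentence."""
    predicate = next(word for word, comp in tagged_words if comp == "predicate")
    roles = {
        word: ROLE_BY_COMPONENT[comp]
        for word, comp in tagged_words
        if comp in ROLE_BY_COMPONENT
    }
    return predicate, roles


def entity_relation(tagged_words, entities):
    """Match word roles against extracted entities and build an (actor, predicate, victim) triple."""
    predicate, roles = label_roles(tagged_words)
    # Only words that were also extracted as entity data keep their role.
    entity_roles = {word: role for word, role in roles.items() if word in entities}
    actor = next((w for w, r in entity_roles.items() if r == "actor"), None)
    victim = next((w for w, r in entity_roles.items() if r == "victim"), None)
    if actor is None or victim is None:
        return None
    return (actor, predicate, victim)


tagged = [("Zhang San", "subject"), ("attack", "predicate"), ("Li Si", "object")]
entities = {"Zhang San", "Li Si"}
relation = entity_relation(tagged, entities)
print(relation)
```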

Alternatively, other approaches may be adopted for entity relationship extraction. One example is an entity relationship extraction method based on a kernel function, which directly takes the original form of a character string as the processing object and implements entity relationship extraction by calculating a kernel function between any two processing objects. Another example is an entity relationship extraction method based on deep learning, which uses a recursive neural network: syntactic analysis is first performed on the text data to be processed; a vector representation is then learned for each node of the syntax tree; starting from the word vectors at the bottom of the syntax tree, the vectors are combined iteratively by the recursive neural network according to the syntactic structure of the text data to be processed; finally, a vector representation of each sentence in the text data to be processed is obtained and used for entity relationship classification. The embodiment of the present invention is not limited thereto.
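The embodiment does not name a specific kernel function. As one possible illustration, the sketch below uses a character n-gram spectrum kernel, a common string kernel that operates directly on the raw character strings; the choice of this kernel and of `n=2` is an assumption.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count every contiguous character n-gram in the string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a, b, n=2):
    """Inner product of the n-gram count vectors of two strings."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    return sum(counts_a[g] * counts_b[g] for g in counts_a if g in counts_b)


print(spectrum_kernel("abab", "ab"))
print(spectrum_kernel("abc", "xyz"))
```

A kernel value computed this way could be fed to a kernel-based classifier, such as a support vector machine, to classify the relationship expressed by a text span.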

S104: performing semantic analysis on the event data, and generating an event system network composed of the event relationships among the event data based on the analysis result, to obtain an ontology comprising the entity data, the event data, the entity relationships and the event system network.

Furthermore, semantic analysis can be performed on the extracted event data, and based on the semantics of the event data, the semantic relation between the event data can be extracted, so that an event system network formed by the event relation between the event data is generated.

Specifically, in the first step, semantic analysis may be performed on the extracted event data, and based on the analysis result, an event occurrence sequence is constructed.

When semantic analysis is performed, reference resolution may be carried out on each event data to determine the referents of pronouns such as "you", "I" and "he" in the event data, which improves the accuracy of the semantic analysis. After the semantics of each event data are obtained, the order in which the event data occur can be determined by using a natural language reasoning algorithm, and an event occurrence sequence can then be constructed according to that order, where the event occurrence sequence refers to a sequence formed by connecting the event data in their order of occurrence.

For example, "Zhang San checked into a certain hotel on September 13" and "he left from here on September 14" are two event data. According to the context in the text to be processed, reference resolution is performed first: "he" and "here" refer to "Zhang San" and "a certain hotel", respectively; that is, the meaning of "he left from here on September 14" is "Zhang San left a certain hotel on September 14". Further, the order of the events can be deduced from these two event data: Zhang San must check into the hotel before he can leave it. The event occurrence sequence therefore runs from "Zhang San checked into a certain hotel on September 13" to "he left from here on September 14".
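The hotel example can be sketched in two stages: reference resolution followed by ordering. The token-replacement function below is a toy stand-in for a real reference resolution component, and ordering by a `(month, day)` tuple stands in for the natural language reasoning step; both are assumptions for illustration.

```python
def resolve_references(text, referents):
    """Toy reference resolution: replace each listed token with its referent."""
    return " ".join(referents.get(word, word) for word in text.split())


def build_sequence(events, referents):
    """Resolve references in every event, then order the events by occurrence time."""
    resolved = [
        {"text": resolve_references(e["text"], referents), "when": e["when"]}
        for e in events
    ]
    return sorted(resolved, key=lambda e: e["when"])


events = [
    {"text": "he leaves here", "when": (9, 14)},                         # September 14
    {"text": "Zhang San checks into a certain hotel", "when": (9, 13)},  # September 13
]
referents = {"he": "Zhang San", "here": "a certain hotel"}

sequence = build_sequence(events, referents)
for event in sequence:
    print(event["when"], event["text"])
```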

In the second step, event relations among the event data can be determined based on the constructed event occurrence sequence, and an event system network composed of the event relations among the event data is generated, wherein the event relations comprise causal relations, accompanying relations, sequential relations and the like.

After the event occurrence sequence is obtained, a directed acyclic graph can be constructed based on the event occurrence sequence. The directed acyclic graph can then be processed by an algorithm such as a Bayesian network model to obtain the event transition probabilities among the event data, that is, for a given event data, the probability that it further develops into another event data. Then, according to the event transition probabilities among the event data, the event relationships among the event data are determined, and an event system network composed of the event relationships among the event data is generated.

The event transition probability differs for each event relationship. For example, if event data A and event data B are in a causal relationship, the event transition probability between them may be 50%; if they are in an accompanying relationship, the event transition probability between them may be 20%; and so on.
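As a rough illustration of how transition probabilities could be turned into event relationships, the sketch below estimates the probabilities by simple counting over several event occurrence sequences and then maps each probability to a relationship by thresholds. The counting estimate replaces the Bayesian network computation mentioned above, and the thresholds (0.5 and 0.2, echoing the example values) and the event names are assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next event | current event) by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1

    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        for following, count in followers.items():
            probs[(current, following)] = count / total
    return probs


def relation_for(probability, causal_at=0.5, accompanying_at=0.2):
    """Map a transition probability to an event relationship (assumed thresholds)."""
    if probability >= causal_at:
        return "causal"
    if probability >= accompanying_at:
        return "accompanying"
    return "sequential"


sequences = [
    ["check in", "check out"],
    ["check in", "check out"],
    ["check in", "check out"],
    ["check in", "complain"],
]
probs = transition_probabilities(sequences)
# The event system network as a dict of directed edges: (from, to) -> relationship
network = {edge: relation_for(p) for edge, p in probs.items()}
print(network)
```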

Through the steps, the text data to be processed is analyzed, and entity data, event data, entity relations and an event system network are obtained, in other words, an ontology is obtained. Because the ontology comprises the event system network, the obtained ontology is a network structure, and compared with a hierarchical structure in the prior art, the obtained ontology can better embody the relationship between entity data and event data.

In one implementation, an ontology template may be preset. Similar to the template of the entity data and the template of the event data, the ontology template specifies the format of the data required by the ontology, for example, which attributes each entity data has, which elements each event data has, which entity relationships may exist between the entity data, which event relationships may exist between the event data, and the like. The obtained entity data, event data, entity relationships and event system network can then be mapped to the preset ontology template, so that the data in the resulting ontology is more standardized, which further facilitates user queries.

Moreover, the preset ontology template can be updated at any time according to the requirements of the user: the formats of the required entity data, event data, entity relationships and event system network are added or deleted to obtain a new ontology template, and the entity data, the event data, the entity relationships and the event system network can then be mapped to the new ontology template to obtain a new ontology. In this way, the ontology is updated and upgraded, and the information loss caused by a fixed ontology template is reduced.

Further, after the ontology is obtained, the ontology may be evaluated using a preset evaluation rule. For example, expert knowledge may be used to evaluate the accuracy of the entity data, event data, entity relationships and event system network in the ontology, to determine whether the data extracted from the text data to be processed conforms to common sense, and so on.

Whether the obtained ontology meets a preset expected condition is then judged according to the evaluation result, and if so, the ontology is displayed. During display, the entity data, event data, entity relationships and event system network in the ontology can be drawn as a relationship graph, so that the knowledge graph in the ontology is presented to the user visually and is convenient to browse.
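Mapping data to an ontology template, and re-mapping after the template is updated, can be sketched as a field projection. Representing a template as a list of field names, and the field names themselves, are assumptions for illustration.

```python
def map_to_template(record, template):
    """Keep only the fields listed in the template; fields missing from the record become None."""
    return {field: record.get(field) for field in template}


# Hypothetical entity template and raw extracted entity record.
entity_template = ["name", "type"]
raw_entity = {"name": "Zhang San", "type": "person", "school": "School A", "raw_offset": 17}

mapped = map_to_template(raw_entity, entity_template)
print(mapped)

# Updating the template: add a field, then map the same raw data again.
new_template = entity_template + ["school"]
remapped = map_to_template(raw_entity, new_template)
print(remapped)
```

Keeping the raw extracted data and re-running the mapping against the new template is what allows information dropped by the old template (here, the school) to be recovered.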

As can be seen from the above, the ontology construction method provided in the embodiment of the present invention extracts the entity data and the event data from the text data to be processed, predicts the entity relationships among the entity data in the obtained semantic metadata, performs semantic analysis on the event data in the obtained semantic metadata, and generates an event system network composed of the event relationships among the event data based on the analysis result, thereby obtaining an ontology including the entity data, the event data, the entity relationships, and the event system network. Therefore, the ontology constructed in this scheme comprises entity data, event data, entity relationships and an event system network, and is more complete than the ontology constructed in the existing scheme.

Fig. 2 is another schematic flowchart of the ontology construction method provided in the embodiment of the present invention, which includes the following steps:

S201: acquiring text data to be processed.

S202: extracting entity data and event data from the text data to be processed.

In one implementation, the entity data may be extracted from the text data to be processed by:

firstly, performing corpus tagging on each word in text data to be processed, determining the part of speech of each word, then screening out words of which the part of speech is a noun, identifying semantic information of the words, further screening out words which do not exist in a preset dictionary from the words to be processed as candidate entity data, and finally extracting the candidate entity data according to a preset entity extraction rule to obtain entity data, namely establishing a corresponding relation between each attribute of the entity and the entity.

In one implementation, the event data may be extracted from the text data to be processed as follows. First, candidate event data may be identified directly from the text data to be processed, where each candidate event data is composed of one or more event elements, such as the occurrence time of the event, the participants, the evolution state of the event, the occurrence environment of the event, the occurrence conditions of the event, and the like. Then, the identified candidate event data may be screened according to a preset event extraction rule, and the candidate event data that passes the screening may be used as the event data.

S203: labeling the syntactic component of each word in the text data to be processed by using a pre-trained syntactic structure model.

The syntactic component of each word refers to the role the word plays in the sentence to which it belongs, and includes a subject, a predicate, an object, an adverbial, and the like.

In one implementation, the syntactic structure model is obtained by training a preset first training model with first sample data, where the first training model may be a Markov model, a neural network model, or another machine learning model, which is not limited in the embodiment of the present invention.

The acquired first sample data is input into the preset first training model, and the output result includes an initial probability vector, a transition matrix and a state matrix of the syntactic components of each word in the first sample data. The initial probability vector refers to the probability that each word in the first sample data corresponds to each syntactic component in its sentence in the current state; the transition matrix refers to the probability that each word in the first sample data changes from one syntactic component to another; and the state matrix refers to all possible syntactic components corresponding to the word.

Meanwhile, whether the output result meets a preset condition is judged, if not, iteration adjustment is carried out on the preset first training model until the output result meets the preset condition, and thus a syntactic structure model is obtained. The preset condition may be that the number of iterations in the model training process is limited, for example, when the number of iterations reaches 500, it may be considered that the syntactic structure model has been trained; or, the preset condition may also be a limit on the accuracy of the trained model, for example, the first sample data is divided into two parts, namely training data and test data, and the test data is used to determine whether the syntactic component result output by the trained model is accurate, and if the accuracy reaches a preset threshold, the syntactic structure model may be considered to have been trained.
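The two preset conditions described above, an iteration cap and an accuracy threshold on held-out test data, can be sketched as a generic training loop. `ToyModel`, whose accuracy grows by 0.1 per step, is a hypothetical stand-in for the first training model and does not reflect any real training behavior.

```python
def train_until(step, evaluate, max_iterations=500, target_accuracy=None):
    """Call step() until the accuracy target is met or the iteration cap is reached.

    Returns (number of iterations run, name of the condition that stopped training).
    """
    for iteration in range(1, max_iterations + 1):
        step()
        if target_accuracy is not None and evaluate() >= target_accuracy:
            return iteration, "accuracy"
    return max_iterations, "iterations"


class ToyModel:
    """Hypothetical stand-in whose test accuracy grows by 0.1 per training step."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1

    def accuracy(self):
        return min(1.0, self.steps / 10)


model = ToyModel()
by_accuracy = train_until(model.step, model.accuracy, target_accuracy=0.8)

other_model = ToyModel()
by_count = train_until(other_model.step, other_model.accuracy)

print(by_accuracy, by_count)
```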

The text data to be processed is input into the syntactic structure model to obtain an initial probability vector, a transition matrix and a state matrix for each word in the text data to be processed. Then, according to the output of the model, the probability of each syntactic component for each word can be calculated by a corresponding algorithm, such as the Viterbi algorithm, and the syntactic component with the maximum probability is taken as the syntactic component of the word and used to label it.

S204: predicting the semantic role of each word by using a pre-trained semantic role labeling model according to the labeled syntactic component of each word.

The semantic role mainly refers to the semantic relationship between a word whose syntactic component is labeled as a predicate and the other words, and mainly includes an actor, an object, an experiencer, a beneficiary, a tool, a place, a target, a source and the like.

In one implementation, the text data to be processed, after its syntactic components are labeled, may be input into the pre-trained semantic role labeling model. For each word whose syntactic component is labeled as a predicate, the probabilities of the various semantic roles that may hold between that word and the other words are calculated, and the semantic role satisfying a preset second probability condition is taken as the semantic role between the predicate and the other word, where the second probability condition generally refers to the maximum probability.
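A minimal sketch of Viterbi decoding for this labeling step follows, treating syntactic components as hidden states and words as observations. All probability values are made-up toy numbers, and the emission table `emit_p` (the probability of a word given a component) is an assumption that the description above does not spell out.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(row)

    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    path.reverse()
    return path


states = ["subject", "predicate", "object"]
start_p = {"subject": 0.8, "predicate": 0.1, "object": 0.1}
trans_p = {
    "subject": {"subject": 0.1, "predicate": 0.8, "object": 0.1},
    "predicate": {"subject": 0.2, "predicate": 0.1, "object": 0.7},
    "object": {"subject": 0.4, "predicate": 0.3, "object": 0.3},
}
emit_p = {
    "subject": {"Zhang San": 0.5, "Li Si": 0.5},
    "predicate": {"attack": 1.0},
    "object": {"Zhang San": 0.5, "Li Si": 0.5},
}

labels = viterbi(["Zhang San", "attack", "Li Si"], states, start_p, trans_p, emit_p)
print(labels)
```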

For example, assume that the text data to be processed is "Zhang San / attack / Li Si". Through syntactic component labeling, "Zhang San" is the subject, "attack" is the predicate, and "Li Si" is the object. The text data labeled with syntactic components is input into the pre-trained semantic role labeling model, and the probabilities of the various semantic roles between "attack" and "Zhang San" and between "attack" and "Li Si" can be calculated; for example, the probability that the semantic role of "Zhang San" is the actor is 90%, and the probability that it is the tool is 5%. The semantic role meeting the preset second probability condition is then selected, for example, the semantic role with the largest probability; that is, the semantic role of "Zhang San" is labeled as the actor, and similarly, the semantic role of "Li Si" is labeled as the victim.

The semantic role labeling model is obtained by training a preset second training model with second sample data, where the second training model may be a support vector machine model, a KNN (K-Nearest Neighbor) classification model, or another machine learning model, which is not limited in the embodiments of the present invention. Specifically, the semantic role labeling model can be obtained by training through the following steps:

Second sample data is first obtained; word segmentation, part-of-speech tagging, syntactic analysis and other operations are then performed on the second sample data; data that cannot serve as a semantic role is deleted from the processed second sample data according to a preset deletion rule to obtain training data; and the preset second training model is trained with the training data to obtain the semantic role labeling model. Because the training data has been processed in this way, the recognition performance of the resulting semantic role labeling model is improved.
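Selecting the semantic role under the "maximum probability" condition amounts to an argmax over the model's output for each word. In the sketch below, the 90% and 5% values for "Zhang San" follow the example above; the remaining probabilities are made up for illustration.

```python
def pick_role(role_probabilities):
    """Return the semantic role with the largest probability."""
    return max(role_probabilities, key=role_probabilities.get)


# Hypothetical model output: probability of each role relative to the predicate "attack".
scores = {
    "Zhang San": {"actor": 0.90, "tool": 0.05, "victim": 0.05},
    "Li Si": {"actor": 0.10, "tool": 0.05, "victim": 0.85},
}
roles = {word: pick_role(probabilities) for word, probabilities in scores.items()}
print(roles)
```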

In step S203, the syntactic component calculated for each word in the text data to be processed by the syntactic structure model is not unique, and calculation errors may exist. In addition, the syntactic component of each word may influence the syntactic components of other related words; for example, if A is a predicate, the probability that the word following A is an object may be 50%, whereas if A is a subject, the probability that the word following A is an object may be 10%. Such errors may therefore have a large influence on subsequent calculation results.

In one implementation, before the labeled syntactic component of each word is used in the next calculation, disambiguation can be performed on the syntactic component labeling result to obtain a corrected syntactic component for each word. The semantic role of each word is then predicted by the pre-trained semantic role labeling model according to the corrected syntactic components, which improves the accuracy of the predicted semantic roles.

When disambiguation is performed, the probability that each word is labeled as different syntactic components can be obtained first, the probabilities of the words can be multiplied, and the syntactic component of each word under the condition that the product of the probabilities in the whole sentence is maximum is used as the corrected syntactic component of each word; or, the user may manually review the syntax component, determine whether the syntax component labeling result is accurate, and the like.
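The product-maximizing disambiguation can be sketched as an exhaustive search over all labelings of a short sentence. The per-word probabilities, the `next_given_prev` table (the 0.5 and 0.1 entries echo the predicate/subject example above; the others are made up), and the default value for unlisted pairs are all assumptions for illustration.

```python
from itertools import product
from math import prod


def best_joint_labeling(word_probs, next_given_prev, default=0.05):
    """Pick the labeling that maximizes the product of per-word and adjacent-pair probabilities."""
    best_labeling, best_score = None, -1.0
    for labeling in product(*(list(p) for p in word_probs)):
        score = prod(p[label] for p, label in zip(word_probs, labeling))
        score *= prod(
            next_given_prev.get(pair, default)
            for pair in zip(labeling, labeling[1:])
        )
        if score > best_score:
            best_labeling, best_score = labeling, score
    return best_labeling


# Per-word label probabilities for a three-word sentence (toy values).
word_probs = [
    {"subject": 0.6, "object": 0.4},
    {"predicate": 0.9, "subject": 0.1},
    {"object": 0.45, "subject": 0.55},
]
# Probability of the next word's component given the previous word's component.
next_given_prev = {
    ("subject", "predicate"): 0.8,
    ("predicate", "object"): 0.5,
    ("subject", "object"): 0.1,
}

corrected = best_joint_labeling(word_probs, next_given_prev)
independent = tuple(max(p, key=p.get) for p in word_probs)
print(corrected, independent)
```

Labeling each word independently picks "subject" for the third word, while the joint search corrects it to "object". Exhaustive search is only practical for short sentences; a dynamic programming method such as the Viterbi algorithm scales better.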

S205: determining the semantic roles of the extracted entity data according to the predicted semantic role of each word, and analyzing the entity relationships among the entity data.

In S204, the semantic role of each word is obtained. The words in the text data to be processed, together with their semantic roles, can then be matched against the extracted entity data, thereby determining the semantic role of each entity data. Furthermore, the semantic roles of the entity data can be analyzed to obtain the entity relationships among the entity data.

S206: performing semantic analysis on the event data, and constructing an event occurrence sequence based on the analysis result.

After the event data in the text data to be processed is extracted, semantic analysis can be further performed on the event data, and an event occurrence sequence is constructed based on the analysis result.

S207: determining event relationships among the event data based on the event occurrence sequence, generating an event system network composed of the event relationships among the event data, and obtaining an ontology comprising the entity data, the event data, the entity relationships and the event system network; wherein the event relationships comprise one or more of: causal relationships, accompanying relationships and sequential relationships.

After the event occurrence sequence is obtained, a directed acyclic graph can be constructed based on the event occurrence sequence. The directed acyclic graph can then be processed by an algorithm such as a Bayesian network model to obtain the event transition probabilities among the event data, that is, for a given event data, the probability that it further develops into another event data. Then, according to the event transition probabilities among the event data, the event relationships among the event data are determined, and an event system network composed of the event relationships among the event data is generated.

Corresponding to the ontology construction method, the embodiment of the invention also provides an ontology construction device.

As shown in fig. 3, a schematic structural diagram of an ontology constructing apparatus provided in an embodiment of the present invention is shown, where the apparatus includes:

a data acquisition module 310, configured to acquire text data to be processed;

an information extraction module 320, configured to extract entity data and event data from the text data to be processed;

an entity relationship extraction module 330, configured to predict, based on the text data to be processed, an entity relationship between the entity data;

the event relationship extraction module 340 is configured to perform semantic analysis on the event data, and generate an event system network composed of event relationships among the event data based on an analysis result, so as to obtain an ontology including the entity data, the event data, the entity relationships, and the event system network.

In one implementation, the information extraction module 320 is specifically configured to:

In an implementation manner, the entity relationship extraction module 330 is specifically configured to:

In one implementation, the syntactic structure model may be trained by the following steps:

acquiring first sample data;

the entity relationship extraction module is specifically configured to:

In one implementation, the entity relationship extraction module 330 is further configured to:

the entity relationship extraction module is specifically configured to:

In one implementation, the semantic role labeling model may be obtained by training as follows:

acquiring second sample data;

In an implementation manner, the event relation extraction module 340 is specifically configured to:

In one implementation, the apparatus further comprises:

a template mapping module 350, configured to obtain an ontology template, and map the entity data, the event data, the entity relationships and the event system network to the ontology template to obtain an ontology.

In one implementation, the template mapping module 350 is further configured to:

updating the ontology template to obtain a new ontology template;

In one implementation, the apparatus further comprises:

an ontology evaluation module 360, configured to evaluate the ontology by using a preset evaluation rule to obtain an evaluation result; judge whether the ontology meets a preset expected condition according to the evaluation result; and if the expected condition is met, display the ontology.

As can be seen from the above, the ontology construction apparatus provided in the embodiment of the present invention extracts the entity data and the event data from the text data to be processed, predicts the entity relationships among the entity data in the obtained semantic metadata, performs semantic analysis on the event data in the obtained semantic metadata, and generates an event system network composed of the event relationships among the event data based on the analysis result, thereby obtaining an ontology including the entity data, the event data, the entity relationships, and the event system network. Therefore, the ontology constructed in this scheme comprises entity data, event data, entity relationships and an event system network, and is more complete than the ontology constructed in the existing scheme.

An embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with one another through the communication bus 404;

a memory 403 for storing a computer program;

the processor 401, when executing the program stored in the memory 403, implements the following steps:

acquiring text data to be processed;

extracting entity data and event data from the text data to be processed;

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.

It is noted that, in the text, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. Especially, as for the device embodiment and the electronic device embodiment, since they are basically similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method of ontology construction, the method comprising:

acquiring text data to be processed;

extracting entity data and event data from the text data to be processed;

2. The method of claim 1, wherein the extracting entity data from the text data to be processed comprises:

3. The method of claim 2, wherein determining, for each word, a part of speech of the word by corpus tagging the word comprises:

4. The method of claim 1, wherein the extracting event data from the text data to be processed comprises:

5. The method of claim 1, wherein predicting entity relationships between the entity data based on the text data to be processed comprises:

6. The method of claim 5, wherein the syntactic structure model is trained using the steps of:

acquiring first sample data;

7. The method of claim 5, wherein before predicting the semantic role of each word using a pre-trained semantic role labeling model based on syntactic components of each word to be labeled, the method further comprises:

8. The method of claim 5, wherein predicting the semantic role of each word by using a pre-trained semantic role labeling model according to the syntactic component of each labeled word comprises:

9. The method of claim 5, wherein the semantic role labeling model is trained by the steps of:

acquiring second sample data;

10. The method according to claim 1, wherein the semantically analyzing the event data, and generating an event system network composed of event relations between the event data based on the analysis result comprises:

11. The method of claim 10, wherein the semantic analyzing the event data, and based on the analysis result, constructing an event occurrence sequence comprises:

12. The method according to claim 10, wherein the determining semantic relationships between the event data based on the event occurrence sequence and generating an event system network composed of event relationships between the event data comprises:

13. The method of claim 1, further comprising:

and if the expected condition is met, displaying the ontology.

14. The method of claim 1, wherein obtaining an ontology that includes the entity data, the event data, the entity relationships, and the event system network comprises:

obtaining an ontology template;

15. The method of claim 14, wherein after the mapping the entity data, the event data, the entity relationships, and the event system network to the ontology template to obtain an ontology, the method further comprises:

updating the ontology template to obtain a new ontology template;

16. An ontology-building apparatus, the apparatus comprising:

a data acquisition module, configured to acquire text data to be processed;

17. The apparatus of claim 16, wherein the information extraction module is specifically configured to:

18. The apparatus of claim 17, wherein the information extraction module is specifically configured to:

19. The apparatus of claim 16, wherein the information extraction module is specifically configured to:

20. The apparatus of claim 16, wherein the entity relationship extraction module is specifically configured to:

21. The apparatus of claim 20, wherein the syntactic structure model is trained using the steps of:

acquiring first sample data;

the entity relationship extraction module is specifically configured to:

22. The apparatus of claim 20, wherein the entity relationship extraction module is further configured to:

the entity relationship extraction module is specifically configured to:

23. The apparatus of claim 20, wherein the entity relationship extraction module is specifically configured to:

24. The apparatus of claim 20, wherein the semantic role labeling model is trained by the steps of:

acquiring second sample data;

25. The apparatus according to claim 16, wherein the event relation extraction module is specifically configured to:

26. The apparatus according to claim 25, wherein the event relation extraction module is specifically configured to:

27. The apparatus according to claim 25, wherein the event relation extraction module is specifically configured to:

28. The apparatus of claim 16, further comprising:

29. The apparatus of claim 16, further comprising:

30. The apparatus of claim 29, wherein the template mapping module is further configured to:

updating the ontology template to obtain a new ontology template;

31. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1 to 15 when executing a program stored in the memory.