CN111259161B

CN111259161B - Ontology establishing method and device and storage medium

Info

Publication number: CN111259161B
Application number: CN201811459195.0A
Authority: CN
Inventors: 吴小飞; 浦世亮; 姜伟浩; 闫春
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2018-11-30
Filing date: 2018-11-30
Publication date: 2022-02-08
Anticipated expiration: 2038-11-30
Also published as: CN111259161A; WO2020108641A1

Abstract

The invention discloses an ontology establishing method, an ontology establishing device and a storage medium, and belongs to the technical field of big data processing. The method comprises the following steps: for at least one document used for establishing the ontology, a plurality of tuples are determined according to the at least one document, a plurality of semantic relation data sets are determined according to the plurality of tuples, and then the ontology for the at least one document is established according to the plurality of semantic relation data sets. No manual participation is needed in the whole ontology establishing process, which improves the efficiency of establishing the ontology.

Description

Ontology establishing method and device and storage medium

Technical Field

The present invention relates to the field of big data processing technologies, and in particular, to a method and an apparatus for establishing an ontology, and a storage medium.

Background

An ontology is a data structure used to describe certain documents in a standard, canonical way, so that those documents can be shared according to the ontology. The ontology comprises a plurality of concepts and, for each two associated concepts among them, a concept relationship between the two. For example, an ontology created for documents describing an umbrella includes concepts such as "umbrella", "appliance" and "handle". A concept relationship exists between the concept "umbrella" and the concept "appliance", and it points from "umbrella" to "appliance". A concept relationship also exists between the concept "umbrella" and the concept "handle", and it points from "handle" to "umbrella".

In the related art, when an ontology needs to be established, at least one document provided by a user for establishing the ontology is obtained, the terms in each of the at least one document are classified manually, and the ontology is then established according to the classified terms. The degree of manual involvement in this process is high, which seriously reduces the efficiency of establishing the ontology.

Disclosure of Invention

The embodiments of the invention provide an ontology establishing method, an ontology establishing device and a storage medium, which can improve the efficiency of establishing an ontology. The technical scheme is as follows:

in a first aspect, an ontology establishing method is provided, where the method includes:

obtaining at least one document used for establishing an ontology, and determining a plurality of tuples according to the at least one document, wherein each tuple comprises two first-class words and one second-class word, the first-class words are words used for describing the attributes of an object itself, and the second-class words are words used for indicating the association relationship between different objects;

determining two concepts corresponding to the two first-class terms in each of the plurality of tuples and a concept relationship corresponding to the second-class term in each tuple, replacing the two first-class terms with the two determined concepts respectively, and replacing the second-class term with the determined concept relationship, to obtain a plurality of semantic relation data sets;

and establishing an ontology for the at least one document according to the plurality of semantic relation data groups.

Optionally, the determining two concepts corresponding to the two first-class terms in each of the multiple tuples and the concept relationship corresponding to the second-class term in each tuple includes:

for any tuple A in the plurality of tuples, searching a reference database for two concepts respectively corresponding to the two first-class terms in the tuple A, wherein the reference database is used for describing the concept represented by each of a plurality of terms and the concept relationships between different concepts;

and continuing to search the reference database, according to the two found concepts, for the concept relationship corresponding to the second-class term in the tuple A.

Optionally, the searching for two concepts respectively corresponding to the two first-class terms in the tuple A from the reference database includes:

for any first-class word B of the two first-class words in the tuple A, determining a word in the at least one document that has the same word sense as the first-class word B;

and if the at least one document contains no word with the same word sense as the first-class word B, searching the reference database for the concept corresponding to the first-class word B.

Optionally, after determining the word in the at least one document that has the same word sense as the first-class word B, the method further includes:

if the at least one document contains words with the same word sense as the first-class word B, determining the number of occurrences in the at least one document of the first-class word B and of each word with the same word sense as the first-class word B;

determining the word with the largest number of occurrences among the first-class word B and the words with the same word sense as the first-class word B;

and searching the reference database for the concept corresponding to the word with the largest number of occurrences, and taking the found concept as the concept corresponding to the first-class word B.

Optionally, the method further comprises:

if the concept corresponding to the first-class word B is not found in the reference database, creating a concept for the first-class word B by an LDA (Latent Dirichlet Allocation) algorithm.

Optionally, the continuing to search the reference database, according to the two found concepts, for the concept relationship corresponding to the second-class term in the tuple A includes:

determining the paths between the two found concepts in the reference database to obtain a plurality of paths;

selecting a target path from the plurality of paths according to the path lengths of the plurality of paths, wherein the target path comprises at least one concept relationship;

selecting, from the at least one concept relationship, the concept relationship with the greatest similarity to the second-class term in the tuple A, and determining the selected concept relationship as the concept relationship corresponding to the second-class term in the tuple A.

Optionally, the establishing an ontology for the at least one document according to the plurality of semantic relation data sets includes:

establishing a semantic relation graph according to the plurality of semantic relation data sets, wherein one node in the semantic relation graph corresponds to one concept in the plurality of semantic relation data sets, the relation between two nodes in the semantic relation graph is the concept relationship between the two corresponding concepts, the direction between the two nodes is the direction indicated by the concept relationship between the two corresponding concepts, each node is configured with an in-degree and an out-degree, the in-degree of a node refers to the number of nodes pointing to that node, and the out-degree of a node refers to the number of nodes that the node points to;

for any first node with the in-degree equal to 0 and the out-degree greater than 0, cutting the first node;

when all nodes with the in-degree equal to 0 and the out-degree greater than 0 in the semantic relationship graph are cut, setting the in-degree of the nodes pointed by the nodes with the in-degree equal to 0 and the out-degree greater than 0 in the semantic relationship graph to be 0;

and returning to execute the step of cutting any first node with the in-degree equal to 0 and the out-degree greater than 0 for other nodes except the cut nodes until all nodes in the semantic relation graph are traversed, and taking the finally obtained semantic relation graph as an ontology established for the at least one document.
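The traversal described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the names `build_graph` and `traverse_first_nodes`, the `(concept, relation, concept)` triple layout, and the no-op `cut` callback are all introduced here for illustration, and the actual cutting of a first node is left as a placeholder.

```python
def build_graph(triples):
    """Build out-edges and in-degrees from (concept_a, relation, concept_b) triples.
    An edge is directed from concept_a to concept_b (assumed layout)."""
    out_edges, in_degree = {}, {}
    for a, rel, b in triples:
        out_edges.setdefault(a, {})
        out_edges.setdefault(b, {})
        in_degree.setdefault(a, 0)
        in_degree.setdefault(b, 0)
        if b not in out_edges[a]:
            out_edges[a][b] = rel
            in_degree[b] += 1
    return out_edges, in_degree


def traverse_first_nodes(out_edges, in_degree, cut=lambda node: None):
    """Repeatedly process nodes with in-degree 0 and out-degree > 0, then set
    the in-degree of the nodes they point to to 0, until none remain."""
    in_degree = dict(in_degree)  # work on a copy
    processed, order = set(), []
    while True:
        first_nodes = [n for n in out_edges
                       if n not in processed and in_degree[n] == 0 and out_edges[n]]
        if not first_nodes:
            return order
        for node in first_nodes:
            cut(node)  # placeholder for the cutting step
            processed.add(node)
            order.append(node)
        for node in first_nodes:
            for child in out_edges[node]:
                in_degree[child] = 0


triples = [("umbrella", "has", "handle"),
           ("handle", "has", "grip"),
           ("umbrella", "is_a", "appliance")]
edges, degrees = build_graph(triples)
order = traverse_first_nodes(edges, degrees)
```

In this toy graph, "umbrella" is processed first and "handle" second; "grip" and "appliance" have an out-degree of 0 and are therefore never treated as first nodes.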

Optionally, the cutting the first node includes:

determining at least one node in the semantic relationship graph, wherein the document position of the term in the at least one document indicated by the concept corresponding to the at least one node is adjacent to the document position of the term in the at least one document indicated by the concept corresponding to the first node;

and determining the node with the maximum connection degree with the first node in the at least one node, deleting other nodes except the determined node in the at least one node, and deleting the relationship between the other nodes and the first node.

Optionally, the determining a node with the greatest connection degree with the first node in the at least one node includes:

determining a degree of connectivity between each of the at least one node and the first node based on a first formula;

wherein the first formula is:

where Wi and Wj are the two nodes whose connectivity is being determined, Sim(Wi, Wj) is the similarity between node Wi and node Wj, Rel(Wi, Wj) is the correlation between node Wi and node Wj, alpha and beta are the weighting coefficients configured for the similarity and for the correlation respectively, and the sum of alpha and beta is 1;

and determining, according to the connectivity determined by the first formula, the node with the greatest connectivity to the first node among the at least one node.
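The first formula is not reproduced in this text. Read together with the symbol definitions given for it, one form consistent with those definitions is a weighted sum of the similarity and the correlation. The expression below is an assumption offered for orientation only, and the symbol Con for the connectivity is a name introduced here, not taken from the source:

```latex
\mathrm{Con}(W_i, W_j) = \alpha \cdot \mathrm{Sim}(W_i, W_j) + \beta \cdot \mathrm{Rel}(W_i, W_j), \qquad \alpha + \beta = 1
```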

Optionally, the determining, according to the connectivity determined according to the first formula, a node with the greatest connectivity to the first node in the at least one node includes:

if the maximum connectivity in the connectivity determined according to the first formula is smaller than the connectivity threshold, adjusting the values of the alpha and the beta to obtain an updated first formula;

determining a degree of connectivity between each of the at least one node and the first node based on the updated first formula;

and if the maximum connectivity in the connectivity determined according to the updated first formula is smaller than the connectivity threshold, returning to the step of adjusting the values of the alpha and the beta until the determined maximum connectivity is greater than or equal to the connectivity threshold, and determining the node corresponding to the maximum connectivity determined at the last time as the node with the maximum connectivity to the first node in the at least one node.
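The weight adjustment loop can be illustrated with a short Python sketch. Several things here are assumptions rather than details from the source: the connectivity is taken to be a weighted sum with beta = 1 - alpha, the candidate alpha values are a fixed hypothetical list (the source does not say how alpha and beta are adjusted), and the function gives up with `None` once the list is exhausted so that the loop always terminates.

```python
def connectivity(sim, rel, alpha):
    """Assumed weighted-sum form, with beta = 1 - alpha."""
    return alpha * sim + (1.0 - alpha) * rel


def best_neighbor(candidates, threshold, alphas=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """candidates maps node -> (similarity, correlation) with the first node.
    Try each alpha in turn until the best connectivity reaches the threshold.
    Returns (node, alpha), or None if no alpha in the list succeeds."""
    for alpha in alphas:
        node = max(candidates, key=lambda n: connectivity(*candidates[n], alpha))
        if connectivity(*candidates[node], alpha) >= threshold:
            return node, alpha
    return None


# Hypothetical similarity and correlation scores for two adjacent nodes.
choice = best_neighbor({"handle": (0.9, 0.2), "rain": (0.4, 0.5)}, threshold=0.7)
```

With these toy scores the threshold is first reached at alpha = 0.8, and "handle" is selected.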

Optionally, before the cutting any first node with an in-degree equal to 0 and an out-degree greater than 0, the method further includes:

if an isolated node with the in-degree equal to 0 and the out-degree equal to 0 exists in the semantic relationship graph, configuring the isolated node to point to a node with the in-degree equal to 0 and the out-degree greater than 0 in the semantic relationship graph to obtain an updated semantic relationship graph;

and executing the operation of clipping any first node with the in-degree equal to 0 and the out-degree greater than 0 based on the updated semantic relation graph.
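A minimal Python sketch of the isolated-node step follows. The source does not say which node with in-degree 0 and out-degree greater than 0 an isolated node should point to, nor what relation label the new edge carries; picking the first such node and the label `related_to` are assumptions made here, as is the function name `attach_isolated_nodes`.

```python
def attach_isolated_nodes(out_edges, in_degree, relation="related_to"):
    """Point every isolated node (in-degree 0, out-degree 0) at a node that has
    in-degree 0 and out-degree > 0. Both dictionaries are updated in place."""
    roots = [n for n in out_edges if in_degree[n] == 0 and out_edges[n]]
    isolated = [n for n in out_edges if in_degree[n] == 0 and not out_edges[n]]
    if not roots:
        return out_edges, in_degree
    target = roots[0]  # assumption: the choice of target is not specified
    for node in isolated:
        out_edges[node][target] = relation
        in_degree[target] += 1
    return out_edges, in_degree


out_edges = {"umbrella": {"handle": "has"}, "handle": {}, "rain": {}}
in_degree = {"umbrella": 0, "handle": 1, "rain": 0}
attach_isolated_nodes(out_edges, in_degree)
```

After the call, "rain" is no longer isolated: it has an out-degree of 1 and an in-degree of 0, so it qualifies as a first node in the subsequent cutting step.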

Optionally, the determining a plurality of tuples from the at least one document comprises:

performing word segmentation processing on each document in the at least one document to obtain a plurality of words;

determining a part-of-speech of each of the plurality of words;

determining the plurality of tuples according to the part of speech of each word in the plurality of words and the document position of each word in the at least one document.

In a second aspect, there is provided an ontology creating apparatus, the apparatus comprising:

an obtaining module, configured to obtain at least one document used for establishing an ontology and determine a plurality of tuples according to the at least one document, wherein each tuple comprises two first-class words and one second-class word, the first-class words are words used for describing the attributes of an object itself, and the second-class words are words used for indicating the association relationship between different objects;

a determining module, configured to determine two concepts corresponding to the two first-class terms in each of the plurality of tuples and a concept relationship corresponding to the second-class term in each tuple, replace the two first-class terms with the two determined concepts respectively, and replace the second-class term with the determined concept relationship, to obtain a plurality of semantic relation data sets;

and the establishing module is used for establishing an ontology aiming at the at least one document according to the plurality of semantic relation data groups.

Optionally, the determining module includes:

a first searching unit, configured to, for any tuple A in the plurality of tuples, search a reference database for two concepts respectively corresponding to the two first-class terms in the tuple A, wherein the reference database is used for describing the concept represented by each of a plurality of terms and the concept relationships between different concepts;

and the second searching unit is used for continuously searching the concept relationship corresponding to the second class of words in the tuple A from the reference database according to the two searched concepts.

Optionally, the first search unit is specifically configured to:

Optionally, the first search unit is further specifically configured to:

Optionally, the apparatus further comprises:

a creating unit, configured to create a concept for the first-class word B through an LDA algorithm if the concept corresponding to the first-class word B is not found in the reference database.

Optionally, the second search unit is specifically configured to:

Optionally, the establishing module includes:

an establishing unit, configured to establish a semantic relation graph according to the plurality of semantic relation data sets, wherein one node in the semantic relation graph corresponds to one concept in the plurality of semantic relation data sets, the relation between two nodes in the semantic relation graph is the concept relationship between the two corresponding concepts, the direction between the two nodes is the direction indicated by the concept relationship between the two corresponding concepts, each node is configured with an in-degree and an out-degree, the in-degree of a node refers to the number of nodes pointing to that node, and the out-degree of a node refers to the number of nodes that the node points to;

the cutting unit is used for cutting any first node with the in-degree equal to 0 and the out-degree greater than 0;

a setting unit, configured to, when all nodes with an in-degree equal to 0 and an out-degree greater than 0 in the semantic relation graph have been cut, set to 0 the in-degree of every node pointed to by those nodes;

the cutting unit is further configured to return to execute the step of cutting any first node for which the in-degree is equal to 0 and the out-degree is greater than 0 for other nodes except the cut node, until all nodes in the semantic relationship graph are traversed, and use the finally obtained semantic relationship graph as an ontology established for the at least one document.

Optionally, the clipping unit is specifically configured to:

Optionally, the clipping unit is further specifically configured to:

wherein the first formula is:

Optionally, the clipping unit is further specifically configured to:

Optionally, the establishing module further includes:

a configuration unit, configured to, if an isolated node with an in-degree equal to 0 and an out-degree equal to 0 exists in the semantic relationship graph, configure the isolated node to point to a node with an in-degree equal to 0 and an out-degree greater than 0 in the semantic relationship graph, to obtain an updated semantic relationship graph;

and the cutting unit is also used for executing the operation of cutting any first node with the in-degree equal to 0 and the out-degree greater than 0 based on the updated semantic relationship graph.

Optionally, the obtaining module includes:

the word segmentation processing unit is used for carrying out word segmentation processing on each document in the at least one document to obtain a plurality of words;

a first determining unit configured to determine a part-of-speech of each of the plurality of words;

a second determining unit, configured to determine the multiple tuples according to a part of speech of each of the multiple words and a document position of each of the multiple words in the at least one document.

In a third aspect, an ontology creating apparatus is provided, the apparatus comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the steps of any of the methods of the first aspect described above.

In a fourth aspect, a computer-readable storage medium is provided, having instructions stored thereon, which when executed by a processor, implement the steps of any of the methods of the first aspect described above.

In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of any of the methods of the first aspect described above.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

In the embodiments of the invention, for at least one document used for establishing the ontology, a plurality of tuples are determined according to the at least one document, a plurality of semantic relation data sets are determined according to the plurality of tuples, and then the ontology for the at least one document is established according to the plurality of semantic relation data sets. No manual participation is needed in the whole ontology establishing process, which improves the efficiency of establishing the ontology.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a flowchart of an ontology establishing method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for querying concepts and concept relationships according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a conceptual graph provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of an ontology establishing apparatus according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Fig. 1 is a flowchart of an ontology creating method according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:

Step 101: obtaining at least one document used for establishing an ontology, and determining a plurality of tuples according to the at least one document, wherein each tuple comprises two first-class words and one second-class word, the first-class words are words used for describing the attributes of an object itself, and the second-class words are words used for indicating the association relationship between different objects.

In the embodiment of the invention, to ensure that the established ontology describes the content of the at least one document as fully as possible, a plurality of tuples are determined according to the at least one document when the ontology is established. Each tuple comprises two first-class words and one second-class word. First-class words are words used for describing the attributes of an object itself, such as nouns. Second-class words are words used for indicating the association relationship between different objects, such as verbs. The words in each tuple are words from the at least one document, so the subsequently established ontology is derived from the information in the at least one document, which improves the accuracy of the established ontology.

The ontology establishing method provided by the embodiment of the present invention may be executed by a terminal or a server, and the embodiment of the present invention is not specifically limited herein. Therefore, at least one document for establishing the ontology may be input into the terminal by the administrator in advance, or uploaded to the server by the administrator through the terminal in advance.

In addition, in one possible implementation, determining the plurality of tuples according to the at least one document may specifically be: performing word segmentation on each of the at least one document to obtain a plurality of words; determining the part of speech of each of the plurality of words; and determining the plurality of tuples according to the part of speech of each of the plurality of words and the document position of each word in the at least one document.

The word segmentation processing on at least one document may adopt a forward maximum matching word segmentation method based on character string matching, and certainly may also adopt other word segmentation methods.
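Forward maximum matching can be sketched in a few lines of Python. The vocabulary, the `max_len` limit and the unspaced English input below are hypothetical stand-ins chosen for readability; a real system would segment Chinese text against a full lexicon.

```python
def forward_max_match(text, vocabulary, max_len=8):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary.
VOCAB = {"umbrella", "has", "handle"}
segmented = forward_max_match("umbrellahashandle", VOCAB)
```

The single-character fallback guarantees progress when no vocabulary entry matches at the current position.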

In addition, since a word may have different parts of speech, the plurality of words may be divided according to the position of each word in the document to obtain a plurality of word sequences; for example, all the words in one passage of text may form one word sequence. In this case, the part of speech of each of the plurality of words may be determined as follows: for any word sequence, the candidate parts of speech of the words in the sequence are combined to obtain a plurality of part-of-speech sequences for the word sequence, where each part-of-speech sequence contains one part of speech for each word in the word sequence. An evaluation function is then used to determine the probability of each of the part-of-speech sequences, the part-of-speech sequence with the highest probability is selected, and each part of speech in that sequence is taken as the part of speech of the corresponding word.
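The selection of the most probable part-of-speech sequence can be sketched as below. The lexicon `POS_OPTIONS`, the transition table `TRANSITION` and the product-of-transitions `score` function are hypothetical; the source only states that an evaluation function assigns a probability to each candidate sequence.

```python
from itertools import product

# Hypothetical lexicon: the parts of speech each word may take ("n" noun, "v" verb).
POS_OPTIONS = {"umbrella": ["n"], "has": ["v"], "handle": ["n", "v"]}
# Hypothetical probabilities of one tag following another.
TRANSITION = {("n", "v"): 0.6, ("v", "n"): 0.7, ("n", "n"): 0.3, ("v", "v"): 0.1}


def score(tags):
    """Toy evaluation function: product of adjacent-tag transition probabilities."""
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):
        p *= TRANSITION.get((prev, cur), 0.01)
    return p


def best_pos_sequence(words):
    """Enumerate every combination of candidate tags and keep the most probable."""
    candidates = product(*(POS_OPTIONS[w] for w in words))
    return max(candidates, key=score)


tags = best_pos_sequence(["umbrella", "has", "handle"])
```

Exhaustive enumeration grows exponentially with sequence length, so it is only suitable for short sequences; dynamic programming such as the Viterbi algorithm is the usual alternative when the score factorizes over adjacent tags.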

In addition, since the plurality of words obtained by word segmentation are arranged in the order of their positions in the document, determining the plurality of tuples according to the part of speech of each of the plurality of words and the document position of each word in the at least one document may be implemented as follows: for any first-class word, searching the plurality of words for a second-class word that immediately follows the first-class word, then for a first-class word that immediately follows that second-class word, and combining the three words to obtain a tuple.
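The tuple extraction just described amounts to sliding a three-word window over the tagged words. In the sketch below, the tags "n" and "v" standing for first-class and second-class words, and the function name `extract_tuples`, are assumptions made for illustration.

```python
def extract_tuples(tagged):
    """tagged: list of (word, pos) pairs in document order.
    Returns every (noun, verb, noun) run of three adjacent words."""
    triples = []
    for i in range(len(tagged) - 2):
        (w1, p1), (w2, p2), (w3, p3) = tagged[i:i + 3]
        if p1 == "n" and p2 == "v" and p3 == "n":
            triples.append((w1, w2, w3))
    return triples


tagged = [("umbrella", "n"), ("has", "v"), ("handle", "n"),
          ("protects", "v"), ("hand", "n")]
triples = extract_tuples(tagged)
```

Note that "handle" appears in two tuples here, once as the trailing noun and once as the leading noun, because every first-class word is considered as a possible start.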

Optionally, since at least one document may have words of other parts of speech besides the words of the first type and the words of the second type, such as adverbs or prepositions, and the like, and the degree of contribution of the words of other parts of speech to the establishment of the ontology is not large, before performing the word segmentation process on each document in the at least one document, a useless word filtering operation may be performed on at least one document to filter out useless words in the at least one document.

In addition, in the embodiment of the present invention, since the number of documents used for building the ontology is usually large, in order to improve the efficiency of building the ontology, step 101 may be performed by a computing model based on a distributed computing framework. Specifically, at least one document is divided into different partitions, and the step 101 can be executed in parallel among the partitions, so that the speed of establishing the ontology is improved. The dividing at least one document into different partitions may refer to dividing each document into one partition, or may refer to dividing adjacent paragraphs into one partition, or dividing one paragraph into one partition, which is not specifically limited in this embodiment of the present invention.
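The partitioned execution can be illustrated as follows. The source names only "a distributed computing framework"; the standard-library thread pool used here is a stand-in for it, and `process_partition` is a placeholder that merely splits on whitespace rather than performing the real work of step 101.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(documents):
    """One partition per paragraph, one of the granularities the text allows."""
    return [p for doc in documents for p in doc.split("\n\n") if p.strip()]


def process_partition(paragraph):
    """Placeholder for step 101 on a single partition."""
    return paragraph.split()


def run_step_101(documents, workers=4):
    """Process all partitions in parallel; results keep partition order."""
    partitions = split_into_partitions(documents)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_partition, partitions))


results = run_step_101(["umbrella has handle\n\nhandle is part",
                        "venue hosts games"])
```

Because each partition is processed independently, later per-partition state (such as the synonym counts used in step 102) stays local to the partition.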

When at least one document is divided into different partitions, the following step 102 is executed in partition units, that is, each partition executes step 102 in parallel according to the document divided into itself.

Step 102: determining two concepts corresponding to the two first-class terms in each of the plurality of tuples and a concept relationship corresponding to the second-class term in each tuple, replacing the two first-class terms with the two determined concepts respectively, and replacing the second-class term with the determined concept relationship, to obtain a plurality of semantic relation data sets.

Since a large number of databases for characterizing concepts already exist, step 102 may be performed based on a reference database in the embodiments of the present invention. That is, the two concepts corresponding to the two first-class terms in each tuple and the concept relationship corresponding to the second-class term in each tuple are determined based on the reference database. Because this determination is implemented in essentially the same way for every tuple, the following takes any tuple A of the plurality of tuples as an example to describe how to determine the two concepts corresponding to the two first-class terms in the tuple and the concept relationship corresponding to its second-class term. Specifically, an embodiment of the present invention provides a method for querying concepts and concept relationships; as shown in Fig. 2, the method includes the following two steps.

Step 1021: searching the reference database for two concepts respectively corresponding to the two first-class terms in the tuple A, wherein the reference database is used for describing the concept represented by each of a plurality of terms and the concept relationships between different concepts.

The reference database may be the HowNet database. For convenience of description, the HowNet database is briefly introduced here: HowNet is a database that takes the concepts represented by Chinese and English words as its description objects and reveals the relationships between concepts and between the attributes of concepts. HowNet describes concepts and their attributes in a nested structure using KDML (knowledge system description language). That is, a complex concept is explained by simpler concepts, and those are in turn explained by still simpler concepts, until the concept can be expressed in sememes, the most basic units capable of expressing meaning. This structure is an implicit graph structure and is called a conceptual graph.

Fig. 3 is a schematic diagram of a conceptual graph provided by an embodiment of the invention. As shown in Fig. 3, the description of the concept "venue" in the HowNet database may be: NO. (serial number) 129348, W_C (concept name) "venue", DEF (definition) {facilities|facility, domain={sports|sports}, location={match|game, exercise}}. In KDML, this description of "venue" means: a venue is a facility whose domain is sports, and it is the location of competitions and the location of exercise. That is, a venue is a place for sporting events and fitness exercises.

Based on the above description of the reference database, a concept in the reference database may correspond to one word in the at least one document, or to several words in the at least one document. For example, the words "bowl", "occupation" and "errand" have similar meanings and represent the same concept. Therefore, searching the reference database for the two concepts respectively corresponding to the two first-class words in the tuple A may be implemented as follows: for any first-class word B of the two first-class words in the tuple A, determining a word in the at least one document that has the same word sense as the first-class word B; and if no word with the same word sense as the first-class word B exists in the at least one document, searching the reference database for the concept corresponding to the first-class word B.

If words with the same word sense as the first-class word B exist in the at least one document, determining the number of occurrences in the at least one document of the first-class word B and of each word with the same word sense as the first-class word B; determining the word with the largest number of occurrences among the first-class word B and the words with the same word sense as the first-class word B; and searching the reference database for the concept corresponding to that word, and taking the found concept as the concept corresponding to the first-class word B.
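The synonym handling above can be sketched in Python. Everything named here is hypothetical: the reference database is modeled as a plain word-to-concept dictionary, the synonym list is supplied by the caller, and `lookup_concept` returns `None` when no concept is found.

```python
from collections import Counter


def representative_word(word, synonyms, tokens):
    """Return the most frequent word among `word` and its synonyms in `tokens`.
    Falls back to `word` itself when none of them occurs."""
    counts = Counter(tokens)
    present = [w for w in [word, *synonyms] if counts[w] > 0]
    if not present:
        return word
    return max(present, key=lambda w: counts[w])


def lookup_concept(word, synonyms, tokens, reference_db):
    """Look up the concept of the representative word in a word -> concept map."""
    return reference_db.get(representative_word(word, synonyms, tokens))


tokens = ["job", "occupation", "job", "errand"]
reference_db = {"job": "Occupation"}  # hypothetical reference database content
concept = lookup_concept("occupation", ["job", "errand"], tokens, reference_db)
```

Here "job" occurs most often, so its concept is used for "occupation" even though "occupation" itself has no entry in the toy database.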

When at least one document is divided into different partitions, since each partition executes step 102 according to the document divided into itself in parallel, in the above implementation, the term having the same meaning as the term B of the first-class term means: and the words with the same word senses as the first type words B in the subarea where the first type words B are located.

In addition, when the concept corresponding to the first-class word B is searched from the reference database in the above manner, there may be a case where the concept is not found, and at this time, a concept is created for the first-class word B by the LDA algorithm. The LDA algorithm is an important model for text semantic analysis.

When the at least one document is divided into different partitions, each partition executes step 102 in parallel on the documents assigned to it. In this case, each partition is configured with an LDA list, and for any partition, the LDA list is used to store the words in that partition for which no corresponding concept is found in the reference database. Therefore, creating a concept for the first-class word B by the LDA algorithm may specifically be implemented as follows: performing concept modeling on all words in the LDA list of the partition where the first-class word B is located to obtain a concept tree for the LDA list, adding the concept tree to the reference database, and then searching the reference database for the concept corresponding to the first-class word B, which is equivalent to creating a concept for the first-class word B.

Step 1022: according to the two found concepts, searching the reference database for the concept relationship corresponding to the second-class word in tuple A.

As can be seen from fig. 3, there may be more than one path between two different concepts in the reference database, and there may be more than one concept relationship on each path. Therefore, step 1022 may specifically be implemented as follows: determining the paths between the two found concepts from the reference database to obtain a plurality of paths; selecting a target path from the plurality of paths according to the path lengths of the plurality of paths, wherein the target path comprises at least one concept relationship; and selecting, from the at least one concept relationship, the concept relationship with the greatest similarity to the second-class word in tuple A, and determining the selected concept relationship as the concept relationship corresponding to the second-class word in tuple A.

The target path may be selected from the plurality of paths according to their path lengths as follows: taking the path with the largest path length among the plurality of paths as the target path. Of course, the target path may also be determined according to other principles in the embodiment of the present invention, which is not specifically limited herein.
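As a non-limiting illustration, step 1022 can be sketched as follows. The adjacency-list `graph`, the function names `simple_paths` and `relation_for`, and the sample concepts are hypothetical. The embodiment does not specify how the similarity between a concept relationship and the second-class word is measured; the character-level ratio from Python's `difflib.SequenceMatcher` is used here only as a stand-in, and paths are assumed to follow the direction of the concept relationships.

```python
from difflib import SequenceMatcher

def simple_paths(graph, start, goal, visited=None):
    """Yield each cycle-free directed path as a list of (src, relation, dst) edges."""
    visited = (visited or set()) | {start}
    for relation, nxt in graph.get(start, []):
        if nxt == goal:
            yield [(start, relation, nxt)]
        elif nxt not in visited:
            for rest in simple_paths(graph, nxt, goal, visited):
                yield [(start, relation, nxt)] + rest

def relation_for(graph, concept_a, concept_b, second_class_word):
    """Pick the longest path, then the relation on it most similar to the word."""
    paths = list(simple_paths(graph, concept_a, concept_b))
    if not paths:
        return None
    target = max(paths, key=len)  # path with the largest path length
    relations = [rel for _, rel, _ in target]
    return max(
        relations,
        key=lambda r: SequenceMatcher(None, r, second_class_word).ratio(),
    )

# Hypothetical reference database: node -> list of (relation, neighbour).
graph = {
    "umbrella": [("has part", "handle"), ("has", "shaft")],
    "shaft": [("connects to", "handle")],
}
chosen = relation_for(graph, "umbrella", "handle", "connects")
```

Here the two-edge path through "shaft" is the target path, and "connects to" is the relation on it that best matches the word "connects".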

In addition, some common verbs in the document, such as the verb "is", contribute little to establishing the ontology. Therefore, before the paths between the two found concepts are determined from the reference database, a TF-IDF (term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining) value of the second-class word in tuple A may be determined, and whether the second-class word is a common verb may be determined according to the TF-IDF value. If so, the common verb is filtered out: the step of "determining the paths between the two found concepts from the reference database" is not performed, and the concept relationship corresponding to the second-class word in tuple A is set to a null value. If it is determined according to the TF-IDF value that the second-class word is not a common verb, the step of "determining the paths between the two found concepts from the reference database" is performed to determine the concept relationship corresponding to the second-class word in tuple A from the reference database.
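As a non-limiting illustration, the common-verb filter can be sketched as follows. The embodiment only states that the decision is made according to the TF-IDF value; the specific TF-IDF variant, the rule that a value below a threshold marks a common verb, the threshold of 0.05, and the names `tf_idf` and `is_common_verb` are all assumptions made for this example.

```python
import math

def tf_idf(word, doc, docs):
    """Plain TF-IDF: term frequency in `doc` times log(N / document frequency)."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

def is_common_verb(verb, doc, docs, threshold=0.05):
    """Assumed rule: a verb whose TF-IDF falls below the threshold is common."""
    return tf_idf(verb, doc, docs) < threshold

# Each document is a list of already segmented words.
docs = [
    ["umbrella", "is", "tool"],
    ["handle", "is", "part"],
    ["handle", "supports", "umbrella", "and", "is", "light"],
]
common = is_common_verb("is", docs[2], docs)
specific = is_common_verb("supports", docs[2], docs)
```

A verb that appears in every document has an inverse document frequency of zero and is therefore filtered, whereas a verb confined to one document is kept.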

Through the above steps 101 and 102, a plurality of semantic relation data sets can be obtained, and since each semantic relation data set includes two concepts and a concept relation between the two concepts, an ontology for the at least one document can be established through the following step 103.

Step 103: establishing an ontology for the at least one document according to the plurality of semantic relation data sets.

In the embodiment of the present invention, since each semantic relation data set includes two concepts and the concept relation between the two concepts, the mesh structure data indicated by the plurality of semantic relation data sets can be used directly as the ontology for the at least one document.

Alternatively, when the above steps 101 and 102 are performed in the partition mode, the same concept may exist in different partitions. As a result, the mesh structure data indicated by the plurality of semantic relation data sets is relatively bulky, which makes it harder for other users to query information from it. Therefore, in the embodiment of the present invention, the mesh structure data indicated by the plurality of semantic relation data sets may be further cut to improve the generality of the established ontology.

The mesh structure data indicated by the plurality of semantic relation data sets may be cut in the following manner: establishing a semantic relation graph according to the plurality of semantic relation data sets, wherein one node in the semantic relation graph corresponds to one concept in the plurality of semantic relation data sets, the relation between two nodes in the semantic relation graph is the concept relation between the two corresponding concepts, the direction between the two nodes is the direction indicated by the concept relation between the two corresponding concepts, each node is configured with an in-degree and an out-degree, the in-degree of a node refers to the number of nodes pointing to the node, and the out-degree of a node refers to the number of nodes to which the node points; for any first node with an in-degree equal to 0 and an out-degree greater than 0, cutting the first node; after all nodes with an in-degree equal to 0 and an out-degree greater than 0 in the semantic relation graph have been cut, setting the in-degree of the nodes pointed to by those nodes to 0; and, for the nodes other than the cut nodes, returning to the step of cutting any first node with an in-degree equal to 0 and an out-degree greater than 0, until all nodes in the semantic relation graph have been traversed, and taking the finally obtained semantic relation graph as the ontology established for the at least one document.

Establishing the semantic relation graph according to the plurality of semantic relation data sets means associating the concepts in the plurality of semantic relation data sets with one another according to their concept relations; the resulting data system is called the semantic relation graph.

In addition, when the semantic relation graph is cut, a breadth-first mode is adopted to traverse all nodes in the semantic relation graph. The breadth-first means that the nodes with the in-degree equal to 0 and the out-degree greater than 0 are processed first, and then the nodes directly pointed by the nodes with the in-degree equal to 0 and the out-degree greater than 0 are processed, and the specific process is implemented as described above. Of course, nodes in the semantic relationship graph may also be traversed in other traversal manners, such as a depth-first traversal manner, and the embodiment of the present invention is not limited in detail here.
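As a non-limiting illustration, the layer-by-layer traversal described above can be sketched as follows. The function name `traverse_for_cutting`, the edge-list representation and the `cut` callback are hypothetical; the callback stands in for the cutting operation on a first node, whose details are not reproduced in this sketch.

```python
def traverse_for_cutting(edges, cut):
    """Visit nodes breadth-first from the in-degree-0 nodes, calling cut(node).

    `edges` is a list of (src, dst) pairs; the visiting order is returned.
    """
    out_edges, in_degree = {}, {}
    for src, dst in edges:
        out_edges.setdefault(src, []).append(dst)
        out_edges.setdefault(dst, [])
        in_degree[dst] = in_degree.get(dst, 0) + 1
        in_degree.setdefault(src, 0)

    # First nodes: in-degree equal to 0 and out-degree greater than 0.
    frontier = [n for n in out_edges if in_degree[n] == 0 and out_edges[n]]
    visited, order = set(), []
    while frontier:
        for node in frontier:
            if node not in visited:
                visited.add(node)
                cut(node)
                order.append(node)
        # Once the whole layer is cut, the nodes it points to get in-degree 0
        # and those that still have outgoing edges form the next layer.
        next_frontier = []
        for node in frontier:
            for child in out_edges[node]:
                in_degree[child] = 0
                if (child not in visited and out_edges[child]
                        and child not in next_frontier):
                    next_frontier.append(child)
        frontier = next_frontier
    return order

cut_log = []
edges = [("umbrella", "appliance"), ("handle", "umbrella"), ("rib", "umbrella")]
order = traverse_for_cutting(edges, cut_log.append)
```

In this example "handle" and "rib" form the first layer and "umbrella" the second; "appliance" has no outgoing relation and is therefore never treated as a first node.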

In a possible implementation manner, cutting the first node may specifically be: determining at least one node in the semantic relation graph, wherein the document position, in the at least one document, of the word indicated by the concept corresponding to the at least one node is adjacent to the document position, in the at least one document, of the word indicated by the concept corresponding to the first node; and determining, among the at least one node, the node with the greatest connectivity with the first node, deleting the nodes of the at least one node other than the determined node, and deleting the relations between those other nodes and the first node.

When the at least one document is divided into different partitions, the at least one node is further limited to nodes for which the word indicated by the corresponding concept is in the same partition as the word indicated by the concept corresponding to the first node.

In addition, the connectivity refers to a parameter for characterizing a degree of association between two nodes, and in the embodiment of the present invention, an implementation manner of determining a node with the greatest connectivity with a first node in at least one node may specifically be: and determining the connectivity between each node in the at least one node and the first node based on the first formula, and determining the node with the maximum connectivity with the first node in the at least one node according to the connectivity determined according to the first formula.

Wherein the first formula is:

Joint(Wi, Wj) = α × Sim(Wi, Wj) + β × Rel(Wi, Wj)

where Wi and Wj are the two nodes whose connectivity is to be determined, Joint(Wi, Wj) represents the connectivity between the node Wi and the node Wj, Sim(Wi, Wj) is the similarity between the node Wi and the node Wj, Rel(Wi, Wj) is the correlation between the node Wi and the node Wj, α and β are the weighting coefficient configured for the similarity and the weighting coefficient configured for the correlation, respectively, and the sum of α and β is 1.

The similarity between the node Wi and the node Wj may be determined by the length of the path between the node Wi and the node Wj in the semantic relation graph, which is not described in detail herein. The correlation between the node Wi and the node Wj may be determined by the numbers of occurrences, in the at least one document, of the word corresponding to the node Wi and the word corresponding to the node Wj. When the at least one document is divided into different partitions, these numbers of occurrences are the numbers of occurrences of the two words within the same partition. Therefore, the method for determining the connectivity provided by the embodiment of the present invention combines the context of the words with the similarity relation contained in the reference database, so that the determined concept relation is more consistent with the relation between the two words.
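As a non-limiting illustration, and assuming that the first formula is the weighted sum of similarity and correlation implied by the definitions of α and β, the selection of the node with the greatest connectivity can be sketched as follows. The function names `joint` and `most_connected` and the sample values are hypothetical; the similarity and correlation values are taken as given inputs because the embodiment does not fix how they are computed.

```python
def joint(sim, rel, alpha):
    """Connectivity as alpha * Sim + beta * Rel with beta = 1 - alpha (assumed form)."""
    return alpha * sim + (1.0 - alpha) * rel

def most_connected(candidates, alpha=0.5):
    """Return (node, connectivity) for the candidate most connected to the first node.

    `candidates` maps a node name to its (similarity, correlation) pair
    measured against the first node.
    """
    scores = {n: joint(s, r, alpha) for n, (s, r) in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

candidates = {"handle": (0.8, 0.2), "rib": (0.4, 0.9)}
balanced_node, balanced_score = most_connected(candidates)
sim_heavy_node, _ = most_connected(candidates, alpha=0.9)
```

With equal weights the correlation-heavy candidate is chosen, whereas a larger α favours the candidate with the higher similarity, which shows why the choice of α and β matters.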

In the above implementation manner, after the connectivity between each node in the at least one node and the first node is determined, the node with the maximum connectivity may be directly determined. However, in practical applications, after determining the connectivity between each node in the at least one node and the first node, the maximum connectivity may not meet the specified requirement, so that the node with the maximum connectivity directly determined according to the above implementation may not represent the node with the maximum association with the first node.

Therefore, optionally, if the maximum connectivity among the connectivities determined according to the first formula is smaller than a connectivity threshold, the values of α and β are adjusted to obtain an updated first formula; the connectivity between each node in the at least one node and the first node is determined based on the updated first formula; and if the maximum connectivity among the connectivities determined according to the updated first formula is still smaller than the connectivity threshold, the step of adjusting the values of α and β is performed again, until the determined maximum connectivity is greater than or equal to the connectivity threshold, and the node corresponding to the maximum connectivity determined the last time is determined as the node with the greatest connectivity with the first node in the at least one node.
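As a non-limiting illustration, the threshold-driven adjustment of α and β can be sketched as follows. The embodiment does not state how the two coefficients are adjusted; this sketch assumes a weighted-sum form of the first formula and simply tries a fixed, hypothetical sequence of α values. It also stops when the sequence is exhausted, which is a safeguard added here so that the loop always terminates.

```python
def select_node(candidates, threshold, alphas=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Try successive alpha values until the best connectivity reaches the threshold.

    `candidates` maps a node name to (similarity, correlation) relative to the
    first node. Returns (node, alpha), or (None, None) if no alpha succeeds.
    """
    for alpha in alphas:
        beta = 1.0 - alpha  # the two coefficients always sum to 1
        scores = {n: alpha * s + beta * r for n, (s, r) in candidates.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            return best, alpha
    return None, None

candidates = {"handle": (0.9, 0.1), "rib": (0.5, 0.6)}
result = select_node(candidates, threshold=0.7)
unreachable = select_node(candidates, threshold=2.0)
```

In this example the threshold is first met when α reaches 0.8, at which point "handle" is selected.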

In addition, since the concept relationship corresponding to the second-class word in tuple A may be null in step 102, there may be isolated nodes in the semantic relation graph with an in-degree equal to 0 and an out-degree equal to 0. Such isolated nodes can first be merged into other nodes, and the semantic relation graph is cut afterwards. That is, before any first node with an in-degree equal to 0 and an out-degree greater than 0 is cut, the following operations may be performed: if an isolated node with an in-degree equal to 0 and an out-degree equal to 0 exists in the semantic relation graph, configuring the isolated node to point to a node with an in-degree equal to 0 and an out-degree greater than 0 in the semantic relation graph to obtain an updated semantic relation graph; and performing, based on the updated semantic relation graph, the operation of cutting any first node with an in-degree equal to 0 and an out-degree greater than 0. This merging ensures that each concept in the ontology established in step 103 has a relationship with other concepts.
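As a non-limiting illustration, the merging of isolated nodes can be sketched as follows. The function name `attach_isolated` and the sample graph are hypothetical, and the embodiment does not say which node with an in-degree of 0 and an out-degree greater than 0 an isolated node should point to; the first such node in list order is chosen here purely as an assumption.

```python
def attach_isolated(nodes, edges):
    """Return a new edge list in which every isolated node points to a source node.

    Degrees are computed on the original graph, so all isolated nodes are
    attached to a node that originally had in-degree 0 and out-degree > 0.
    """
    in_degree = {n: 0 for n in nodes}
    out_degree = {n: 0 for n in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1

    sources = [n for n in nodes if in_degree[n] == 0 and out_degree[n] > 0]
    isolated = [n for n in nodes if in_degree[n] == 0 and out_degree[n] == 0]
    updated = list(edges)
    if sources:
        for node in isolated:
            updated.append((node, sources[0]))  # assumed choice of target
    return updated

nodes = ["handle", "umbrella", "appliance", "rain"]
edges = [("handle", "umbrella"), ("umbrella", "appliance")]
updated_edges = attach_isolated(nodes, edges)
```

After the update, the formerly isolated node has an out-degree greater than 0 and is handled by the cutting step like any other first node.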

Fig. 4 is a schematic diagram of an ontology establishing apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus 400 includes:

an obtaining module 401, configured to obtain at least one document used for establishing an ontology, and determine multiple tuple groups according to the at least one document, where each tuple group includes two first-class terms and one second-class term, the first-class terms are terms used for describing attributes of an object, and the second-class terms are terms used for indicating an association relationship between different objects;

a determining module 402, configured to determine two concepts corresponding to two first-class terms in each of the multiple tuples and a concept relationship corresponding to a second-class term in each of the multiple tuples, replace the two corresponding first-class terms with the determined two concepts, replace the corresponding second-class terms with the determined concept relationship, and obtain multiple semantic relationship data sets;

an establishing module 403, configured to establish an ontology for at least one document according to the plurality of semantic relationship data sets.

Optionally, the determining module 402 includes:

a first search unit, configured to, for any tuple A of the plurality of tuples, search a reference database for the two concepts respectively corresponding to the two first-class words in tuple A, wherein the reference database is used for describing the concept represented by each of a plurality of words and the concept relationships between different concepts;

and a second search unit, configured to search the reference database, according to the two found concepts, for the concept relationship corresponding to the second-class word in tuple A.

Optionally, the first search unit is specifically configured to:

determining a word with the same word meaning as the first-class word B in at least one document for any first-class word B in two first-class words in the tuple A;

and if the words with the same word senses as the first type words B do not exist in at least one document, searching the concept corresponding to the first type words B from the reference database.

Optionally, the first search unit is further specifically configured to:

if a word with the same word sense as the first-class word B exists in the at least one document, determining the number of occurrences in the at least one document of the first-class word B and of each word with the same word sense as the first-class word B;

and searching the reference database for the concept corresponding to the word with the largest number of occurrences, and taking the found concept as the concept corresponding to the first-class word B.

Optionally, the apparatus 400 further comprises:

Optionally, the second search unit is specifically configured to:

selecting a target path from the plurality of paths according to the path lengths of the plurality of paths, wherein the target path comprises at least one concept relationship;

and selecting, from the at least one concept relationship, the concept relationship with the greatest similarity to the second-class word in tuple A, and determining the selected concept relationship as the concept relationship corresponding to the second-class word in tuple A.

Optionally, the establishing module 403 includes:

an establishing unit, configured to establish a semantic relation graph according to the plurality of semantic relation data sets, wherein one node in the semantic relation graph corresponds to one concept in the plurality of semantic relation data sets, the relation between two nodes in the semantic relation graph is the concept relation between the two corresponding concepts, the direction between the two nodes is the direction indicated by the concept relation between the two corresponding concepts, each node is configured with an in-degree and an out-degree, the in-degree of a node refers to the number of nodes pointing to the node, and the out-degree of a node refers to the number of nodes to which the node points;

and the cutting unit is also used for returning and executing the step of cutting any first node with the in-degree equal to 0 and the out-degree greater than 0 for other nodes except the cut nodes until all nodes in the semantic relation graph are traversed, and taking the finally obtained semantic relation graph as an ontology established for at least one document.

Optionally, the clipping unit is specifically configured to:

determining at least one node in the semantic relation graph, wherein the document position of the word indicated by the concept corresponding to the at least one node in the at least one document is adjacent to the document position of the word indicated by the concept corresponding to the first node in the at least one document;

and determining, among the at least one node, the node with the greatest connectivity with the first node, deleting the nodes of the at least one node other than the determined node, and deleting the relations between those other nodes and the first node.

Optionally, the clipping unit is further specifically configured to:

wherein the first formula is:

Joint(Wi, Wj) = α × Sim(Wi, Wj) + β × Rel(Wi, Wj)

where Wi and Wj are the two nodes whose connectivity is to be determined, Joint(Wi, Wj) is the connectivity between the node Wi and the node Wj, Sim(Wi, Wj) is the similarity between the node Wi and the node Wj, Rel(Wi, Wj) is the correlation between the node Wi and the node Wj, α and β are the weighting coefficient configured for the similarity and the weighting coefficient configured for the correlation, respectively, and the sum of α and β is 1;

Optionally, the clipping unit is further specifically configured to:

if the maximum connectivity in the connectivity determined according to the first formula is smaller than the connectivity threshold, adjusting the values of alpha and beta to obtain the updated first formula;

and if the maximum connectivity in the connectivity determined according to the updated first formula is smaller than the connectivity threshold, returning to the step of adjusting the values of the alpha and the beta until the determined maximum connectivity is greater than or equal to the connectivity threshold, and determining the node corresponding to the maximum connectivity determined at the last time as the node with the maximum connectivity with the first node in the at least one node.

Optionally, the establishing module further includes:

and the cutting unit is also used for executing the operation of cutting any first node with the in-degree equal to 0 and the out-degree greater than 0 based on the updated semantic relation graph.

Optionally, the obtaining module 401 includes:

the word segmentation processing unit is used for carrying out word segmentation processing on each document in at least one document to obtain a plurality of words;

a first determining unit configured to determine a part-of-speech of each of a plurality of words;

and the second determining unit is used for determining a plurality of multi-tuple according to the part of speech of each word in the plurality of words and the document position of each word in at least one document.
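As a non-limiting illustration, the work of the second determining unit can be sketched as follows, taking as input words that have already been segmented and tagged with a part of speech. The function name `extract_tuples`, the tag values "n" and "v", and the rule of pairing each verb with the nearest noun on either side are assumptions made for this example; the embodiment only states that tuples are determined from the part of speech and the document position of each word.

```python
def extract_tuples(tagged):
    """Build (first-class word, second-class word, first-class word) triples.

    `tagged` is a list of (word, part_of_speech) pairs in document order.
    Nouns ("n") are treated as first-class words and verbs ("v") as
    second-class words, which is an assumption of this sketch.
    """
    tuples = []
    for i, (word, pos) in enumerate(tagged):
        if pos != "v":
            continue
        # Nearest noun before and after the verb, by document position.
        left = next((w for w, p in reversed(tagged[:i]) if p == "n"), None)
        right = next((w for w, p in tagged[i + 1:] if p == "n"), None)
        if left is not None and right is not None:
            tuples.append((left, word, right))
    return tuples

tagged = [
    ("umbrella", "n"), ("has", "v"), ("a", "x"), ("handle", "n"),
    ("and", "c"), ("blocks", "v"), ("rain", "n"),
]
triples = extract_tuples(tagged)
```

Each resulting triple corresponds to one tuple that is subsequently mapped to two concepts and one concept relationship.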

It should be noted that: when the ontology establishing apparatus provided in the above embodiment establishes an ontology, the division into the above functional modules is merely used as an example. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the embodiment of the ontology establishing apparatus and the embodiment of the ontology establishing method provided above belong to the same concept; the specific implementation process is detailed in the method embodiments and is not described herein again.

Fig. 5 is a block diagram of a terminal 500 according to an embodiment of the present invention. The terminal 500 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 500 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the terminal 500 includes: a processor 501 and a memory 502.

The processor 501 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.

Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the ontology-building method provided by method embodiments of the present invention.

In some embodiments, the terminal 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, the memory 502 and the peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, a signal line, or a circuit board. Specifically, the peripheral includes at least one of: a radio frequency circuit 504, a display screen 505, a camera assembly 506, an audio circuit 507, a positioning component 508, and a power supply 509.

The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 504 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, the display screen 505 also has the ability to capture touch signals on or over the surface of the display screen 505. The touch signal may be input to the processor 501 as a control signal for processing. In this case, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, disposed on the front panel of the terminal 500; in other embodiments, there may be at least two display screens 505, respectively disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display screen disposed on a curved surface or on a folded surface of the terminal 500. The display screen 505 may even be arranged in a non-rectangular irregular shape, that is, a shaped screen. The display screen 505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 506 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

Audio circuitry 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 501 for processing, or inputting the electric signals to the radio frequency circuit 504 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 500. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 507 may also include a headphone jack.

The positioning component 508 is used to determine the current geographic location of the terminal 500 for navigation or LBS (Location Based Service). The positioning component 508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

Power supply 509 is used to power the various components in terminal 500. The power source 509 may be alternating current, direct current, disposable or rechargeable. When power supply 509 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, terminal 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.

The acceleration sensor 511 may detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 500. For example, the acceleration sensor 511 may be used to detect the components of the gravitational acceleration on the three coordinate axes. The processor 501 may control the touch display screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used to collect motion data of a game or a user.

The gyro sensor 512 may detect a body direction and a rotation angle of the terminal 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user on the terminal 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 513 may be disposed on a side bezel of the terminal 500 and/or an underlying layer of the touch display screen 505. When the pressure sensor 513 is disposed on the side frame of the terminal 500, a user's holding signal of the terminal 500 may be detected, and the processor 501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the touch display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 514 is used for collecting a fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 514 may be provided on the front, back, or side of the terminal 500. When a physical button or a vendor logo is provided on the terminal 500, the fingerprint sensor 514 may be integrated with the physical button or the vendor logo.

The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch display screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 505 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 505 is turned down. In another embodiment, processor 501 may also dynamically adjust the shooting parameters of camera head assembly 506 based on the ambient light intensity collected by optical sensor 515.
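The brightness adjustment described above can be sketched as a clamped linear mapping from ambient light intensity to a display brightness level. The function name `display_brightness`, the default limits, and the choice of a linear curve are all illustrative assumptions; the embodiment only states that brightness rises with high ambient light and falls with low ambient light.

```python
def display_brightness(lux: float,
                       min_level: float = 0.1,
                       max_level: float = 1.0,
                       full_lux: float = 1000.0) -> float:
    """Map ambient light (lux) to a brightness level in [min_level, max_level].

    Assumptions: a linear curve, and full brightness at `full_lux`.
    These values are hypothetical and are not taken from the embodiment.
    """
    # Clamp the reading to [0, full_lux], then scale it to a 0..1 ratio.
    ratio = min(max(lux, 0.0), full_lux) / full_lux
    return min_level + (max_level - min_level) * ratio


if __name__ == "__main__":
    for reading in (0.0, 50.0, 500.0, 2000.0):
        print(reading, display_brightness(reading))
```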

The proximity sensor 516, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 500. The proximity sensor 516 is used to collect the distance between the user and the front surface of the terminal 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 gradually decreases, the processor 501 controls the touch display screen 505 to switch from the screen-on state to the screen-off state; when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 gradually increases, the processor 501 controls the touch display screen 505 to switch from the screen-off state to the screen-on state.
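The screen-state switching driven by the proximity sensor can be sketched as a small state function. Here "gradually decreases" is interpreted as a strictly decreasing run of recent distance readings, which is an assumption, as are the function name `next_screen_state` and the state labels `"on"` and `"off"`.

```python
def next_screen_state(state: str, distances: list[float]) -> str:
    """Return the next screen state ('on' or 'off').

    `distances` holds recent proximity readings, oldest first.
    Assumption: "gradually decreasing/increasing" means every reading is
    strictly smaller/larger than the one before it.
    """
    if len(distances) < 2:
        return state  # not enough data to detect a trend
    pairs = list(zip(distances, distances[1:]))
    if state == "on" and all(later < earlier for earlier, later in pairs):
        return "off"  # user is approaching the front surface
    if state == "off" and all(later > earlier for earlier, later in pairs):
        return "on"   # user is moving away from the front surface
    return state


if __name__ == "__main__":
    print(next_screen_state("on", [8.0, 5.0, 2.0]))
    print(next_screen_state("off", [2.0, 5.0, 8.0]))
```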

Those skilled in the art will appreciate that the configuration shown in Fig. 5 does not limit the terminal 500, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.

The embodiment of the present application further provides a non-transitory computer-readable storage medium, and when instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to execute the ontology establishing method provided in the foregoing embodiment.

The embodiment of the present application further provides a computer program product containing instructions, which when run on a terminal, causes the terminal to execute the ontology establishing method provided by the above embodiment.

Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. The server may be a server in a background server cluster. Specifically:

The server 600 includes a Central Processing Unit (CPU) 601, a system memory 604 including a Random Access Memory (RAM) 602 and a Read Only Memory (ROM) 603, and a system bus 605 connecting the system memory 604 and the central processing unit 601. The server 600 also includes a basic input/output system (I/O system) 606, which facilitates the transfer of information between devices within the computer, and a mass storage device 607, which stores an operating system 613, application programs 614, and other program modules 615.

The basic input/output system 606 includes a display 608 for displaying information and an input device 609, such as a mouse or keyboard, for a user to input information. The display 608 and the input device 609 are both connected to the central processing unit 601 through an input/output controller 610 connected to the system bus 605. The basic input/output system 606 may also include the input/output controller 610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 610 may also provide output to a display screen, a printer, or another type of output device.

The mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and its associated computer-readable media provide non-volatile storage for the server 600. That is, mass storage device 607 may include a computer-readable medium (not shown), such as a hard disk or CD-ROM drive.

Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 604 and mass storage device 607 described above may be collectively referred to as memory.

According to various embodiments of the present application, the server 600 may also operate by connecting, through a network such as the Internet, to a remote computer on the network. That is, the server 600 may be connected to the network 612 through the network interface unit 611 connected to the system bus 605; the network interface unit 611 may also be used to connect to other types of networks or remote computer systems (not shown).

The memory further stores one or more programs configured to be executed by the CPU. The one or more programs include instructions for performing the ontology establishing method provided by the above embodiments.
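As an illustration of one step such a program might carry out, the sketch below replaces the two first-class terms of a tuple with concepts and the second-class term with a concept relationship, yielding a semantic relationship data set as recited in claim 1. The dictionaries `TERM_TO_CONCEPT` and `RELATIONS` are hypothetical stand-ins for the reference database, their entries are invented for the example, and the tuple layout (first-class term, second-class term, first-class term) is an assumption.

```python
from typing import Optional, Tuple

# Hypothetical stand-ins for the reference database; all entries are invented.
TERM_TO_CONCEPT = {"brolly": "umbrella", "umbrella": "umbrella", "grip": "handle"}
# Keyed by the two concepts found first, plus the second-class term.
RELATIONS = {("handle", "umbrella", "belongs to"): "part_of"}


def to_semantic_relation(
    tuple_: Tuple[str, str, str]
) -> Optional[Tuple[str, str, str]]:
    """Turn (first-class term, second-class term, first-class term) into
    (concept, concept relationship, concept); return None if a lookup fails."""
    term_a, relation_term, term_b = tuple_
    concept_a = TERM_TO_CONCEPT.get(term_a)
    concept_b = TERM_TO_CONCEPT.get(term_b)
    if concept_a is None or concept_b is None:
        return None  # the claims instead create a concept in this case
    relation = RELATIONS.get((concept_a, concept_b, relation_term))
    if relation is None:
        return None
    return (concept_a, relation, concept_b)


if __name__ == "__main__":
    print(to_semantic_relation(("grip", "belongs to", "brolly")))
```

The sketch simply gives up when a concept is missing, whereas claim 5 creates a new concept in that case; that branch is omitted here to keep the example short.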

Embodiments of the present invention further provide a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a server, the server is enabled to execute the ontology establishing method provided in the foregoing embodiments.

Embodiments of the present invention further provide a computer program product including instructions which, when run on a server, cause the server to execute the ontology establishing method provided in the foregoing embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An ontology establishing method, the method comprising:

determining two concepts corresponding to two first-class terms in each tuple in the multiple tuples and a concept relationship corresponding to a second-class term in each tuple, respectively replacing the two corresponding first-class terms with the determined two concepts, and replacing the corresponding second-class terms with the determined concept relationship to obtain multiple semantic relationship data sets;

2. The method of claim 1, wherein the determining two concepts corresponding to the two first-class terms in each of the plurality of tuples and the concept relationship corresponding to the second-class term in each tuple comprises:

3. The method of claim 2, wherein the searching the reference database for two concepts corresponding to the two first-class terms in the tuple A comprises:

4. The method of claim 3, wherein, after determining the words in the at least one document that have the same word sense as the first-class term B, the method further comprises:

5. The method of claim 3 or 4, further comprising:

and if the concept corresponding to the first-class term B is not found in the reference database, creating a concept for the first-class term B through a latent Dirichlet allocation (LDA) algorithm.

6. The method according to claim 2, wherein the step of continuing to search the reference database for the concept relationship corresponding to the second-class term in the tuple A according to the two found concepts comprises:

7. The method of claim 1, wherein the building an ontology for the at least one document from the plurality of semantic relationship data sets comprises:

8. The method of claim 7, wherein said cropping the first node comprises:

9. The method of claim 8, wherein said determining the node of said at least one node having the greatest degree of connection to said first node comprises:

wherein the first formula is:

10. The method of claim 9, wherein said determining the node of said at least one node having the greatest connectivity to said first node based on the connectivity determined according to said first formula comprises:

11. The method of claim 7, wherein, before clipping the first node for any first node having an in-degree equal to 0 and an out-degree greater than 0, the method further comprises:

12. The method of claim 1, wherein the determining a plurality of tuples from the at least one document comprises:

determining a part-of-speech of each of the plurality of words;

13. An ontology creating apparatus, the apparatus comprising:

the determining module is used for determining two concepts corresponding to two first-class terms in each tuple in the multiple tuples and a concept relationship corresponding to a second-class term in each tuple, replacing the two corresponding first-class terms with the two determined concepts, and replacing the corresponding second-class terms with the determined concept relationship to obtain multiple semantic relationship data sets;

14. The apparatus of claim 13, wherein the determining module comprises:

15. The apparatus of claim 14, wherein the first lookup unit is specifically configured to:

16. The apparatus of claim 15, wherein the first lookup unit is further specifically configured to:

17. The apparatus of claim 15 or 16, wherein the apparatus further comprises:

a creating unit, configured to create a concept for the first-class term B through a latent Dirichlet allocation (LDA) algorithm if the concept corresponding to the first-class term B is not found in the reference database.

18. The apparatus of claim 14, wherein the second lookup unit is specifically configured to:

19. The apparatus of claim 13, wherein the establishing module comprises:

20. The apparatus of claim 19, wherein the clipping unit is specifically configured to:

21. The apparatus of claim 20, wherein the clipping unit is further specifically configured to:

wherein the first formula is:

22. The apparatus of claim 21, wherein the clipping unit is further specifically configured to:

23. The apparatus of claim 19, wherein the establishing module further comprises:

24. The apparatus of claim 13, wherein the acquisition module comprises:

25. An ontology creating apparatus, the apparatus comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the steps of the method of any of the above claims 1 to 12.

26. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, carry out the steps of the method of any of the preceding claims 1 to 12.