WO2021040089A1 - Method for expanding ontology data in heterogenous topic document on basis of image similarity - Google Patents

Method for expanding ontology data in heterogenous topic document on basis of image similarity

Info

Publication number
WO2021040089A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
ontology
user
topic
words
Prior art date
Application number
PCT/KR2019/011054
Other languages
French (fr)
Korean (ko)
Inventor
이대희
이준성
백인호
Original Assignee
주식회사 테크플럭스
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 테크플럭스
Priority to PCT/KR2019/011054
Publication of WO2021040089A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Topic modeling is used to classify words by topic. When a user configures major words and their hierarchy within a specific topic, parse-tree information is produced, and words similar to the hierarchy of the major words configured by the user are extracted. The extracted words are used to build word embeddings, and the similarity between the major words selected by the user and the word embeddings is used to extract words with a high degree of relation. This series of processes is repeated to extend ontology (semantic-graph) information having hierarchy and relation information for the user's words of interest. The word-embedding step may be replaced by a parse-tree method, in which case the hierarchy of words extracted by parsing is used, with an SVD method applied, to expand the ontology information selected by the user. In addition, a region of interest of a representative image is selected from the expanded ontology information, and a document having similar image information is extracted from a heterogeneous topic document, thereby expanding the user ontology information.

Description

Method for expanding ontology data in heterogeneous topic documents on the basis of image similarity
The present invention relates to a method of expanding ontology information (an ontology or semantic graph) that represents the hierarchical structure and connection information of key words selected by a user, using topic modeling in the fields of natural language processing and image processing analysis.
Feature vectors can be extracted from text and image information so that each item of information can be classified by topic. For this topic modeling, singular value decomposition (SVD) or latent Dirichlet allocation (LDA) can be applied.
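As a rough illustration of this classification step, the sketch below fits an LDA topic model to a toy corpus with scikit-learn and reads off a document-topic distribution; the corpus, the number of topics, and the choice of library are assumptions made for illustration only and are not part of the specification.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the big-data document collection (assumption).
docs = [
    "ontology graph hierarchy relation between key words",
    "image feature similarity between representative drawings",
    "topic model classifies words and images into topics",
]

# Bag-of-words features, then LDA with a user-chosen number of topics.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # rows: documents, columns: topic proportions

print(doc_topic.round(2))
```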
Ontology information can then be constructed from the text and image information classified by topic modeling, using the relation and hierarchy information between the main items of information.
US 9,892,194 proposes determining a hierarchical structure by applying weights based on important paragraphs; US 9,449,051 determines the hierarchical structure by considering the frequency of word occurrence per topic; and US 10,216,829 calculates topic probabilities for groups of words.
The present invention is a method of expanding ontology (knowledge-graph, semantic-graph) information that has a high degree of relation to a user's words of interest and also satisfies hierarchy information, over a large collection of documents. It is also a method of topic modeling that uses both text and image information. The aim is to propose a method of expanding such ontology information so that it has high connectivity and satisfies hierarchy information for the user's words of interest and image regions of interest in a big-data document collection.
Words can be classified by topic by applying topic modeling to big-data document information. When a user configures key words and hierarchy information for a specific topic, the system builds parse-tree information and extracts words whose hierarchy is similar to that of the key words configured by the user. Word vector information (word embeddings) is extracted from these words, and words with high connectivity are then extracted using the similarity between the key words selected by the user and the word vectors. By repeating this series of processes, ontology (semantic-graph) information having hierarchy and connectivity information for the user's words of interest is expanded. The word-vector step may be replaced by a parse-tree method, in which case the ontology information selected by the user is expanded using the hierarchy information of the words extracted by parsing. In addition, user ontology information can be expanded by selecting a region of interest of a representative image from the expanded ontology information and extracting documents with similar image information from heterogeneous topic documents.
The relative importance of words is determined according to the user's main words of interest, and by reflecting the hierarchy and connection information between words defined by the user, the words of interest are effectively expanded together with words that share that hierarchy and have high connectivity, so that ontology information can be generated from the user's point of view. In addition, a region of interest of a representative image is selected from the expanded ontology information, and documents with similar image information are extracted from heterogeneous topic documents to further expand the user ontology information.
FIG. 1 is a conceptual diagram of the topic modeling of the present invention.
FIG. 2 shows an SVD matrix according to an embodiment of the present invention.
FIG. 3 shows a method of extracting words with high similarity using word vector information and expanding ontology information, reflecting the hierarchy information of the user's words of interest, according to an embodiment of the present invention.
FIG. 4 shows a method of expanding ontology information by applying the SVD method to parse-tree information and extracting words that are highly connected in the hierarchy information, reflecting the hierarchy information of the user's words of interest, according to an embodiment of the present invention.
FIG. 5 is an example of a parse tree according to an embodiment of the present invention.
FIG. 6 shows a method of expanding user ontology information across documents of different topics by using images of documents that contain the expanded ontology information, according to an embodiment of the present invention.
The terms used in this specification are briefly explained, and the present invention is then described in detail. The terms used in the present invention have been selected, where possible, from general terms currently in wide use, in consideration of their function in the present invention, but they may vary according to the intention of those skilled in the art, legal precedent, or the emergence of new technology. In certain cases there are also terms arbitrarily chosen by the applicant, in which case their meaning is described in detail in the corresponding part of the description. Therefore, the terms used in the present invention should be defined based on their meaning and on the overall content of the present invention, not simply on their names.
Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "unit" and "module" described in the specification mean a unit that processes at least one function or operation, which may be implemented as hardware, as software, or as a combination of hardware and software.
Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily practice the invention. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar reference numerals are attached to similar parts throughout the specification.
Hereinafter, the present invention is described in detail with reference to the accompanying drawings.
FIG. 1 is a conceptual diagram of the topic modeling of the present invention. Every word in each document is classified according to user-defined topics, so words related to several topics can be distributed within one document.
SVD (singular value decomposition) or LDA (latent Dirichlet allocation) is used as the topic classification method. SVD is one of the matrix decomposition methods; besides lowering computational cost, it deletes and compresses low-information data, so that deeper structure not apparent in the original data can be identified. FIG. 2 shows an SVD matrix according to an embodiment of the present invention. The equation below is the matrix form of the SVD.
$A = U S V^{T}$
where U and V are orthogonal matrices and S is a rectangular diagonal matrix.
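A small numerical sketch of this factorization with NumPy follows; the matrix A is a made-up example rather than data from the embodiment.

```python
import numpy as np

# Made-up term-document style matrix (assumption, for illustration only).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 3.0],
              [0.0, 2.0, 0.0]])

# Thin SVD: A = U @ S @ Vt, with U, V orthogonal and S diagonal.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
assert np.allclose(A, U @ S @ Vt)

# Keeping only the largest singular values gives a low-rank (truncated)
# approximation, which is what discards low-information structure.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))
```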
FIG. 5 is an example of a parse tree according to an embodiment of the present invention. The parse tree is obtained by analyzing the morphemes of a sentence through syntactic analysis, yielding the parts of speech of the words and the sentence constituents. Using the information in the parse tree, the hierarchy information and connectivity of the words in a sentence can therefore be analyzed.
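The specification does not name a parser, so the sketch below uses spaCy's dependency parse as one possible stand-in for the parse tree: each token's part of speech, its parent in the tree, and its depth give the kind of hierarchy and connectivity information described above.

```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The system expands ontology information by using image similarity.")

def depth(token):
    # Number of hops up to the root; the root is its own head in spaCy.
    d = 0
    while token.head is not token:
        token = token.head
        d += 1
    return d

for token in doc:
    print(f"{token.text:12s} pos={token.pos_:6s} dep={token.dep_:10s} "
          f"head={token.head.text:10s} depth={depth(token)}")
```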
FIG. 3 shows a method of extracting words with high similarity using word vector information and expanding ontology information, reflecting the hierarchy information of the user's words of interest, according to an embodiment of the present invention.
In step S310 of FIG. 3, the number of topic models is determined for topic modeling.
In step S320, a topic extracted by the topic modeling method is selected, and the words included in that topic are selected. At this point, the words to be used can be screened using the importance of each word or the probability of each word belonging to the topic. The flowchart of FIG. 3 is therefore performed for each topic.
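A minimal sketch of this screening step is shown below, assuming a topic-word weight matrix such as the components_ attribute of a fitted LDA model; the vocabulary, weights, and threshold are placeholders.

```python
import numpy as np

# Hypothetical topic-word weights (rows: topics, columns: vocabulary).
topic_word = np.array([[4.0, 0.5, 3.0, 2.5],
                       [1.0, 5.5, 0.5, 3.0]])
vocab = ["ontology", "image", "graph", "feature"]

threshold = 0.20   # user-set probability threshold (assumption)
for t, weights in enumerate(topic_word):
    probs = weights / weights.sum()   # probability of each word within the topic
    selected = [w for w, p in zip(vocab, probs) if p >= threshold]
    print(f"topic {t}: {selected}")
```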
Step S330 is the process in which the user selects words of interest; the user may select a single word or a group of words. In step S340, the user defines the hierarchy information (parse tree 1) and connection information of the main words of interest. The connection information can express a variety of semantic relationships between two words.
In step S340, each user may define the main words of interest and the hierarchy and connection information between them differently. This step therefore sets the search direction of the system by defining the user's words of interest and ontology information.
In step S350, sentences containing the user's words of interest are used to extract a parse tree B whose hierarchy information is similar to that of the parse tree A defined in step S340. In step S360, valid words are determined using the similarity of the hierarchy information of parse tree A and parse tree B and the word-distance information between the candidate words and the user's words of interest on the parse tree.
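One way to obtain the word-distance information used in step S360 is to treat the parse tree as a graph and measure path length between tokens; the sketch below does this with spaCy and NetworkX, both of which are assumptions rather than libraries named in the specification.

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Topic modeling classifies words and images into topics for ontology expansion.")

# Undirected graph over the dependency tree: path length = word distance on the tree.
G = nx.Graph()
for token in doc:
    for child in token.children:
        G.add_edge(token.i, child.i)

interest = next(t for t in doc if t.text.lower() == "ontology")
for token in doc:
    if token.is_alpha and token.i != interest.i:
        d = nx.shortest_path_length(G, interest.i, token.i)
        print(f"{token.text:12s} distance={d}")
```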
Step S370 generates word vector information (word2vec, word embeddings) from the words in the extracted valid parse-tree information. The word vector information expresses the similarity between any two words as a value between -1 and 1, or between 0 and 1; the higher the value, the more similar the two words. Using the word vectors extracted in step S370, sets of words with high similarity to the user's words of interest can be extracted. Valid word sets can be determined by statistical processing: for each word of interest, the N words with the highest similarity are selected and extracted as a group of valid similar words, so the number of valid-similar-word sets is determined by the number of words of interest. Statistics such as the mean A, standard deviation A, and variance A of each valid-similar-word set are computed, and using each valid similar word and the mean A, the groups of valid similar words can be scored in order of the smallest sum of deviations. The groups can also be adjusted so that the deviation between the variance values of the groups stays below a predetermined level; the variance value and the deviation between variance values can therefore be used as user-set values for determining valid similar words, and these user-set values may be changed and applied each time step S370 is performed. When multiple variance values exist across the groups in this way, an analysis of variance (ANOVA) can be used to decide whether a given significance level is satisfied.
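The following sketch strings these pieces together with gensim and SciPy: it trains a tiny word2vec model, takes the top-N neighbours of each word of interest as a valid-similar-word group, and computes the per-group statistics and a one-way ANOVA mentioned above. The corpus, N, and thresholds are placeholders, and the library choice is an assumption.

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.stats import f_oneway

# Tokenized sentences built from the valid parse-tree words (placeholder corpus).
sentences = [
    ["ontology", "graph", "hierarchy", "relation", "word"],
    ["image", "feature", "similarity", "ontology", "drawing"],
    ["topic", "model", "word", "vector", "graph", "hierarchy"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

interest_words = ["ontology", "graph"]   # user's words of interest (assumption)
N = 3
groups = {w: model.wv.most_similar(w, topn=N) for w in interest_words}

# Per-group statistics over the similarity scores, plus ANOVA across groups.
scores = {w: np.array([s for _, s in cands]) for w, cands in groups.items()}
for w, s in scores.items():
    print(f"{w}: mean={s.mean():.3f} std={s.std():.3f} var={s.var():.3f}")

f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA p-value: {p_value:.3f}")
```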
In step S380, the valid similar words selected in S370 are added to the ontology composed of the user's words of interest. When step S380 is complete, step S330 is performed again based on the expanded ontology information. When step S330 is repeated, step S380 may include a step in which the user decides whether a valid similar word is added to the user ontology, or it may be set so that a valid similar word whose statistics satisfy a threshold is added automatically and step S330 repeats automatically. When step S340 is performed again, the updated user ontology information may be used, or the user may modify the ontology information. As steps S330 to S380 are repeated, the ontology information defined by the user is therefore continuously updated, and new valid similar words are updated into the word vector information. The loop over steps S330 to S380 may be repeated until no further valid similar words are extracted.
In FIG. 3, step S360 may branch to step G1. The branch is taken when ontology-expansion words are determined using the word hierarchy information of the parse tree determined in step S360, instead of the word-vector method based on high similarity to the words of interest.
Step S410 expresses the valid words that lie within a certain distance of the user's words of interest in the hierarchy information of the parse tree as a matrix. The matrix values take the form of the reciprocal of the distance of each valid word: the shorter the distance between a word of interest and a valid word in the parse tree, the higher the connectivity, so smaller distances are more significant.
Before constructing the valid matrix, a network analysis method can be applied using the link information between valid words that occur in a document, and valid words can be screened based on the link scores extracted from the network. Accordingly, in step S410, extracting the user's valid words may be performed with reference to user-set values for the distance between the word of interest and each valid-word candidate and for the link score or ranking from the network analysis. These user-set values may be changed and applied each time step S360 or S410 is performed.
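A compact sketch of the two signals used in step S410 is given below, with NetworkX PageRank standing in for the link score; the candidate words, distances, edges, and thresholds are all placeholders.

```python
import numpy as np
import networkx as nx

# Candidate valid words and their tree distance to the word of interest (placeholders).
candidates = ["graph", "hierarchy", "image", "feature"]
distances = np.array([1.0, 2.0, 3.0, 2.0])
inv_dist = 1.0 / distances   # matrix entries: reciprocal of the distance

# Link information between valid words co-occurring in the document,
# scored here with PageRank as one possible network-analysis choice.
G = nx.Graph()
G.add_edges_from([("graph", "hierarchy"), ("graph", "image"), ("image", "feature")])
link_score = nx.pagerank(G)

# User-set thresholds (assumptions) combine both signals.
for word, inv in zip(candidates, inv_dist):
    keep = inv >= 0.5 or link_score.get(word, 0.0) >= 0.30
    print(f"{word:10s} 1/d={inv:.2f} link={link_score.get(word, 0.0):.2f} "
          f"{'keep' if keep else 'drop'}")
```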
In step S420, all the matrix information generated in step S410 is grouped into a single overall matrix, and the SVD method is applied to it.
In step S430, a truncated SVD is selected using only the top values of the diagonal matrix (the singular values) obtained from the SVD of step S420. In step S440, the distribution of the valid matrix values of the selected SVD and the diagonal-matrix information may be used to separate it into several subgroups. Step S440 is also the step of extracting related word (entity) information from the hierarchy information of the selected SVD containing these subgroups.
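As a hedged sketch of steps S420 to S440, the code below applies scikit-learn's TruncatedSVD to a random stand-in for the grouped distance matrix and then splits the retained structure into subgroups with k-means; the matrix, component count, and clustering choice are assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Stand-in for the grouped inverse-distance matrix (rows: valid words).
rng = np.random.default_rng(0)
M = rng.random((20, 8))

# Keep only the top singular values/vectors (the truncated SVD of step S430).
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(M)
print("retained singular values:", np.round(svd.singular_values_, 3))

# Separate the projected rows into subgroups (one possible reading of step S440).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print("subgroup labels:", labels)
```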
In step S450, the ontology information composed of the user's words of interest is combined with the selected SVD subgroups that have high connectivity. The key words and hierarchy information contained in a selected SVD subgroup can be used to merge it with the user ontology information. A selected SVD subgroup may be attached using the distance to the closest word of interest, or the user may designate the word of interest to which the subgroup is to be attached. Because step S450 combines a selected SVD subgroup consisting of one or more words with a word of interest, the user can understand the meaning of the recommended word group more clearly and can easily decide whether to expand the user ontology information. After step S450, step S330 of FIG. 3 is performed again using the expanded user ontology information. As steps S330 to S450 are repeated, the user ontology information is continuously expanded; the loop ends when no new valid-word candidates are generated, or the user may end it deliberately.
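The attachment of a selected subgroup to the user ontology in step S450 can be pictured as a small graph operation; the sketch below uses a NetworkX digraph, with the subgroup, relation labels, and attachment word chosen purely for illustration.

```python
import networkx as nx

# User ontology: words of interest with hierarchy/relation edges (placeholders).
ontology = nx.DiGraph()
ontology.add_edge("ontology", "graph", relation="has_part")
ontology.add_edge("ontology", "image", relation="uses")

# A selected SVD subgroup and the word of interest it attaches to, chosen
# here by hand; the method may instead use the nearest word of interest.
subgroup = ["hierarchy", "relation"]
attach_to = "graph"

for word in subgroup:
    ontology.add_edge(attach_to, word, relation="related_to")

print(list(ontology.edges(data="relation")))
```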
FIG. 6 shows a method of extracting documents with similar image information from other topic documents, based on the documents from which the user ontology information was expanded. It is a method of using images to retrieve documents that are not found by the text-based topic model.
From step S380, the process branches to G2 and the data are processed there. The branch to G2 is taken when there is similarity among at least a certain number of images within the expanded-ontology documents; even when there are no similar images at that frequency, the branch to G2 can still be taken at the user's discretion.
Step S610 of FIG. 6 analyzes the similarity of the images extracted from topic document A. In step S620, a representative drawing is determined within the topic documents; here the entire image area is scanned with an image pixel window and feature points are extracted. In step S630, the main object of the representative drawing is either designated by the user or determined by the system using the feature points extracted in S620. Once the main object of the representative drawing has been selected, S640 extracts a topic document C that has a similar image from topic document B, which consists of topic documents other than topic document A. Using the topic document C extracted from the different topic documents, the topic modeling step S310 of FIG. 3 is performed again.
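A sketch of the image side of this loop is shown below using OpenCV ORB features and brute-force matching; the file names, detector, and match threshold are assumptions, since the specification only speaks of scanning a pixel window and extracting feature points.

```python
import cv2

# Hypothetical file names for the representative drawing (topic document A)
# and a candidate image from a document in a different topic.
img_a = cv2.imread("representative.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("candidate.png", cv2.IMREAD_GRAYSCALE)

# ORB keypoints/descriptors stand in for the extracted feature points.
orb = cv2.ORB_create()
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)

# Brute-force Hamming matching; the number of good matches serves as a
# similarity score for deciding whether the candidate document is kept.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des_a, des_b), key=lambda m: m.distance)
good = [m for m in matches if m.distance < 40]
print(f"{len(good)} good matches")
```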

Claims (6)

  1. A method of expanding ontology information, the method comprising:
    extracting, by a system, a group of topic documents A that contain the ontology information;
    determining, by the system, a representative image of the topic document A group;
    receiving, by the system, a feature region of the representative image;
    determining, by the system, a topic document B having an image feature region similar to the feature region of the representative image, from among document groups classified into different topics; and
    performing, by the system, topic modeling using the extracted topic document B in order to expand the user ontology.
  2. The method of claim 1, wherein performing the topic modeling using the topic document B further comprises receiving, by the system, the number of topic models.
  3. A method of expanding ontology information, the method comprising:
    receiving, by a system, user-selected words included in the same topic;
    receiving, by the system, hierarchy information and connection information between the user-selected words (a user ontology);
    generating, by the system, a parse tree from documents included in the topic using the user ontology;
    determining, by the system, valid words using the parse-tree information; and
    receiving, by the system, a decision to apply word vector information or a decision to apply truncated SVD, in order to expand the user ontology.
  4. The method of claim 1, further comprising receiving, by the system, distance information with respect to the user ontology in order to determine the valid words in the parse-tree determining step.
  5. The method of claim 1, further comprising receiving, by the system, variance values and deviation information between the variance values in order to determine valid similar words by applying the word vector information.
  6. The method of claim 1, further comprising receiving, by the system, the number of truncated-SVD subgroups in order to determine the truncated-SVD subgroups by applying the truncated SVD.
PCT/KR2019/011054 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity WO2021040089A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2019/011054 WO2021040089A1 (en) 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2019/011054 WO2021040089A1 (en) 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity

Publications (1)

Publication Number Publication Date
WO2021040089A1 (en)

Family

ID=74684044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2019/011054 WO2021040089A1 (en) 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity

Country Status (1)

Country Link
WO (1) WO2021040089A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110044562A (en) * 2009-10-23 2011-04-29 동국대학교 산학협력단 Method and apparatus for measuring subject of document using ontology
JP5994974B2 (en) * 2012-05-31 2016-09-21 サターン ライセンシング エルエルシーSaturn Licensing LLC Information processing apparatus, program, and information processing method
KR101442719B1 (en) * 2013-04-16 2014-09-19 한양대학교 에리카산학협력단 Apparatus and method for recommendation of academic paper
JP2016071412A (en) * 2014-09-26 2016-05-09 キヤノン株式会社 Image classification apparatus, image classification system, image classification method, and program
KR20180087772A (en) * 2017-01-25 2018-08-02 주식회사 카카오 Method for clustering and sharing images, and system and application implementing the same method

Similar Documents

Publication Publication Date Title
CN109388795B (en) Named entity recognition method, language recognition method and system
Mitra et al. An automatic approach to identify word sense changes in text media across timescales
CN104881458B (en) A kind of mask method and device of Web page subject
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
WO2021051864A1 (en) Dictionary expansion method and apparatus, electronic device and storage medium
CN111444723A (en) Information extraction model training method and device, computer equipment and storage medium
CN109508460B (en) Unsupervised composition running question detection method and unsupervised composition running question detection system based on topic clustering
CN113590810B (en) Abstract generation model training method, abstract generation device and electronic equipment
CN110516259B (en) Method and device for identifying technical keywords, computer equipment and storage medium
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN111309916A (en) Abstract extraction method and device, storage medium and electronic device
CN112527977B (en) Concept extraction method, concept extraction device, electronic equipment and storage medium
CN109117477B (en) Chinese field-oriented non-classification relation extraction method, device, equipment and medium
CN109902290A (en) A kind of term extraction method, system and equipment based on text information
CN111626291A (en) Image visual relationship detection method, system and terminal
CN116109732A (en) Image labeling method, device, processing equipment and storage medium
CN114880496A (en) Multimedia information topic analysis method, device, equipment and storage medium
WO2019163642A1 (en) Summary evaluation device, method, program, and storage medium
CN113158667B (en) Event detection method based on entity relationship level attention mechanism
CN112015903B (en) Question duplication judging method and device, storage medium and computer equipment
Charoenpornsawat et al. Feature-based thai unknown word boundary identification using winnow
CN111681731A (en) Method for automatically marking colors of inspection report
WO2021040089A1 (en) Method for expanding ontology data in heterogenous topic document on basis of image similarity
CN111930885A (en) Method and device for extracting text topics and computer equipment
CN110069780B (en) Specific field text-based emotion word recognition method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943363

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19943363

Country of ref document: EP

Kind code of ref document: A1