WO2021040089A1 - Method for expanding ontology data in heterogenous topic document on basis of image similarity - Google Patents

Method for expanding ontology data in heterogenous topic document on basis of image similarity

Info

Publication number
WO2021040089A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
ontology
user
topic
words
Prior art date
Application number
PCT/KR2019/011054
Other languages
French (fr)
Korean (ko)
Inventor
이대희
이준성
백인호
Original Assignee
주식회사 테크플럭스
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 테크플럭스
Priority to PCT/KR2019/011054
Publication of WO2021040089A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Topic modeling is used to classify words by topic. When a user configures major words and their hierarchy within a specific topic, parse-tree information is produced, and words similar to the hierarchy of the major words configured by the user are extracted. The extracted words are used to build word embeddings, and the similarity between the major words selected by the user and the word embeddings is used to extract words with a high degree of relation. This series of processes is repeated to extend ontology (semantic-graph) information having hierarchy and relation information for the user's words of interest. The word-embedding step may be replaced by a parse-tree method, in which case the hierarchy of words extracted by parsing is used, with an SVD method applied, to expand the ontology information selected by the user. In addition, a region of interest of a representative image is selected from the expanded ontology information, and a document having similar image information is extracted from a heterogeneous topic document, thereby expanding the user ontology information.

Description

Method for expanding ontology data in heterogeneous topic documents on the basis of image similarity
The present invention relates to a method of expanding ontology information (an ontology or semantic graph) that represents the hierarchical structure and connection information of key words selected by a user, using topic modeling in the fields of natural language processing and image processing analysis.
Feature vectors can be extracted from text and image information so that each item of information can be classified by topic. For this topic modeling, singular value decomposition (SVD) or latent Dirichlet allocation (LDA) can be applied.
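As a rough illustration of this classification step, the sketch below fits an LDA topic model to a toy corpus with scikit-learn and reads off a document-topic distribution; the corpus, the number of topics, and the choice of library are assumptions made for illustration only and are not part of the specification.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the big-data document collection (assumption).
docs = [
    "ontology graph hierarchy relation between key words",
    "image feature similarity between representative drawings",
    "topic model classifies words and images into topics",
]

# Bag-of-words features, then LDA with a user-chosen number of topics.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # rows: documents, columns: topic proportions

print(doc_topic.round(2))
```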
Ontology information can then be constructed from the text and image information classified by topic modeling, using the relation and hierarchy information between the main items of information.
US 9,892,194 proposes determining a hierarchical structure by applying weights based on important paragraphs; US 9,449,051 determines the hierarchical structure by considering the frequency of word occurrence per topic; and US 10,216,829 calculates topic probabilities for groups of words.
The present invention is a method of expanding ontology (knowledge-graph, semantic-graph) information that has a high degree of relation to a user's words of interest and also satisfies hierarchy information, over a large collection of documents. It is also a method of topic modeling that uses both text and image information. The aim is to propose a method of expanding such ontology information so that it has high connectivity and satisfies hierarchy information for the user's words of interest and image regions of interest in a big-data document collection.
Words can be classified by topic by applying topic modeling to big-data document information. When a user configures key words and hierarchy information for a specific topic, the system builds parse-tree information and extracts words whose hierarchy is similar to that of the key words configured by the user. Word vector information (word embeddings) is extracted from these words, and words with high connectivity are then extracted using the similarity between the key words selected by the user and the word vectors. By repeating this series of processes, ontology (semantic-graph) information having hierarchy and connectivity information for the user's words of interest is expanded. The word-vector step may be replaced by a parse-tree method, in which case the ontology information selected by the user is expanded using the hierarchy information of the words extracted by parsing. In addition, user ontology information can be expanded by selecting a region of interest of a representative image from the expanded ontology information and extracting documents with similar image information from heterogeneous topic documents.
The relative importance of words is determined according to the user's main words of interest, and by reflecting the hierarchy and connection information between words defined by the user, the words of interest are effectively expanded together with words that share that hierarchy and have high connectivity, so that ontology information can be generated from the user's point of view. In addition, a region of interest of a representative image is selected from the expanded ontology information, and documents with similar image information are extracted from heterogeneous topic documents to further expand the user ontology information.
FIG. 1 is a conceptual diagram of the topic modeling of the present invention.
FIG. 2 shows an SVD matrix according to an embodiment of the present invention.
FIG. 3 shows a method of extracting words with high similarity using word vector information and expanding ontology information, reflecting the hierarchy information of the user's words of interest, according to an embodiment of the present invention.
FIG. 4 shows a method of expanding ontology information by applying the SVD method to parse-tree information and extracting words that are highly connected in the hierarchy information, reflecting the hierarchy information of the user's words of interest, according to an embodiment of the present invention.
FIG. 5 is an example of a parse tree according to an embodiment of the present invention.
FIG. 6 shows a method of expanding user ontology information across documents of different topics by using images of documents that contain the expanded ontology information, according to an embodiment of the present invention.
The terms used in this specification are briefly explained, and the present invention is then described in detail. The terms used in the present invention have been selected, where possible, from general terms currently in wide use, in consideration of their function in the present invention, but they may vary according to the intention of those skilled in the art, legal precedent, or the emergence of new technology. In certain cases there are also terms arbitrarily chosen by the applicant, in which case their meaning is described in detail in the corresponding part of the description. Therefore, the terms used in the present invention should be defined based on their meaning and on the overall content of the present invention, not simply on their names.
Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "unit" and "module" described in the specification mean a unit that processes at least one function or operation, which may be implemented as hardware, as software, or as a combination of hardware and software.
Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily practice the invention. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar reference numerals are attached to similar parts throughout the specification.
Hereinafter, the present invention is described in detail with reference to the accompanying drawings.
FIG. 1 is a conceptual diagram of the topic modeling of the present invention. Every word in each document is classified according to user-defined topics, so words related to several topics can be distributed within one document.
SVD (singular value decomposition) or LDA (latent Dirichlet allocation) is used as the topic classification method. SVD is one of the matrix decomposition methods; besides lowering computational cost, it deletes and compresses low-information data, so that deeper structure not apparent in the original data can be identified. FIG. 2 shows an SVD matrix according to an embodiment of the present invention. The equation below is the matrix form of the SVD.
$A = U S V^{T}$
where U and V are orthogonal matrices and S is a rectangular diagonal matrix.
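A small numerical sketch of this factorization with NumPy follows; the matrix A is a made-up example rather than data from the embodiment.

```python
import numpy as np

# Made-up term-document style matrix (assumption, for illustration only).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 3.0],
              [0.0, 2.0, 0.0]])

# Thin SVD: A = U @ S @ Vt, with U, V orthogonal and S diagonal.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
assert np.allclose(A, U @ S @ Vt)

# Keeping only the largest singular values gives a low-rank (truncated)
# approximation, which is what discards low-information structure.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))
```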
FIG. 5 is an example of a parse tree according to an embodiment of the present invention. The parse tree is obtained by analyzing the morphemes of a sentence through syntactic analysis, yielding the parts of speech of the words and the sentence constituents. Using the information in the parse tree, the hierarchy information and connectivity of the words in a sentence can therefore be analyzed.
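The specification does not name a parser, so the sketch below uses spaCy's dependency parse as one possible stand-in for the parse tree: each token's part of speech, its parent in the tree, and its depth give the kind of hierarchy and connectivity information described above.

```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The system expands ontology information by using image similarity.")

def depth(token):
    # Number of hops up to the root; the root is its own head in spaCy.
    d = 0
    while token.head is not token:
        token = token.head
        d += 1
    return d

for token in doc:
    print(f"{token.text:12s} pos={token.pos_:6s} dep={token.dep_:10s} "
          f"head={token.head.text:10s} depth={depth(token)}")
```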
FIG. 3 shows a method of extracting words with high similarity using word vector information and expanding ontology information, reflecting the hierarchy information of the user's words of interest, according to an embodiment of the present invention.
In step S310 of FIG. 3, the number of topic models is determined for topic modeling.
In step S320, a topic extracted by the topic modeling method is selected, and the words included in that topic are selected. At this point, the words to be used can be screened using the importance of each word or the probability of each word belonging to the topic. The flowchart of FIG. 3 is therefore performed for each topic.
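A minimal sketch of this screening step is shown below, assuming a topic-word weight matrix such as the components_ attribute of a fitted LDA model; the vocabulary, weights, and threshold are placeholders.

```python
import numpy as np

# Hypothetical topic-word weights (rows: topics, columns: vocabulary).
topic_word = np.array([[4.0, 0.5, 3.0, 2.5],
                       [1.0, 5.5, 0.5, 3.0]])
vocab = ["ontology", "image", "graph", "feature"]

threshold = 0.20   # user-set probability threshold (assumption)
for t, weights in enumerate(topic_word):
    probs = weights / weights.sum()   # probability of each word within the topic
    selected = [w for w, p in zip(vocab, probs) if p >= threshold]
    print(f"topic {t}: {selected}")
```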
Step S330 is the process in which the user selects words of interest; the user may select a single word or a group of words. In step S340, the user defines the hierarchy information (parse tree 1) and connection information of the main words of interest. The connection information can express a variety of semantic relationships between two words.
In step S340, each user may define the main words of interest and the hierarchy and connection information between them differently. This step therefore sets the search direction of the system by defining the user's words of interest and ontology information.
In step S350, sentences containing the user's words of interest are used to extract a parse tree B whose hierarchy information is similar to that of the parse tree A defined in step S340. In step S360, valid words are determined using the similarity of the hierarchy information of parse tree A and parse tree B and the word-distance information between the candidate words and the user's words of interest on the parse tree.
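One way to obtain the word-distance information used in step S360 is to treat the parse tree as a graph and measure path length between tokens; the sketch below does this with spaCy and NetworkX, both of which are assumptions rather than libraries named in the specification.

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Topic modeling classifies words and images into topics for ontology expansion.")

# Undirected graph over the dependency tree: path length = word distance on the tree.
G = nx.Graph()
for token in doc:
    for child in token.children:
        G.add_edge(token.i, child.i)

interest = next(t for t in doc if t.text.lower() == "ontology")
for token in doc:
    if token.is_alpha and token.i != interest.i:
        d = nx.shortest_path_length(G, interest.i, token.i)
        print(f"{token.text:12s} distance={d}")
```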
Step S370 generates word vector information (word2vec, word embeddings) from the words in the extracted valid parse-tree information. The word vector information expresses the similarity between any two words as a value between -1 and 1, or between 0 and 1; the higher the value, the more similar the two words. Using the word vectors extracted in step S370, sets of words with high similarity to the user's words of interest can be extracted. Valid word sets can be determined by statistical processing: for each word of interest, the N words with the highest similarity are selected and extracted as a group of valid similar words, so the number of valid-similar-word sets is determined by the number of words of interest. Statistics such as the mean A, standard deviation A, and variance A of each valid-similar-word set are computed, and using each valid similar word and the mean A, the groups of valid similar words can be scored in order of the smallest sum of deviations. The groups can also be adjusted so that the deviation between the variance values of the groups stays below a predetermined level; the variance value and the deviation between variance values can therefore be used as user-set values for determining valid similar words, and these user-set values may be changed and applied each time step S370 is performed. When multiple variance values exist across the groups in this way, an analysis of variance (ANOVA) can be used to decide whether a given significance level is satisfied.
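The following sketch strings these pieces together with gensim and SciPy: it trains a tiny word2vec model, takes the top-N neighbours of each word of interest as a valid-similar-word group, and computes the per-group statistics and a one-way ANOVA mentioned above. The corpus, N, and thresholds are placeholders, and the library choice is an assumption.

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.stats import f_oneway

# Tokenized sentences built from the valid parse-tree words (placeholder corpus).
sentences = [
    ["ontology", "graph", "hierarchy", "relation", "word"],
    ["image", "feature", "similarity", "ontology", "drawing"],
    ["topic", "model", "word", "vector", "graph", "hierarchy"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

interest_words = ["ontology", "graph"]   # user's words of interest (assumption)
N = 3
groups = {w: model.wv.most_similar(w, topn=N) for w in interest_words}

# Per-group statistics over the similarity scores, plus ANOVA across groups.
scores = {w: np.array([s for _, s in cands]) for w, cands in groups.items()}
for w, s in scores.items():
    print(f"{w}: mean={s.mean():.3f} std={s.std():.3f} var={s.var():.3f}")

f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA p-value: {p_value:.3f}")
```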
In step S380, the valid similar words selected in S370 are added to the ontology composed of the user's words of interest. When step S380 is complete, step S330 is performed again based on the expanded ontology information. When step S330 is repeated, step S380 may include a step in which the user decides whether a valid similar word is added to the user ontology, or it may be set so that a valid similar word whose statistics satisfy a threshold is added automatically and step S330 repeats automatically. When step S340 is performed again, the updated user ontology information may be used, or the user may modify the ontology information. As steps S330 to S380 are repeated, the ontology information defined by the user is therefore continuously updated, and new valid similar words are updated into the word vector information. The loop over steps S330 to S380 may be repeated until no further valid similar words are extracted.
In FIG. 3, step S360 may branch to step G1. The branch is taken when ontology-expansion words are determined using the word hierarchy information of the parse tree determined in step S360, instead of the word-vector method based on high similarity to the words of interest.
Step S410 expresses the valid words that lie within a certain distance of the user's words of interest in the hierarchy information of the parse tree as a matrix. The matrix values take the form of the reciprocal of the distance of each valid word: the shorter the distance between a word of interest and a valid word in the parse tree, the higher the connectivity, so smaller distances are more significant.
Before constructing the valid matrix, a network analysis method can be applied using the link information between valid words that occur in a document, and valid words can be screened based on the link scores extracted from the network. Accordingly, in step S410, extracting the user's valid words may be performed with reference to user-set values for the distance between the word of interest and each valid-word candidate and for the link score or ranking from the network analysis. These user-set values may be changed and applied each time step S360 or S410 is performed.
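A compact sketch of the two signals used in step S410 is given below, with NetworkX PageRank standing in for the link score; the candidate words, distances, edges, and thresholds are all placeholders.

```python
import numpy as np
import networkx as nx

# Candidate valid words and their tree distance to the word of interest (placeholders).
candidates = ["graph", "hierarchy", "image", "feature"]
distances = np.array([1.0, 2.0, 3.0, 2.0])
inv_dist = 1.0 / distances   # matrix entries: reciprocal of the distance

# Link information between valid words co-occurring in the document,
# scored here with PageRank as one possible network-analysis choice.
G = nx.Graph()
G.add_edges_from([("graph", "hierarchy"), ("graph", "image"), ("image", "feature")])
link_score = nx.pagerank(G)

# User-set thresholds (assumptions) combine both signals.
for word, inv in zip(candidates, inv_dist):
    keep = inv >= 0.5 or link_score.get(word, 0.0) >= 0.30
    print(f"{word:10s} 1/d={inv:.2f} link={link_score.get(word, 0.0):.2f} "
          f"{'keep' if keep else 'drop'}")
```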
In step S420, all the matrix information generated in step S410 is grouped into a single overall matrix, and the SVD method is applied to it.
In step S430, a truncated SVD is selected using only the top values of the diagonal matrix (the singular values) obtained from the SVD of step S420. In step S440, the distribution of the valid matrix values of the selected SVD and the diagonal-matrix information may be used to separate it into several subgroups. Step S440 is also the step of extracting related word (entity) information from the hierarchy information of the selected SVD containing these subgroups.
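As a hedged sketch of steps S420 to S440, the code below applies scikit-learn's TruncatedSVD to a random stand-in for the grouped distance matrix and then splits the retained structure into subgroups with k-means; the matrix, component count, and clustering choice are assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Stand-in for the grouped inverse-distance matrix (rows: valid words).
rng = np.random.default_rng(0)
M = rng.random((20, 8))

# Keep only the top singular values/vectors (the truncated SVD of step S430).
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(M)
print("retained singular values:", np.round(svd.singular_values_, 3))

# Separate the projected rows into subgroups (one possible reading of step S440).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print("subgroup labels:", labels)
```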
In step S450, the ontology information composed of the user's words of interest is combined with the selected SVD subgroups that have high connectivity. The key words and hierarchy information contained in a selected SVD subgroup can be used to merge it with the user ontology information. A selected SVD subgroup may be attached using the distance to the closest word of interest, or the user may designate the word of interest to which the subgroup is to be attached. Because step S450 combines a selected SVD subgroup consisting of one or more words with a word of interest, the user can understand the meaning of the recommended word group more clearly and can easily decide whether to expand the user ontology information. After step S450, step S330 of FIG. 3 is performed again using the expanded user ontology information. As steps S330 to S450 are repeated, the user ontology information is continuously expanded; the loop ends when no new valid-word candidates are generated, or the user may end it deliberately.
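The attachment of a selected subgroup to the user ontology in step S450 can be pictured as a small graph operation; the sketch below uses a NetworkX digraph, with the subgroup, relation labels, and attachment word chosen purely for illustration.

```python
import networkx as nx

# User ontology: words of interest with hierarchy/relation edges (placeholders).
ontology = nx.DiGraph()
ontology.add_edge("ontology", "graph", relation="has_part")
ontology.add_edge("ontology", "image", relation="uses")

# A selected SVD subgroup and the word of interest it attaches to, chosen
# here by hand; the method may instead use the nearest word of interest.
subgroup = ["hierarchy", "relation"]
attach_to = "graph"

for word in subgroup:
    ontology.add_edge(attach_to, word, relation="related_to")

print(list(ontology.edges(data="relation")))
```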
FIG. 6 shows a method of extracting documents with similar image information from other topic documents, based on the documents from which the user ontology information was expanded. It is a method of using images to retrieve documents that are not found by the text-based topic model.
From step S380, the process branches to G2 and the data are processed there. The branch to G2 is taken when there is similarity among at least a certain number of images within the expanded-ontology documents; even when there are no similar images at that frequency, the branch to G2 can still be taken at the user's discretion.
Step S610 of FIG. 6 analyzes the similarity of the images extracted from topic document A. In step S620, a representative drawing is determined within the topic documents; here the entire image area is scanned with an image pixel window and feature points are extracted. In step S630, the main object of the representative drawing is either designated by the user or determined by the system using the feature points extracted in S620. Once the main object of the representative drawing has been selected, S640 extracts a topic document C that has a similar image from topic document B, which consists of topic documents other than topic document A. Using the topic document C extracted from the different topic documents, the topic modeling step S310 of FIG. 3 is performed again.
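A sketch of the image side of this loop is shown below using OpenCV ORB features and brute-force matching; the file names, detector, and match threshold are assumptions, since the specification only speaks of scanning a pixel window and extracting feature points.

```python
import cv2

# Hypothetical file names for the representative drawing (topic document A)
# and a candidate image from a document in a different topic.
img_a = cv2.imread("representative.png", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("candidate.png", cv2.IMREAD_GRAYSCALE)

# ORB keypoints/descriptors stand in for the extracted feature points.
orb = cv2.ORB_create()
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)

# Brute-force Hamming matching; the number of good matches serves as a
# similarity score for deciding whether the candidate document is kept.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des_a, des_b), key=lambda m: m.distance)
good = [m for m in matches if m.distance < 40]
print(f"{len(good)} good matches")
```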

Claims (6)

  1. A method of expanding ontology information, the method comprising:
    extracting, by a system, a group of topic documents A that contain the ontology information;
    determining, by the system, a representative image of the topic document A group;
    receiving, by the system, a feature region of the representative image;
    determining, by the system, a topic document B having an image feature region similar to the feature region of the representative image, from among document groups classified into different topics; and
    performing, by the system, topic modeling using the extracted topic document B in order to expand the user ontology.
  2. The method of claim 1, wherein performing the topic modeling using the topic document B further comprises receiving, by the system, the number of topic models.
  3. A method of expanding ontology information, the method comprising:
    receiving, by a system, user-selected words included in the same topic;
    receiving, by the system, hierarchy information and connection information between the user-selected words (a user ontology);
    generating, by the system, a parse tree from documents included in the topic using the user ontology;
    determining, by the system, valid words using the parse-tree information; and
    receiving, by the system, a decision to apply word vector information or a decision to apply truncated SVD, in order to expand the user ontology.
  4. The method of claim 1, further comprising receiving, by the system, distance information with respect to the user ontology in order to determine the valid words in the parse-tree determining step.
  5. The method of claim 1, further comprising receiving, by the system, variance values and deviation information between the variance values in order to determine valid similar words by applying the word vector information.
  6. The method of claim 1, further comprising receiving, by the system, the number of truncated-SVD subgroups in order to determine the truncated-SVD subgroups by applying the truncated SVD.
PCT/KR2019/011054 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity WO2021040089A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2019/011054 WO2021040089A1 (en) 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2019/011054 WO2021040089A1 (en) 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity

Publications (1)

Publication Number Publication Date
WO2021040089A1 (en)

Family

ID=74684044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2019/011054 WO2021040089A1 (en) 2019-08-29 2019-08-29 Method for expanding ontology data in heterogenous topic document on basis of image similarity

Country Status (1)

Country Link
WO (1) WO2021040089A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110044562A (en) * 2009-10-23 2011-04-29 동국대학교 산학협력단 Method and apparatus for measuring subject of document using ontology
JP5994974B2 (en) * 2012-05-31 2016-09-21 サターン ライセンシング エルエルシーSaturn Licensing LLC Information processing apparatus, program, and information processing method
KR101442719B1 (en) * 2013-04-16 2014-09-19 한양대학교 에리카산학협력단 Apparatus and method for recommendation of academic paper
JP2016071412A (en) * 2014-09-26 2016-05-09 キヤノン株式会社 Image classification apparatus, image classification system, image classification method, and program
KR20180087772A (en) * 2017-01-25 2018-08-02 주식회사 카카오 Method for clustering and sharing images, and system and application implementing the same method

Similar Documents

Publication Publication Date Title
CN109388795B (en) Named entity recognition method, language recognition method and system
Mitra et al. An automatic approach to identify word sense changes in text media across timescales
CN104881458B (en) A kind of mask method and device of Web page subject
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
WO2021051864A1 (en) Dictionary expansion method and apparatus, electronic device and storage medium
CN111444723A (en) Information extraction model training method and device, computer equipment and storage medium
CN109508460B (en) Unsupervised composition running question detection method and unsupervised composition running question detection system based on topic clustering
CN113590810B (en) Abstract generation model training method, abstract generation device and electronic equipment
CN110516259B (en) Method and device for identifying technical keywords, computer equipment and storage medium
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN111309916A (en) Abstract extraction method and device, storage medium and electronic device
CN112527977B (en) Concept extraction method, concept extraction device, electronic equipment and storage medium
CN109117477B (en) Chinese field-oriented non-classification relation extraction method, device, equipment and medium
CN109902290A (en) A kind of term extraction method, system and equipment based on text information
CN111626291A (en) Image visual relationship detection method, system and terminal
CN116109732A (en) Image labeling method, device, processing equipment and storage medium
CN114880496A (en) Multimedia information topic analysis method, device, equipment and storage medium
WO2019163642A1 (en) Summary evaluation device, method, program, and storage medium
CN113158667B (en) Event detection method based on entity relationship level attention mechanism
CN112015903B (en) Question duplication judging method and device, storage medium and computer equipment
Charoenpornsawat et al. Feature-based thai unknown word boundary identification using winnow
CN111681731A (en) Method for automatically marking colors of inspection report
WO2021040089A1 (en) Method for expanding ontology data in heterogenous topic document on basis of image similarity
CN111930885A (en) Method and device for extracting text topics and computer equipment
CN110069780B (en) Specific field text-based emotion word recognition method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943363

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19943363

Country of ref document: EP

Kind code of ref document: A1