KR102025819B1

KR102025819B1 - Device and method for constructing dynamic-terms identification system in user gererated contents

Info

Publication number: KR102025819B1
Application number: KR1020180036501A
Authority: KR
Inventors: 한상용; 서지완; 최승진; 김준호
Original assignee: 중앙대학교 산학협력단
Priority date: 2017-03-29
Filing date: 2018-03-29
Publication date: 2019-09-26
Also published as: KR20180110639A

Abstract

The present invention relates to an apparatus and method for constructing a dynamic term identification scheme, and more particularly, to an apparatus and method for constructing dynamic term identification scheme of user-generated content. The present invention can provide a dynamic term identification system construction apparatus and method that can consider the meaning of terms dynamically even with time changes and can distinguish each meaning even in the case of homonyms.

Description

DEVICE AND METHOD FOR CONSTRUCTING DYNAMIC-TERMS IDENTIFICATION SYSTEM IN USER GERERATED CONTENTS}

The present invention relates to an apparatus and method for constructing a dynamic term identification scheme, and more particularly, to an apparatus and method for constructing dynamic term identification scheme of user-generated content.

With the development of IT technology, various query retrieval systems are being serviced. One of the most popular services used by people around the world is Google. Such a query search system finds the most similar contents by using a query input by a user. However, if the user does not have a good understanding of the domain, it may be difficult to reach the desired information. In order to easily reach the information the user wants, the system needs to provide the appropriate words.

In order to provide a suitable term for the user's initial query, a term identification system is required. In addition, a term identification system is also required when analyzing and utilizing documents.

At this time, when using a static term identification system that does not reflect the generation and change of terms over time, the system may not operate properly. An example of a static term identification scheme is WordNet. WordNet is a word dictionary that categorizes words into a set of meanings to enable associations between words. In the case of using a static term identification system such as WordNet, the system may not operate properly when a new term emerges or the meaning or use of the term changes. In order to solve these problems, it is necessary to construct a term identification system dynamically.

Background art of the present invention is disclosed in Republic of Korea Patent Publication No. 2009-0048261 (published May 13, 2009).

It is an object of the present invention to solve the above and other problems.

Another object of the present invention is to provide an apparatus and method for constructing a dynamic term identification system that dynamically considers the meaning of terms even with time, and distinguishes each meaning even in the case of homonyms.

The technical problems to be achieved in the present invention are not limited to the technical problems mentioned above, and other technical problems not mentioned above will be clearly understood by those skilled in the art from the following description. Could be.

In order to solve the above problems, a method for constructing a dynamic term identification system according to an embodiment of the present invention comprises the steps of processing the user-generated content in the form of terms; Converting the processed term into a vector of vector space; Extracting candidate words associated with a query based on the vector to form a cluster; And determining the total number of clusters based on the result of evaluating the clusters.

The result of evaluating the cluster may include a DBs index (Davies Bouldin index) for the cluster.

The DB index (Davies Bouldin index, DB) is determined using the following equation 1,

[Equation 1]

N is the number of clusters,

Is the center point of cluster x,

Is the center point from all data objects in cluster x

Is the average value of the distance up to

Is the center point

And center point

It can be the distance between.

The determining of the total number of clusters may include: calculating a first result evaluated when the total number of clusters is a natural number N; Calculating a second result evaluated when the total number of clusters is N + 1; And comparing the first result with the second result to determine the total number of clusters.

The results of evaluating the clusters may include inter-cluster distribution information on the distance between each cluster; And variance information in a cluster of distances between vectors corresponding to each term in a cluster.

An apparatus for establishing a dynamic term identification system according to another embodiment of the present invention includes a transceiver for transmitting and receiving data; And a processor, wherein the processor processes the user-generated content into a form of a term, converts the processed term into a vector of a vector space, extracts candidate words associated with the query based on the vector, and forms a cluster. The total number of clusters may be determined based on a result of evaluating the clusters.

[Equation 1]

N is the number of clusters,

Is the center point of cluster x,

Is the center point from all data objects in cluster x

Is the average value of the distance up to

Is the center point

And center point

It can be the distance between.

In determining the number of total clusters, the processor calculates a first result evaluated when the total number of clusters is a natural number N, and calculates a second result evaluated when the total number of clusters is N + 1. And comparing the first result with the second result to determine the total number of clusters.

The foregoing general description and the following detailed description of the invention are exemplary and intended for further explanation of the invention as described in the claims.

According to an embodiment of the present invention, an apparatus and method for constructing a dynamic term identification system of user-generated content may be provided.

In addition, according to an embodiment of the present invention it is possible to provide an apparatus and method for constructing a dynamic term identification system that dynamically considers the meaning of terms even with time, and can distinguish each meaning even in the case of homonyms.

Effects obtained in the present invention are not limited to the above-mentioned effects, and other effects not mentioned above may be clearly understood by those skilled in the art from the following description. will be.

BRIEF DESCRIPTION OF THE DRAWINGS The accompanying drawings, which are included as part of the detailed description in order to provide a thorough understanding of the present invention, provide an embodiment of the present invention and together with the description, illustrate the technical idea of the present invention.
1 is a flowchart illustrating a method of establishing a dynamic term identification system of user-generated content according to an embodiment of the present invention.
2 illustrates a clustering method according to an embodiment of the present invention.
3 illustrates an example of a related term list when a query input by a user is a doctor.
4 illustrates an example of a related term list when a query input by a user is 'memory'.
FIG. 5 illustrates an example of a related term list when a user input query is 'graphene'.
6 illustrates an example of a related term list when a query input by a user is a combination of a hospital and a doctor.
7 illustrates a computing device in accordance with one embodiment of the present invention.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings, and the same or similar components are denoted by the same reference numerals regardless of the reference numerals, and redundant description thereof will be omitted. The suffixes "module" and "unit" for components used in the following description are given or used in consideration of ease of specification, and do not have distinct meanings or roles from each other. In addition, in describing the embodiments disclosed herein, when it is determined that the detailed description of the related known technology may obscure the gist of the embodiments disclosed herein, the detailed description thereof will be omitted. In addition, the accompanying drawings are intended to facilitate understanding of the embodiments disclosed herein, but are not limited to the technical spirit disclosed herein by the accompanying drawings, all changes included in the spirit and scope of the present invention. It should be understood to include equivalents and substitutes.

Terms including ordinal numbers such as first and second may be used to describe various components, but the components are not limited by the terms. The terms are used only for the purpose of distinguishing one component from another.

When a component is referred to as being "connected" or "connected" to another component, it may be directly connected to or connected to that other component, but it may be understood that other components may be present in between. Should be. On the other hand, when a component is said to be "directly connected" or "directly connected" to another component, it should be understood that there is no other component in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, the terms "comprises" or "having" are intended to indicate that there is a feature, number, step, operation, component, part, or combination thereof described in the specification, and one or more other features. It is to be understood that the present invention does not exclude the possibility of the presence or the addition of numbers, steps, operations, components, components, or a combination thereof.

1 is a flowchart illustrating a method of establishing a dynamic term identification system of user-generated content according to an embodiment of the present invention.

Referring to FIG. 1, first, user-generated content (data) is preprocessed in operation S110. That is, the user-generated content may be processed into a form of terms that are easy to analyze.

User-generated contents may be composed of various types of data such as text, images, and videos generated by users. In particular, the development of social media has generated a lot of unstructured text data.

As a specific example of preprocessing user-generated content, first, a stopword may be deleted from a user-generated content and a term may be extracted.

Terminology refers to words that are not suitable as search terms or that are not necessary for document analysis. For example, some or all of special characters, numbers, emoticons, articles, prepositions, surveys, conjunctions, and the like can be determined as stopwords.

In addition, after deleting the stopwords, the morphological analysis may be further performed. Morphological analysis breaks a document or sentence into morphemes, the smallest semantic units. Finally, sentences can be broken down into words in parts of speech. Generally, a term is a representative of a book or a document, and the term is defined below as a noun.

Next, the term extracted in step S120 is projected into the vector space. That is, terms can be converted into vectors in vector space.

For example, terms extracted in step S110 may be projected in a vector space using word-to-vector processing. Word2vec is a word embedding technique that mathematically expresses the semantic association between words. In other words, Word2Vec is an algorithm that can efficiently estimate the meaning of words in a vector space.

Word2vec can perform two-step neural network training using continuous bag-of-words or skip-gram models to project words into vector space.

The CBOW model uses the following logic flow: If you have a sentence that says, "I ate ice cream right away and 'X' was so cold, it was hard to eat." Can be. Since the mathematical calculation method of the CBOW model does not correspond to the gist of the present invention, a detailed description thereof will be omitted.

The skip-gram model can be thought of as a model in the opposite direction to the CBOW model. One word is given to infer some of the surrounding words. Like the CBOW model, the mathematical calculation method of the skip-gram model does not correspond to the gist of the present invention, and thus detailed description thereof will be omitted.

When Word2vec processing is performed, words with higher associations between words are projected in more similar vector spaces. Thus, the vector information can be used to infer the relationship between the two terms. For example, cosine-similarity can be used to infer the similarity between two terms. The higher the relevance, such as the similar use of the word or the same document, the higher the similarity will be.

The resulting vector of Word2Vec has the basic properties of a vector. For example, when there is a first vector for the first word and a second vector for the second word, the two vectors may be added to calculate the third vector. In addition, since the third vector also has a unique vector value, it is possible to operate with other word vectors. In this way, Word2Vec can be used to compare words and their relationships and apply them to dynamic term identification schemes and clustering.

However, the conventional Word2vec only provides an association of terms, and it is difficult to provide a more precise relationship. In particular, the homonym can confuse the user in that the words associated with the two meanings can be provided simultaneously without distinguishing them. According to the present invention, the following steps can be alleviated and the coherence of terms can be enhanced.

Next, in step S130, the candidate word associated with the query word is extracted using vector information, and clustering is performed.

Clustering is data that can be used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Analytical skills. Clustering means dividing one data into several subsets or clusters. The data in each cluster shares some common Traits. Clustering can be accomplished by calculating similarity or proximity using several distance measurements.

2 illustrates a clustering method according to an embodiment of the present invention. According to an embodiment of the present invention, the clustering process as shown in FIG. 2 may be performed to utilize the relationship between all terms and to increase the consistency among the terms.

Referring to FIG. 2, a top-n word list having a high correlation with a query input by a user is extracted based on vector information of a term obtained in step S120. Among them, the top-1 word that is most relevant to the user's query is defined as the head of the cluster, and the relationship is compared with other terms in the association list. At this time, the threshold value is defined using the maximum value and the minimum value, and the first cluster is formed with terms above the threshold value.

The next cluster defines the head with the least similarity to the head of the first cluster and repeats it. You can repeat this method to get as many clusters as you want. In general, the meaning or use of terms is not infinite, so the number of clusters should be at least two. Clustering reflects the relationship between each term, and candidate terms form a cluster. Through this, it is possible to increase the consistency of classification or related words for terms with homonyms.

Referring back to FIG. 1, in step S140, clusters may be evaluated and the total number of clusters may be determined based on the results.

You can use prior knowledge or information to determine the appropriate total number of clusters. In other words, the number of clusters determined based on the prior information is used.

On the other hand, the information of the cluster itself may be used to determine the appropriate number of clusters. For example, distributed information within and between clusters may be used. In the case of inter cluster variance information, the distance of each cluster is maximized. As clustering works well, each cluster can be thought to be clearly separated. That is, the distance between clusters is far. On the other hand, inner-cluster variance information refers to the distance between terms forming one cluster. If variance in clusters is minimized, one cluster can be said to form clusters of terms of similar nature.

In the present invention, clustering is used to classify other terms that are similar to the query entered by the user. That is, an object of the present invention is to cluster related terms according to the use of related terms, and to provide information about the related terms. Therefore, there is a need for a clustering and evaluation method that can properly classify the meanings of related terms found according to a user's query term.

For example, the Elbow method can be used to determine the appropriate number of clusters. The elbow method is a method of determining the number of clusters by comparing a cluster with a previous cluster. For example, when increasing the number of clusters (K) on the X-axis in order from 1 to graphing the evaluation value on the Y-axis, the K value at the point where the graph is bent (elbow) is selected. In this case, a DB index (Davies-Bouldin index) may be used to evaluate cluster suitability according to the number of clusters.

The DB index is a measure of how well clustering works. It is a method of clustering datasets and evaluating them. The DB index has high intra-cluster similarity in clusters and gives low scores to results with low inter-cluster similarity between clusters.

The DB index can be calculated using Equation 1 below.

n is the number of clusters,

Is the center point of cluster x,

Is the center point from all data objects in cluster x

Is the average of the distances to

Is the center point

And center point

Distance between.

Clustering that has high similarity in clusters and low similarity between clusters has a low DB index. Therefore, a clustering method with a low DB index can be evaluated as a good clustering method.

In the present invention, the DB index is calculated while increasing the number of clusters, and the number of clusters can be determined by comparing the value of the previous stage with the value of the current stage. The method using the DB index is not an evaluation of the true true value of the cluster, but it has the advantage of being able to quickly determine the number.

3 through 6 illustrate examples of clustering terms associated with a query input by a user using the above-described method. The terms associated with the query words refer to terms that have high similarity with the query word.

The related term search may be performed in two ways: 'single term search' (example of FIGS. 3 to 5) and 'compound term search' (example of FIG. 6). Single term search is when the user's query is one specific term and performs similarity measurement with other terms based on the word vector corresponding to the user's query. The compound term search refers to a case where a plurality of words are input by a query.

3 illustrates an example of a related term list when a query input by a user is a doctor. Referring to FIG. 3, it can be seen that the related terms of the first group (group 1) are related to a doctor of a hospital such as a patient, a medical staff, a medical treatment, a disease name, and a nurse. On the other hand, the second and third groups can see that words related to the intention of decision making such as decision makers, decision makers, decision makers, and choices are listed.

That is, when the user inputs a query of 'doctor', 'doctor' may mean a doctor of decision making or a doctor of a hospital. According to the present invention, it is possible to assist the user in determining the linked terms clustered differently for each homonym to the user.

In addition, when determining the number of clustering in the example of Figure 3, while increasing the number of clustering to two (for example, cluster 1 and the remaining cluster), three (group 1, cluster 2, cluster 3), etc. Can be evaluated separately, and the number of clustering can be determined at the elbow where the change of the evaluation index is the most severe.

4 illustrates an example of a related term list when a query input by a user is 'memory'. In FIG. 4, cluster 1 includes terms related to access, cluster 2 includes terms related to nonvolatile, NAND, and flash, and cluster 3 includes terms related to bits, addresses, and programming.

FIG. 5 illustrates an example of a related term list when a user input query is 'graphene'. In FIG. 5, cluster 1 includes terms associated with carbon nano, nano, carbon nano, etc., cluster 2 includes terms associated with graphene, graphene layers, nanowires, etc., and cluster 3 includes terms related to exfoliation and direct growth. .

As shown in FIGS. 4 and 5, each cluster may represent linked words of terms having different meanings, and may also suggest various methodologies associated with domains.

FIG. 6 illustrates a result of a complex query of a “hospital” and a “doctor”, unlike FIGS. 3 to 5, which were a case of a single term search. In the example of FIG. 3, a single query 'doctor' has two meanings, and it can be seen that an intention and a doctor in a hospital are separately clustered. However, if the word 'hospital' is entered as 'doctor', the meaning of 'doctor' becomes clear as the doctor of the hospital and the related terms related to 'doctor' and 'hospital' can be found. In the case of the first cluster (Cluster 1), a list of terms related to care was created, and the second cluster (Cluster 2) was clustered into domain-linked terms in the hospital such as nurse and care.

The present invention can provide an associated term search service that can dynamically identify terms through learning about user input content or documents. The association term search service allows a user to be provided with term information related to the query. Also, the cluster terminology information related to the query may be used to find information about the cluster term of the query term and the related term.

7 illustrates a computing device 700 in accordance with one embodiment of the present invention. Referring to FIG. 7, a computing device 700 according to an embodiment of the present invention may include a processor 710, a memory 720, and a transceiver 730.

The processor 710 may control the operation of the computing device 700. In addition, the processor 710 may be configured to implement the procedures and / or methods proposed in the present invention. The memory 720 may store information for implementing the procedures and / or methods proposed by the present invention, and may be a nonvolatile or volatile memory. The transceiver 730 may transmit or receive various signals, data, and information with an external device or another device therein.

The document similarity analysis method according to various embodiments of the present disclosure may be implemented in the form of program instructions that may be executed through computer means such as various servers. In addition, a program for executing the document similarity analysis method according to the present invention may be installed in computer means and recorded in a computer readable medium. Computer-readable media may include, alone or in combination with the program instructions, data files, data structures, and the like. The program instructions recorded on the computer readable medium may be those specially designed and constructed for the present invention, or may be known and available to those skilled in the computer software arts. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape, optical media such as CD-ROMs, DVDs, and magnetic disks, such as floppy disks. Hardware devices specially configured to store and execute program instructions such as magneto-optical media and ROM, RAM, flash memory and the like.

So far, the present invention has been described with reference to the embodiments. However, the above description of the present invention is for illustration, and those skilled in the art may understand that the present invention can be easily modified in other specific forms without changing the technical spirit or essential features of the present invention. Could be. Therefore, it should be understood that the embodiments described above are exemplary in all respects and not restrictive. The scope of the invention is indicated by the claims rather than the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the invention.

Claims

In the method of establishing a dynamic term identification system,
Processing the user generated content in the form of terms;
Converting the processed term into a vector of vector space;
Extracting candidate words associated with a query based on the vector to form a cluster; And
Determining the total number of clusters based on the result of evaluating the clusters
Including,
The result of evaluating the cluster includes a DBs index (Davies Bouldin index) for the cluster,
The DB index (Davies Bouldin index, DB) is determined using the following equation 1,
[Equation 1]

N is the number of clusters,

Is the center point of cluster x,

Is the center point from all data objects in cluster x

Is the average value of the distance up to

Is the center point

And center point

And a distance between the two.

delete

The method of claim 1,
Determining the total number of clusters,
Calculating a first result evaluated when the total number of clusters is a natural number N;
Calculating a second result evaluated when the total number of clusters is N + 1; And
Comparing the first result and the second result to determine the total number of clusters.

The method of claim 1,
The result of evaluating the cluster,
Inter-cluster distribution information on the distance between each cluster; And
A method of constructing a dynamic term identification scheme, determined based on variance information in a cluster of distances between vectors corresponding to each term in a cluster.

In the apparatus for constructing a dynamic term identification system,
A transceiver for transmitting and receiving data; And
Includes a processor,
The processor is
Process user-generated content into terms,
Convert the processed term into a vector of vector space,
Extracting candidate words associated with the query based on the vector to form a cluster,
Configured to determine the total number of clusters based on the result of evaluating the clusters,
The result of evaluating the cluster includes a DBes index (Davies Bouldin index) for the cluster,
The DB index (Davies Bouldin index, DB) is determined using the following equation 1,
[Equation 1]

N is the number of clusters,

Is the center point of cluster x,

Is the center point from all data objects in cluster x

Is the average value of the distance up to

Is the center point

And center point

And a distance between the two.

delete

The method of claim 6,
The processor is configured to determine the total number of clusters,
Calculating a first result evaluated when the total number of clusters is a natural number N,
Calculating a second result evaluated when the total number of clusters is N + 1,
And compare the first result and the second result to determine the total number of clusters.

The method of claim 6,
The result of evaluating the cluster,
Inter-cluster distribution information on the distance between each cluster; And
And determine based on variance information in a cluster about a distance between vectors corresponding to each term in one cluster.