KR102025819B1 - Device and method for constructing dynamic-terms identification system in user gererated contents - Google Patents

Device and method for constructing dynamic-terms identification system in user gererated contents

Info

Publication number
KR102025819B1
Authority
KR
South Korea
Prior art keywords
cluster
clusters
result
total number
term
Prior art date
Application number
KR1020180036501A
Other languages
Korean (ko)
Other versions
KR20180110639A (en)
Inventor
한상용
서지완
최승진
김준호
Original Assignee
중앙대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 중앙대학교 산학협력단 filed Critical 중앙대학교 산학협력단
Publication of KR20180110639A publication Critical patent/KR20180110639A/en
Application granted granted Critical
Publication of KR102025819B1 publication Critical patent/KR102025819B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an apparatus and method for constructing a dynamic term identification scheme, and more particularly, to an apparatus and method for constructing a dynamic term identification scheme for user-generated content. The present invention can provide an apparatus and method for constructing a dynamic term identification system that can consider the meanings of terms dynamically even as they change over time and can distinguish each meaning even in the case of homonyms.

Description

DEVICE AND METHOD FOR CONSTRUCTING DYNAMIC-TERMS IDENTIFICATION SYSTEM IN USER GERERATED CONTENTS

The present invention relates to an apparatus and method for constructing a dynamic term identification scheme, and more particularly, to an apparatus and method for constructing a dynamic term identification scheme for user-generated content.

With the development of IT technology, various query retrieval systems are being serviced. One of the most popular services used by people around the world is Google. Such a query search system finds the most similar contents by using a query input by a user. However, if the user does not have a good understanding of the domain, it may be difficult to reach the desired information. In order to easily reach the information the user wants, the system needs to provide the appropriate words.

In order to provide a suitable term for the user's initial query, a term identification system is required. In addition, a term identification system is also required when analyzing and utilizing documents.

At this time, when using a static term identification system that does not reflect the generation and change of terms over time, the system may not operate properly. An example of a static term identification scheme is WordNet. WordNet is a word dictionary that categorizes words into a set of meanings to enable associations between words. In the case of using a static term identification system such as WordNet, the system may not operate properly when a new term emerges or the meaning or use of the term changes. In order to solve these problems, it is necessary to construct a term identification system dynamically.

Background art of the present invention is disclosed in Republic of Korea Patent Publication No. 2009-0048261 (published May 13, 2009).

It is an object of the present invention to solve the above and other problems.

Another object of the present invention is to provide an apparatus and method for constructing a dynamic term identification system that dynamically considers the meanings of terms even as they change over time, and that distinguishes each meaning even in the case of homonyms.

The technical problems to be achieved in the present invention are not limited to the technical problems mentioned above, and other technical problems not mentioned above will be clearly understood by those skilled in the art from the following description.

In order to solve the above problems, a method for constructing a dynamic term identification system according to an embodiment of the present invention comprises the steps of processing the user-generated content in the form of terms; Converting the processed term into a vector of vector space; Extracting candidate words associated with a query based on the vector to form a cluster; And determining the total number of clusters based on the result of evaluating the clusters.

The result of evaluating the clusters may include a DB index (Davies-Bouldin index) for the clusters.

The DB index (Davies-Bouldin index, DB) is determined using Equation 1 below,

[Equation 1]

$$DB = \frac{1}{N} \sum_{x=1}^{N} \max_{y \neq x} \left( \frac{\sigma_x + \sigma_y}{d(c_x, c_y)} \right)$$

where $N$ is the number of clusters, $c_x$ is the center point of cluster $x$, $\sigma_x$ is the average value of the distances from all data objects in cluster $x$ to the center point $c_x$, and $d(c_x, c_y)$ may be the distance between center point $c_x$ and center point $c_y$.

The determining of the total number of clusters may include: calculating a first result evaluated when the total number of clusters is a natural number N; Calculating a second result evaluated when the total number of clusters is N + 1; And comparing the first result with the second result to determine the total number of clusters.

The result of evaluating the clusters may include inter-cluster variance information on the distances between the clusters, and intra-cluster variance information on the distances between the vectors corresponding to the terms within one cluster.

An apparatus for constructing a dynamic term identification system according to another embodiment of the present invention includes a transceiver for transmitting and receiving data; and a processor, wherein the processor is configured to process the user-generated content into the form of terms, convert the processed terms into vectors of a vector space, extract candidate words associated with a query based on the vectors to form clusters, and determine the total number of clusters based on a result of evaluating the clusters.

The result of evaluating the clusters may include a DB index (Davies-Bouldin index) for the clusters.

The DB index (Davies-Bouldin index, DB) is determined using Equation 1 below,

[Equation 1]

$$DB = \frac{1}{N} \sum_{x=1}^{N} \max_{y \neq x} \left( \frac{\sigma_x + \sigma_y}{d(c_x, c_y)} \right)$$

where $N$ is the number of clusters, $c_x$ is the center point of cluster $x$, $\sigma_x$ is the average value of the distances from all data objects in cluster $x$ to the center point $c_x$, and $d(c_x, c_y)$ may be the distance between center point $c_x$ and center point $c_y$.

In determining the total number of clusters, the processor calculates a first result evaluated when the total number of clusters is a natural number N, calculates a second result evaluated when the total number of clusters is N + 1, and compares the first result with the second result to determine the total number of clusters.

The result of evaluating the clusters may include inter-cluster variance information on the distances between the clusters, and intra-cluster variance information on the distances between the vectors corresponding to the terms within one cluster.

The foregoing general description and the following detailed description of the invention are exemplary and intended for further explanation of the invention as described in the claims.

According to an embodiment of the present invention, an apparatus and method for constructing a dynamic term identification system of user-generated content may be provided.

In addition, according to an embodiment of the present invention it is possible to provide an apparatus and method for constructing a dynamic term identification system that dynamically considers the meaning of terms even with time, and can distinguish each meaning even in the case of homonyms.

Effects obtained in the present invention are not limited to the above-mentioned effects, and other effects not mentioned above may be clearly understood by those skilled in the art from the following description.

BRIEF DESCRIPTION OF THE DRAWINGS The accompanying drawings, which are included as part of the detailed description in order to provide a thorough understanding of the present invention, provide an embodiment of the present invention and together with the description, illustrate the technical idea of the present invention.
FIG. 1 is a flowchart illustrating a method of establishing a dynamic term identification system of user-generated content according to an embodiment of the present invention.
FIG. 2 illustrates a clustering method according to an embodiment of the present invention.
FIG. 3 illustrates an example of a related term list when a query input by a user is 'doctor'.
FIG. 4 illustrates an example of a related term list when a query input by a user is 'memory'.
FIG. 5 illustrates an example of a related term list when a query input by a user is 'graphene'.
FIG. 6 illustrates an example of a related term list when a query input by a user is a combination of 'hospital' and 'doctor'.
FIG. 7 illustrates a computing device in accordance with one embodiment of the present invention.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings; the same or similar components are denoted by the same reference numerals, and redundant descriptions thereof are omitted. The suffixes "module" and "unit" for components used in the following description are given or used only for ease of writing the specification, and do not by themselves have distinct meanings or roles. In describing the embodiments disclosed herein, detailed descriptions of related known technologies are omitted when they would obscure the gist of the embodiments disclosed herein. The accompanying drawings are intended only to facilitate understanding of the embodiments disclosed herein; the technical spirit disclosed herein is not limited by the accompanying drawings, and should be understood to include all changes, equivalents, and substitutes falling within the spirit and scope of the present invention.

Terms including ordinal numbers such as first and second may be used to describe various components, but the components are not limited by the terms. The terms are used only for the purpose of distinguishing one component from another.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between. On the other hand, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that there are no other components in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, the terms "comprises" or "having" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not exclude the possibility of the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIG. 1 is a flowchart illustrating a method of establishing a dynamic term identification system of user-generated content according to an embodiment of the present invention.

Referring to FIG. 1, first, user-generated content (data) is preprocessed in operation S110. That is, the user-generated content may be processed into a form of terms that are easy to analyze.

User-generated contents may be composed of various types of data such as text, images, and videos generated by users. In particular, the development of social media has generated a lot of unstructured text data.

As a specific example of preprocessing user-generated content, first, a stopword may be deleted from a user-generated content and a term may be extracted.

A stopword refers to a word that is not suitable as a search term or that is not necessary for document analysis. For example, some or all of special characters, numbers, emoticons, articles, prepositions, postpositional particles, conjunctions, and the like can be determined to be stopwords.

In addition, after the stopwords are deleted, morphological analysis may further be performed. Morphological analysis breaks a document or sentence into morphemes, the smallest semantic units; finally, sentences can be broken down into words tagged with their parts of speech. In general, nouns are representative of the content of a document, so a term is defined below as a noun.
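By way of illustration only, a minimal Python sketch of this preprocessing step is given below; the stopword list, the regular expression, and the restriction to simple token filtering (instead of a full morphological analyzer) are assumptions made for the example, not details taken from the patent.

```python
import re

# Illustrative stopword list; a real system would use a language-specific list
# covering articles, prepositions, postpositional particles, conjunctions, etc.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "or", "to", "is", "was", "it"}

def preprocess(document: str) -> list:
    """Delete stopword-like noise from user-generated content and return candidate terms."""
    # Drop special characters, numbers, and emoticons, keeping only letters and spaces.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", document)
    tokens = cleaned.lower().split()
    # Morphological analysis would further split tokens into morphemes and keep nouns;
    # here we only remove stopwords.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("I ate ice cream right away, and it was so cold!"))
```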

Next, in step S120, the terms extracted in step S110 are projected into a vector space. That is, the terms can be converted into vectors in the vector space.

For example, the terms extracted in step S110 may be projected into a vector space using Word2Vec processing. Word2Vec is a word embedding technique that mathematically expresses the semantic associations between words. In other words, Word2Vec is an algorithm that can efficiently estimate the meaning of words in a vector space.

Word2Vec can train a two-layer neural network using the continuous bag-of-words (CBOW) or skip-gram model to project words into the vector space.

The CBOW model infers a given word from its surrounding words. For example, given the sentence "I ate ice cream right away, and 'X' was so cold that it was hard to eat," the surrounding words can be used to infer the word 'X' (here, a word referring to the ice cream). Since the mathematical calculation method of the CBOW model does not correspond to the gist of the present invention, a detailed description thereof will be omitted.

The skip-gram model can be thought of as a model working in the opposite direction to the CBOW model: one word is given, and some of the surrounding words are inferred. As with the CBOW model, the mathematical calculation method of the skip-gram model does not correspond to the gist of the present invention, and thus a detailed description thereof will be omitted.

When Word2Vec processing is performed, words with stronger associations are projected to more similar positions in the vector space. Thus, the vector information can be used to infer the relationship between two terms. For example, cosine similarity can be used to infer the similarity between two terms; the more closely related the terms are, for example through similar usage or co-occurrence in the same documents, the higher the similarity will be.

The resulting vectors of Word2Vec have the basic properties of vectors. For example, when there is a first vector for a first word and a second vector for a second word, the two vectors may be added to calculate a third vector. Since the third vector also has its own vector value, it can be operated on with other word vectors. In this way, Word2Vec can be used to compare words and their relationships and to apply them to dynamic term identification schemes and clustering.
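The following sketch illustrates this step using the open-source gensim library (version 4.x API assumed); the library choice, the toy corpus, and the hyperparameter values are assumptions for illustration and are not specified in the patent.

```python
from gensim.models import Word2Vec

# Toy corpus: each inner list is the preprocessed term sequence of one document.
corpus = [
    ["doctor", "patient", "nurse", "hospital", "treatment"],
    ["doctor", "decision", "intention", "choice"],
    ["memory", "nand", "flash", "nonvolatile"],
]

# Train a small skip-gram model (sg=1); sg=0 would select the CBOW model instead.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=200)

# Cosine similarity between two terms, and the top-n terms most related to a query.
print(model.wv.similarity("doctor", "nurse"))
print(model.wv.most_similar("doctor", topn=5))

# Word vectors keep ordinary vector properties, so two vectors can be added and
# the sum compared against other word vectors.
combined = model.wv["doctor"] + model.wv["hospital"]
print(model.wv.similar_by_vector(combined, topn=5))
```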

However, conventional Word2Vec only provides the association of terms, and it is difficult to obtain a more precise relationship from it. In particular, a homonym can confuse the user, in that words associated with two different meanings can be provided simultaneously without being distinguished. According to the present invention, these problems can be alleviated through the following steps, and the coherence of terms can be enhanced.

Next, in step S130, the candidate word associated with the query word is extracted using vector information, and clustering is performed.

Clustering is a data analysis technique used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Clustering means dividing one data set into several subsets or clusters. The data in each cluster share some common traits. Clustering can be accomplished by calculating similarity or proximity using various distance measures.

FIG. 2 illustrates a clustering method according to an embodiment of the present invention. According to an embodiment of the present invention, the clustering process shown in FIG. 2 may be performed to utilize the relationships between all terms and to increase the consistency among the terms.

Referring to FIG. 2, a top-n word list having a high correlation with the query input by the user is extracted based on the vector information of the terms obtained in step S120. Among them, the top-1 word that is most relevant to the user's query is defined as the head of the cluster, and its relationship with the other terms in the association list is compared. At this time, a threshold value is defined using the maximum value and the minimum value, and the first cluster is formed from the terms above the threshold value.

The next cluster defines as its head the term with the least similarity to the head of the first cluster, and the process is repeated. This method can be repeated to obtain as many clusters as desired. In general, the meanings or uses of a term are not unlimited, and the number of clusters should be at least two. Clustering reflects the relationship between the terms, and the candidate terms form clusters. Through this, it is possible to increase the consistency of the classification and of the related words provided for terms with homonyms.
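One possible reading of the clustering procedure of FIG. 2 is sketched below in Python; the use of cosine similarity, the midpoint threshold between the maximum and minimum similarity, and the fixed target number of clusters are illustrative assumptions, since the patent leaves these choices open.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_candidates(query_vec, candidates, n_clusters=2):
    """candidates: dict mapping each top-n related term to its word vector.
    Builds clusters head-first, as described for FIG. 2, and returns lists of terms."""
    remaining = dict(candidates)
    # Head of the first cluster: the candidate most similar to the user's query.
    head = max(remaining, key=lambda t: cosine(query_vec, remaining[t]))
    clusters = []
    for _ in range(n_clusters):
        sims = {t: cosine(remaining[head], v) for t, v in remaining.items()}
        # Threshold defined from the maximum and minimum similarity values
        # (the midpoint is used here as one illustrative choice).
        threshold = (max(sims.values()) + min(sims.values())) / 2
        members = [t for t, s in sims.items() if s >= threshold]
        clusters.append(members)
        for t in members:
            remaining.pop(t)
        if not remaining:
            break
        # Head of the next cluster: the remaining term least similar to the previous head.
        prev_head_vec = candidates[head]
        head = min(remaining, key=lambda t: cosine(prev_head_vec, remaining[t]))
    return clusters

# Example usage (vectors could come from the Word2Vec sketch above):
# top_n = {t: model.wv[t] for t, _ in model.wv.most_similar("doctor", topn=20)}
# print(cluster_candidates(model.wv["doctor"], top_n, n_clusters=3))
```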

Referring back to FIG. 1, in step S140, clusters may be evaluated and the total number of clusters may be determined based on the results.

Prior knowledge or information can be used to determine the appropriate total number of clusters; in other words, a number of clusters determined based on prior information is used.

On the other hand, the information of the clusters themselves may be used to determine the appropriate number of clusters. For example, variance information within and between clusters may be used. Inter-cluster variance information concerns the distances between clusters: when clustering works well, each cluster can be considered clearly separated, that is, the distances between clusters are large. Intra-cluster variance information, on the other hand, refers to the distances between the terms forming one cluster: when the variance within clusters is minimized, each cluster can be said to group terms of a similar nature.
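For concreteness, the two quantities described above can be computed as in the short numpy sketch below; the use of Euclidean distance to the cluster center point is an assumption, and any other distance measure could be substituted.

```python
import numpy as np

def intra_cluster_variance(vectors: np.ndarray) -> float:
    """Average distance from the vectors of one cluster to the cluster's center point."""
    centroid = vectors.mean(axis=0)
    return float(np.linalg.norm(vectors - centroid, axis=1).mean())

def inter_cluster_distance(vectors_a: np.ndarray, vectors_b: np.ndarray) -> float:
    """Distance between the center points of two clusters."""
    return float(np.linalg.norm(vectors_a.mean(axis=0) - vectors_b.mean(axis=0)))
```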

In the present invention, clustering is used to classify other terms that are similar to the query entered by the user. That is, an object of the present invention is to cluster related terms according to the use of related terms, and to provide information about the related terms. Therefore, there is a need for a clustering and evaluation method that can properly classify the meanings of related terms found according to a user's query term.

For example, the elbow method can be used to determine the appropriate number of clusters. The elbow method determines the number of clusters by comparing each clustering result with the previous one: when the number of clusters (K) is increased in order from 1 along the X-axis and the evaluation value is plotted on the Y-axis, the K value at the point where the graph bends (the elbow) is selected. In this case, a DB index (Davies-Bouldin index) may be used to evaluate cluster suitability according to the number of clusters.

The DB index is a measure of how well a clustering works; it evaluates the result of clustering a data set. The DB index gives low scores to results with high intra-cluster similarity within clusters and low inter-cluster similarity between clusters.

The DB index can be calculated using Equation 1 below.

$$DB = \frac{1}{N} \sum_{x=1}^{N} \max_{y \neq x} \left( \frac{\sigma_x + \sigma_y}{d(c_x, c_y)} \right)$$

where $N$ is the number of clusters, $c_x$ is the center point of cluster $x$, $\sigma_x$ is the average of the distances from all data objects in cluster $x$ to the center point $c_x$, and $d(c_x, c_y)$ is the distance between center point $c_x$ and center point $c_y$.

Clustering that has high similarity in clusters and low similarity between clusters has a low DB index. Therefore, a clustering method with a low DB index can be evaluated as a good clustering method.

In the present invention, the DB index is calculated while increasing the number of clusters, and the number of clusters can be determined by comparing the value of the previous stage with the value of the current stage. The method using the DB index is not an evaluation against the true (ground-truth) labels of the clusters, but it has the advantage that the number of clusters can be determined quickly.
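The following sketch shows one way to realize this selection loop using scikit-learn's davies_bouldin_score; k-means is used here only as a stand-in for the head-based clustering of FIG. 2, and the stopping rule (stop when the index no longer improves over the previous stage) is an illustrative interpretation of the comparison described above, not the patent's prescribed rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def choose_num_clusters(vectors: np.ndarray, max_k: int = 10) -> int:
    """Increase the number of clusters and compare the DB index of each stage
    with that of the previous stage; a lower DB index means better clustering."""
    prev_score = None
    best_k = 2
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        score = davies_bouldin_score(vectors, labels)
        if prev_score is not None and score >= prev_score:
            # The current stage is no better than the previous one: keep the previous k.
            best_k = k - 1
            break
        prev_score = score
        best_k = k
    return best_k

# Example with random data standing in for word vectors (three synthetic groups):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 50)) for c in (0.0, 3.0, 6.0)])
print(choose_num_clusters(X))
```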

FIGS. 3 through 6 illustrate examples of clustering the terms associated with a query input by a user using the above-described method. The terms associated with the query refer to terms that have a high similarity with the query word.

The related term search may be performed in two ways: 'single term search' (the examples of FIGS. 3 to 5) and 'compound term search' (the example of FIG. 6). A single term search is a case where the user's query is one specific term, and similarity is measured against other terms based on the word vector corresponding to the user's query. A compound term search refers to a case where a plurality of words is input as the query.

FIG. 3 illustrates an example of a related term list when the query input by the user is 'doctor' (the Korean query term '의사' is a homonym meaning both a medical doctor and an intention in decision making). Referring to FIG. 3, the related terms of the first group (group 1) are related to a doctor of a hospital, such as patient, medical staff, medical treatment, disease name, and nurse. On the other hand, the second and third groups list words related to decision-making intent, such as decision making, decision maker, and choice.

That is, when the user inputs the query 'doctor', 'doctor' may mean either the intention involved in decision making or a doctor of a hospital. According to the present invention, the related terms clustered separately for each meaning of the homonym can be presented to assist the user.

In addition, when determining the number of clusters in the example of FIG. 3, the clustering can be evaluated while increasing the number of clusters to two (for example, cluster 1 and the remaining terms), three (group 1, cluster 2, cluster 3), and so on, and the number of clusters can be determined at the elbow where the change in the evaluation index is the sharpest.

FIG. 4 illustrates an example of a related term list when the query input by the user is 'memory'. In FIG. 4, cluster 1 includes terms related to access, cluster 2 includes terms related to nonvolatile, NAND, and flash, and cluster 3 includes terms related to bits, addresses, and programming.

FIG. 5 illustrates an example of a related term list when the query input by the user is 'graphene'. In FIG. 5, cluster 1 includes terms associated with carbon nano, nano, and the like, cluster 2 includes terms associated with graphene, graphene layers, nanowires, and the like, and cluster 3 includes terms related to exfoliation and direct growth.

As shown in FIGS. 4 and 5, each cluster may represent the associated words for a different meaning of the term, and may also suggest various methodologies associated with the domain.

FIG. 6 illustrates the result of a compound query of 'hospital' and 'doctor', unlike FIGS. 3 to 5, which illustrate single term searches. In the example of FIG. 3, the single query 'doctor' has two meanings, and the intention sense and the hospital-doctor sense are clustered separately. However, when the word 'hospital' is entered together with 'doctor', the meaning of 'doctor' becomes clear as the doctor of a hospital, and related terms associated with both 'doctor' and 'hospital' can be found. In the case of the first cluster (cluster 1), a list of terms related to medical treatment was created, and the second cluster (cluster 2) was clustered into hospital-domain terms such as nurse and care.
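As a concrete illustration of such a compound query, the sketch below combines the word vectors of the query terms and searches near the combined vector, reusing a gensim KeyedVectors object such as model.wv from the earlier Word2Vec sketch (again an assumption about the underlying library, not a detail of the patent).

```python
import numpy as np

def compound_search(wv, query_terms, topn=10):
    """Add the word vectors of several query terms and return the terms most
    similar to the combined vector (wv: a gensim KeyedVectors object)."""
    combined = np.sum([wv[t] for t in query_terms], axis=0)
    return wv.similar_by_vector(combined, topn=topn)

# Example, reusing the model trained in the earlier sketch:
# print(compound_search(model.wv, ["hospital", "doctor"]))
```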

The present invention can provide a related term search service that can dynamically identify terms by learning from user-generated content or documents. The related term search service allows the user to be provided with term information related to the query. In addition, the cluster information of the terms related to the query may be used to find information about the clusters of the query term and its related terms.

FIG. 7 illustrates a computing device 700 in accordance with one embodiment of the present invention. Referring to FIG. 7, the computing device 700 according to an embodiment of the present invention may include a processor 710, a memory 720, and a transceiver 730.

The processor 710 may control the operation of the computing device 700. In addition, the processor 710 may be configured to implement the procedures and/or methods proposed in the present invention. The memory 720 may store information for implementing the procedures and/or methods proposed in the present invention, and may be a nonvolatile or volatile memory. The transceiver 730 may transmit various signals, data, and information to, and receive them from, an external device or another internal device.

The dynamic term identification system construction method according to various embodiments of the present disclosure may be implemented in the form of program instructions that can be executed by computer means such as various servers. In addition, a program for executing the method according to the present invention may be installed in the computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in the computer software art. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

So far, the present invention has been described with reference to the embodiments. However, the above description of the present invention is illustrative, and those skilled in the art will understand that the present invention can be easily modified into other specific forms without changing the technical spirit or essential features of the present invention. Therefore, it should be understood that the embodiments described above are exemplary in all respects and not restrictive. The scope of the invention is indicated by the claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the invention.

Claims (10)

In the method of establishing a dynamic term identification system,
Processing the user generated content in the form of terms;
Converting the processed term into a vector of vector space;
Extracting candidate words associated with a query based on the vector to form a cluster; And
Determining the total number of clusters based on the result of evaluating the clusters
Including,
The result of evaluating the clusters includes a DB index (Davies-Bouldin index) for the clusters,
The DB index (Davies-Bouldin index, DB) is determined using Equation 1 below,
[Equation 1]

$$DB = \frac{1}{N} \sum_{x=1}^{N} \max_{y \neq x} \left( \frac{\sigma_x + \sigma_y}{d(c_x, c_y)} \right)$$

where $N$ is the number of clusters, $c_x$ is the center point of cluster $x$, $\sigma_x$ is the average value of the distances from all data objects in cluster $x$ to the center point $c_x$, and $d(c_x, c_y)$ is the distance between center point $c_x$ and center point $c_y$.
(Deleted)

(Deleted)

The method of claim 1,
Determining the total number of clusters,
Calculating a first result evaluated when the total number of clusters is a natural number N;
Calculating a second result evaluated when the total number of clusters is N + 1; And
Comparing the first result and the second result to determine the total number of clusters.
The method of claim 1,
wherein the result of evaluating the clusters is determined based on:
inter-cluster variance information on the distances between the clusters; and
intra-cluster variance information on the distances between the vectors corresponding to the terms within one cluster.
In the apparatus for constructing a dynamic term identification system,
A transceiver for transmitting and receiving data; And
Includes a processor,
wherein the processor is configured to:
process the user-generated content into the form of terms,
convert the processed terms into vectors of a vector space,
extract candidate words associated with a query based on the vectors to form clusters, and
determine the total number of clusters based on a result of evaluating the clusters,
wherein the result of evaluating the clusters includes a DB index (Davies-Bouldin index) for the clusters,
The DB index (Davies-Bouldin index, DB) is determined using Equation 1 below,
[Equation 1]

$$DB = \frac{1}{N} \sum_{x=1}^{N} \max_{y \neq x} \left( \frac{\sigma_x + \sigma_y}{d(c_x, c_y)} \right)$$

where $N$ is the number of clusters, $c_x$ is the center point of cluster $x$, $\sigma_x$ is the average value of the distances from all data objects in cluster $x$ to the center point $c_x$, and $d(c_x, c_y)$ is the distance between center point $c_x$ and center point $c_y$.
(Deleted)

(Deleted)

The apparatus of claim 6,
wherein, in determining the total number of clusters, the processor is configured to:
calculate a first result evaluated when the total number of clusters is a natural number N,
calculate a second result evaluated when the total number of clusters is N + 1, and
compare the first result and the second result to determine the total number of clusters.
The apparatus of claim 6,
wherein the processor is configured to determine the result of evaluating the clusters based on:
inter-cluster variance information on the distances between the clusters; and
intra-cluster variance information on the distances between the vectors corresponding to the terms within one cluster.
KR1020180036501A 2017-03-29 2018-03-29 Device and method for constructing dynamic-terms identification system in user gererated contents KR102025819B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020170040261 2017-03-29
KR20170040261 2017-03-29

Publications (2)

Publication Number Publication Date
KR20180110639A KR20180110639A (en) 2018-10-10
KR102025819B1 true KR102025819B1 (en) 2019-09-26

Family

ID=63876110

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020180036501A KR102025819B1 (en) 2017-03-29 2018-03-29 Device and method for constructing dynamic-terms identification system in user gererated contents

Country Status (1)

Country Link
KR (1) KR102025819B1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102217307B1 (en) * 2019-07-03 2021-02-18 인하대학교 산학협력단 Machine Learning and Semantic Knowledge-based Big Data Analysis: A Novel Healthcare Monitoring Method and Apparatus Using Wearable Sensors and Social Networking Data


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08221447A (en) * 1995-02-10 1996-08-30 Canon Inc Automatic document sorting device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000172701A (en) * 1998-12-04 2000-06-23 Fujitsu Ltd Document data providing device, document data providing system, document data providing method and storage medium recording program providing document data
KR101190230B1 (en) * 2004-07-26 2012-10-12 구글 인코포레이티드 Phrase identification in an information retrieval system

Also Published As

Publication number Publication date
KR20180110639A (en) 2018-10-10


Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right