WO2020111395A1

WO2020111395A1 - Device and method for term clustering of unstructured text data for big data analysis

Info

Publication number: WO2020111395A1
Application number: PCT/KR2019/002778
Authority: WO
Inventors: 황덕열; 공성원; 김세경
Original assignee: (주) 위세아이텍
Priority date: 2018-11-26
Filing date: 2019-03-11
Publication date: 2020-06-04
Also published as: KR101975419B1

Abstract

The present invention relates to a device for term clustering of unstructured text data for big data analysis, and the device for term clustering of unstructured text data for big data analysis may comprise: a database including a data set; a data preprocessing unit for selecting and preprocessing data from a data set included in the database; a recommended term determination unit for separating morphemes of original terms included in preprocessed data, and calculating recommended scores of the original terms; and a data clustering unit for separating phonemes of original terms, calculating the similarity between the respective original terms separated into phonemes, and clustering an original term having a similarity calculation value equal to or greater than a preconfigured threshold value, wherein the recommended term determination unit determines a recommended term among the plurality of original terms on the basis of the recommended scores and a result of the clustering.

Description

Apparatus and method for clustering terminology of unstructured text data for big data analysis

The present invention relates to an apparatus and method for clustering terms of unstructured text data for big data analysis.

In big data analysis, 70% to 80% of the total effort is spent on data preprocessing. The use of big data analysis is growing explosively, but data preprocessing technology has developed more slowly than analysis technology, and the need for automated data preprocessing technology is therefore emerging.

Demand for unstructured data analysis is increasing as public information is opened up, and the importance of preprocessing unstructured data is being emphasized accordingly. Nevertheless, most of this preprocessing is still done manually.

The Fuzzy Matching algorithm is the algorithm most commonly used to calculate the similarity between items of text data. It calculates the similarity using a value derived from the edit distance (Levenshtein distance).

The Fuzzy Matching algorithm simply calculates the similarity between two data items. By applying it, data items whose similarity reaches a certain level can be clustered. However, because the algorithm was developed for English, when it is applied to Korean it calculates similarity in units of syllables rather than phonemes.
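The syllable-unit problem can be illustrated with a minimal sketch of an edit-distance similarity. The function names and the normalization (100 × (1 − distance / longer length)) are assumptions made for illustration; the source does not specify how the edit distance is converted into a similarity value.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization, not taken from the source: 0 to 100."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
# A precomposed Hangul syllable is a single character, so two syllables
# that share their initial and final consonants still count as unrelated.
print(similarity("강", "공"))  # 0.0
```

At the syllable level the two Korean characters differ completely, even though they share two of their three letters.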

In addition, morpheme analysis is the most important technique in natural language processing: it separates words or sentences into morphemes, which are the smallest meaningful units of language, and determines the part of speech of each. By counting morpheme frequencies across a data set, the key morphemes in the data set can be identified.

The background technology of the present application is disclosed in Korean Patent Publication No. 10-2016-0075974.

The present application is intended to solve the above-described problems of the prior art. To overcome the difficulty of preprocessing unstructured text in big data, it aims to provide an apparatus and method for term clustering of unstructured text data for big data analysis that facilitate big data analysis by clustering similar terms in the data.

The present application also aims to provide an apparatus and method for term clustering of unstructured text data for big data analysis that recommend representative terms when clustering the terms in a data set, thereby reducing the time spent on the unstructured data preprocessing that the user would otherwise have to perform manually.

The present application also aims to provide an apparatus and method for term clustering of unstructured text data for big data analysis that automatically recommend a representative term to the user by applying morpheme analysis, thereby reducing the time the user needs to select a representative term.

To solve the problem that the edit distance of Korean data is calculated in syllable units during similarity calculation, the present application further aims to provide an apparatus and method for term clustering of unstructured text data that cluster terms after separating each syllable into phonemes, and that thereby help standardize unstructured data by correcting human errors such as typos.

However, the technical problems to be achieved by the embodiments of the present application are not limited to the technical problems as described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, a term clustering apparatus for unstructured text data for big data analysis according to an embodiment of the present application may include: a database including a data set; a data preprocessing unit that selects data from the data set included in the database and performs preprocessing; a recommended term determination unit that separates the morphemes of the original terms included in the preprocessed data and calculates a recommendation score for each original term; and a data clustering unit that separates the original terms into phonemes, performs a similarity calculation between the phoneme-separated original terms, and clusters the original terms whose similarity value is greater than or equal to a preset threshold. The recommended term determination unit may determine a recommended term among the plurality of original terms based on the recommendation scores and the clustering result.

The data preprocessing unit according to an embodiment of the present application may determine, among the data sets included in the database, a first data set on which to perform term clustering, select one column on which to perform term clustering from among the plurality of column items of the first data set, and preprocess the data of the selected column item.

The data preprocessing unit according to an embodiment of the present application may perform preprocessing that removes duplicate terms included in the selected column and data that does not contain a term.

The recommended term determination unit according to an embodiment of the present application may calculate the recommendation score using a value that quantifies the extraction frequency of the separated morphemes and a weight derived from the separated morphemes, where the weight may be the proportion of the original term that each morpheme occupies.

The recommendation score according to an embodiment of the present application may be calculated by multiplying the value that quantifies the extraction frequency of each separated morpheme by the result of dividing the length of that morpheme by the total length of the first original term, and summing the products over the plurality of morphemes separated from the first original term.

When the data is Korean, the data clustering unit according to an embodiment of the present application may separate it into phonemes according to the Hangul alphabet, namely the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and calculate the similarity using an artificial intelligence-based algorithm.

The term clustering apparatus according to an embodiment of the present application may further include a data result unit that provides the clustering result of the data clustering unit, and the clustering result may include the recommended term, the original terms, and the similarity values.

The term clustering apparatus according to an embodiment of the present application may further include a user input receiver configured to receive term clustering performance information from a user terminal.

The recommended term determination unit according to an embodiment of the present application may, when separating morphemes, perform the separation on the preprocessed data based on part-of-speech determination information included in the term clustering performance information.

The term clustering method for unstructured text data for big data analysis according to an embodiment of the present application may include: selecting data from a data set included in a database and performing preprocessing; separating the morphemes of the original terms included in the preprocessed data and calculating a recommendation score for each original term; separating the original terms into phonemes, performing a similarity calculation between the phoneme-separated original terms, and clustering the original terms whose similarity value is greater than or equal to a preset threshold; and determining a recommended term among the plurality of original terms based on the recommendation scores and the clustering result.

The above-described problem solving means are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and detailed description of the invention.

According to the above-described problem solving means of the present application, similar terms in a data set can be clustered by selecting an item from a database data set, selecting recommended terms through morpheme analysis, converting syllables into phoneme units, and performing term clustering.

According to the above-described problem solving means of the present application, term clustering is performed after morpheme analysis so that recommended terms with assigned priorities are provided to the user, which allows terms to be recommended and clustered with high precision.

According to the above-described problem solving means of the present application, because terms are grouped by calculating similarity in phoneme units, notation errors such as typos in the data set can also be replaced with the recommended terms.

According to the above-described problem solving means of the present application, unstructured big data can be classified more accurately by clustering erroneous data, or data in which the same thing is expressed in different terms, under a recommended term.

However, the effects obtainable herein are not limited to the above-described effects, and other effects may exist.

FIG. 1 is a schematic configuration diagram of a term clustering apparatus according to an embodiment of the present application.

FIG. 2 is a diagram schematically showing part of a data item on which the term clustering apparatus according to an embodiment of the present application performs clustering.

FIG. 3 is a view for explaining morpheme separation in the term clustering apparatus according to an embodiment of the present application.

FIG. 4 is a diagram exemplarily showing the result of ranking morphemes in reverse order of frequency in the term clustering apparatus according to an embodiment of the present application.

FIG. 5 is a diagram exemplarily showing the result of calculating the recommendation scores of terms in the term clustering apparatus according to an embodiment of the present application.

FIG. 6 is a diagram schematically illustrating the ranking of recommended terms according to the calculated recommendation scores in the term clustering apparatus according to an embodiment of the present application.

FIG. 7 is a view exemplarily illustrating the phoneme separation of recommended terms in the term clustering apparatus according to an embodiment of the present application.

FIG. 8 is a view exemplarily showing the result of term clustering in the term clustering apparatus according to an embodiment of the present application.

FIG. 9 is an operation flowchart of a method for term clustering of unstructured text data for big data analysis according to an embodiment of the present application.

Hereinafter, embodiments of the present application will be described in detail with reference to the accompanying drawings so that those skilled in the art to which the present application pertains may easily practice. However, the present application may be implemented in various different forms and is not limited to the embodiments described herein. In addition, in order to clearly describe the present application in the drawings, parts irrelevant to the description are omitted, and like reference numerals are assigned to similar parts throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be positioned "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where another member is present between the two members.

Throughout this specification, when a part "includes" a certain component, this means that it may further include other components, rather than excluding other components, unless otherwise stated.

Referring to FIG. 1, the term clustering apparatus 100 may include a database 110, a data preprocessing unit 120, a recommended term determination unit 130, a data clustering unit 140, a data result unit 150, and a user input receiving unit 160.

According to an embodiment of the present application, the term clustering apparatus 100 may select a single column item from the data set and first select recommended terms through a weight calculation based on morpheme analysis within the selected column. The term clustering apparatus 100 may then cluster the original terms based on the recommended terms and a similarity calculation. The similarity calculation may include a preprocessing step that separates syllables into phonemes. In addition, if a recommended term does not adequately represent the original terms clustered under it, the user can cluster the original terms by entering an arbitrary recommended term.

In addition, the term clustering apparatus 100 may cluster the original terms in a column of a selected data set using an automated term clustering algorithm. By providing recommended terms through a weight calculation based on morpheme analysis, the term clustering apparatus 100 may provide a term clustering method for unstructured text data for big data analysis that takes user convenience into account.

The database 110 may include a data set used for term clustering. The database 110 may include unstructured data. Unstructured data may refer to information that does not have a predefined data model or is not organized in a predefined manner. Unlike numeric data, which has a fixed standard or form, unstructured data may have varying forms and structures, such as pictures, images, and documents.

Referring to FIG. 2, a data set included in the database 110 may include two or more column items. The column items included in the data set can be divided into a representative key and general columns. For example, in the data set of FIG. 2, the representative key may be 'patient ID' and a general column may be 'disease name'. In this case, the general column 'disease name' may consist of unstructured text data. Unstructured text data may include terms.

The data preprocessing unit 120 may select data from a data set included in the database 110 and perform preprocessing. The data preprocessing unit 120 may select the column on which to perform term clustering from among the plurality of columns of the data set stored in the database 110.

In other words, the data preprocessing unit 120 may determine, among the data sets included in the database 110, a first data set on which to perform term clustering, select one column on which to perform term clustering from among the plurality of column items of the first data set, and preprocess the data of the selected column item. For example, referring to FIG. 2, the data preprocessing unit 120 may determine the first data set among the plurality of data sets included in the database 110, select its 'disease name' column as the column on which to perform term clustering, and preprocess the data of the selected 'disease name' column item.

The data preprocessing unit 120 may perform preprocessing that removes duplicate terms included in the selected column and data that does not contain a term. In other words, the data preprocessing unit 120 may remove duplicate data and null values (data not including a term) from the data of the determined column.

According to an embodiment of the present application, terms whose forms match exactly do not need to be clustered, and a null value corresponding to a blank (data not including a term) likewise does not need term clustering, so the data preprocessing unit 120 may remove them. In addition, the user can replace a null value (data not including a term) with another term as needed.

For example, the user input receiving unit 160 may receive term clustering performance information from a user terminal. This information may include an alternative term for replacing data that does not contain a term. In other words, the data preprocessing unit 120 may enter the alternative term provided through the user input receiving unit 160 into data that does not include a term.
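The preprocessing described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `preprocess_column`, the `replacement` parameter, and the choice to treat whitespace-only strings as null values are all assumptions.

```python
from typing import Iterable, List, Optional


def preprocess_column(values: Iterable[Optional[str]],
                      replacement: Optional[str] = None) -> List[str]:
    """Remove null values and exact duplicates from one column.

    If `replacement` is given, null values are replaced with it instead
    of being dropped (the user-supplied alternative term).
    """
    seen = set()
    result = []
    for value in values:
        term = (value or "").strip()   # assumption: blank counts as null
        if not term:
            if replacement is None:
                continue               # drop data that contains no term
            term = replacement
        if term in seen:
            continue                   # exact matches need no clustering
        seen.add(term)
        result.append(term)
    return result


print(preprocess_column(["pneumonia", None, "pneumonia", "", "hepatitis"]))
# ['pneumonia', 'hepatitis']
```

The first occurrence of each term is kept so that the original order of the column is preserved.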

FIG. 3 is a view for explaining morpheme separation in the term clustering apparatus according to an embodiment of the present application, FIG. 4 exemplarily shows the result of ranking morphemes in reverse order of frequency, and FIG. 5 exemplarily shows the result of calculating the recommendation scores of terms.

Referring to FIG. 3, the recommended term determination unit 130 may separate the morphemes of the original terms included in the preprocessed data. Morpheme separation may divide sentences or text into their smallest units and automatically determine the part of speech of each morpheme. In other words, morpheme separation divides an original term into the smallest units that carry meaning. Morpheme analysis is the most basic technique in natural language processing.

For example, referring to FIG. 3, the recommended term determination unit 130 may divide 'colon polyp removal' into 'colon', 'polyp', and 'removal'. Likewise, it may divide 'unspecified pneumonia' (상세불명 폐렴) into 'detail' (상세), 'unknown' (불명), and 'pneumonia' (폐렴). The recommended term determination unit 130 may also perform morpheme separation restricted to a specific part of speech.

According to an embodiment of the present application, the separated morphemes may be used, together with weights, to prioritize the recommended terms. The recommended term determination unit 130 may rank the separated morphemes by frequency (Rank). Through this, the user can check the most frequently used morphemes in the selected column (data). For example, the recommended term determination unit 130 may sort the separated morphemes by frequency and list the resulting ranks as shown in FIG. 4.
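The frequency ranking can be sketched as below, assuming the morphemes of each term have already been separated by some morpheme analyzer (not shown). The function name and the tie-breaking rule are assumptions; the source only states that a higher frequency yields a higher rank value.

```python
from collections import Counter
from typing import Dict, Iterable, List


def reverse_frequency_ranks(morpheme_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each morpheme a rank value that grows with its frequency.

    The least frequent morpheme gets 1 and the most frequent gets N,
    where N is the number of distinct morphemes. Ties are broken by the
    morpheme text, which is an arbitrary choice made for this sketch.
    """
    counts = Counter(m for morphemes in morpheme_lists for m in morphemes)
    ordered = sorted(counts, key=lambda m: (counts[m], m))
    return {m: rank for rank, m in enumerate(ordered, start=1)}


ranks = reverse_frequency_ranks([
    ["colon", "polyp", "removal"],
    ["detail", "unknown", "pneumonia"],
    ["bacterial", "pneumonia"],
])
print(ranks["pneumonia"])  # 7, the largest value, since it occurs most often
```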

The recommended term determination unit 130 may calculate the recommendation score of an original term using the separated morphemes. Specifically, it may calculate the recommendation score using a value that quantifies the extraction frequency of each separated morpheme and a weight derived from the morpheme. The value that quantifies the extraction frequency may be, for example, the Rank value shown in FIG. 4, and the weight may be the proportion of the original term that the morpheme occupies. As shown in FIG. 4, the recommended term determination unit 130 assigns rank values in reverse order of frequency, so that the higher the frequency of a morpheme, the larger its rank value and therefore its morpheme score. The weight of a morpheme may be calculated as the ratio of the length of the morpheme to the length of the term containing it. The recommended term determination unit 130 may calculate the recommendation score of a term by multiplying the weight and the morpheme score of each morpheme and summing the products over the whole term.

That is, the recommendation score may be calculated by dividing the length of each of the plurality of morphemes separated from a first original term by the total length of the first original term, multiplying each result by the value that quantifies the extraction frequency of the corresponding morpheme, and summing the products. For example, when the first original term consists of a first morpheme, a second morpheme, and a third morpheme, the recommendation score of the term is the sum of three products, each being the frequency value of one morpheme multiplied by the length of that morpheme divided by the total length of the first original term.

For example, the recommendation score may be expressed as [Equation 1].

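[Equation 1]

$$\mathrm{score}(t) = \sum_{i=1}^{n} \mathrm{rank}_i \times \frac{\mathrm{len}(m_i)}{\mathrm{len}(t)}$$

Here, t is the original term, n is the number of morphemes separated from the term, m_i is the i-th separated morpheme, len(·) is a length in characters, and rank_i is the frequency rank of m_i (assigned in reverse order, so that more frequent morphemes receive larger values).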

For example, referring to FIGS. 4 and 5, the frequency rank values of 'detail' (상세), 'unknown' (불명), and 'pneumonia' (폐렴) are 402, 399, and 330, respectively. The total length of 'unspecified pneumonia' (상세불명 폐렴) is 7 characters including the space, and each separated morpheme is 2 characters long, so the weight of each morpheme is its length, 2, divided by the total length, 7. As a result, the recommendation score of 'unspecified pneumonia' is (402 + 399 + 330) × 2/7 ≈ 323 points.
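The worked example can be checked with a short sketch. It assumes the term is written 상세불명 폐렴 (seven characters including the space, consistent with the phoneme sequence given for FIG. 7) and that its morphemes are 상세, 불명, and 폐렴; the function name is hypothetical.

```python
from typing import Dict, List


def recommendation_score(term: str, morphemes: List[str],
                         rank: Dict[str, int]) -> float:
    """Sum of rank value times (morpheme length / term length)."""
    total_length = len(term)  # counts the space, as in the example
    return sum(rank[m] * len(m) / total_length for m in morphemes)


# Rank values quoted in the example for FIGS. 4 and 5.
rank = {"상세": 402, "불명": 399, "폐렴": 330}
score = recommendation_score("상세불명 폐렴", ["상세", "불명", "폐렴"], rank)
print(round(score))  # 323
```

Because the weight depends on length, a long term containing many short, rare morphemes scores lower than a compact term made of frequent ones.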

According to an embodiment of the present application, the recommended term determination unit 130 may calculate frequencies over all parts of speech during morpheme analysis, but is not limited thereto. For example, it may perform morpheme analysis restricted to the parts of speech specified by the part-of-speech determination information included in the term clustering performance information provided through the user input receiving unit 160.

Referring to FIG. 6, the recommended term determination unit 130 may sort the terms in descending order of the recommendation score calculated by the method described above, giving priority to terms with higher scores.

According to one embodiment of the present application, the recommended term determination unit 130 may select recommended terms by calculating recommendation scores from the frequencies and weights of the separated morphemes. In other words, it may determine the priority of the recommended terms using the frequency ranks of the morphemes and weights based on the lengths of the morphemes.

Referring to FIG. 7, the data clustering unit 140 may separate the original terms into phonemes and perform a similarity calculation between the phoneme-separated original terms. When an original term is written in Hangul, the data clustering unit 140 may separate it into phonemes according to the Hangul alphabet, namely the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and calculate the similarity using an artificial intelligence-based algorithm. The data clustering unit 140 may perform the similarity calculation between original terms using the Fuzzy Data Matching algorithm, but is not limited thereto. The Fuzzy Data Matching algorithm matches data using a value calculated from the edit distance (Levenshtein distance).

The data clustering unit 140 may separate terms composed of syllables into initial, medial, and final phonemes according to the Hangul alphabet. This is because the Fuzzy Data Matching algorithm, which is an artificial intelligence-based algorithm, calculates similarity by comparing the written form of each character. In English, the language for which the algorithm was designed, each letter is written separately. In Korean, the letters of the Hangul alphabet, which correspond to the letters of the English alphabet, are combined into syllable blocks. The data clustering unit 140 therefore breaks each Hangul syllable down into its individual letters before calculating similarity. For example, without phoneme separation, '강' (gang) and '공' (gong) are completely different characters, but after phoneme separation, 'ㄱㅏㅇ' and 'ㄱㅗㅇ' are similar sequences that differ only in the medial vowel. For this reason, when an original term is written in Hangul, the data clustering unit 140 may separate it into phonemes according to the Hangul alphabet.

Referring to FIG. 7 as an example, the data clustering unit 140 may separate 'unspecified pneumonia' (상세불명 폐렴) into 'ㅅㅏㅇㅅㅔㅂㅜㄹㅁㅕㅇ ㅍㅖㄹㅕㅁ' and then calculate the similarity. In the case of English, this separation step can be omitted.
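Phoneme separation of this kind can be sketched with the arithmetic layout of the Unicode Hangul Syllables block, in which each precomposed syllable encodes one of 19 initial consonants, 21 medial vowels, and 28 finals (including "no final"). This is an illustrative sketch only; the source does not say how the apparatus performs the separation, and the function name is hypothetical.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"   # 21 medial vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals

SYLLABLE_BASE = 0xAC00      # first precomposed syllable, '가'
SYLLABLE_COUNT = 19 * 21 * 28


def decompose(text: str) -> str:
    """Replace each precomposed Hangul syllable with its letters.

    Characters outside the Hangul Syllables block (spaces, Latin
    letters, digits) pass through unchanged.
    """
    out = []
    for ch in text:
        offset = ord(ch) - SYLLABLE_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            cho, rest = divmod(offset, 21 * 28)
            jung, jong = divmod(rest, 28)
            out.append(CHOSEONG[cho] + JUNGSEONG[jung] + JONGSEONG[jong])
        else:
            out.append(ch)
    return "".join(out)


print(decompose("강"), decompose("공"))  # ㄱㅏㅇ ㄱㅗㅇ
print(decompose("상세불명 폐렴"))        # ㅅㅏㅇㅅㅔㅂㅜㄹㅁㅕㅇ ㅍㅖㄹㅕㅁ
```

After decomposition, '강' and '공' differ by one substitution out of three letters, so an edit-distance measure sees them as close rather than unrelated.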

According to one embodiment of the present application, the data clustering unit 140 may calculate the similarity between the unstructured data items selected by the user and group them. Specifically, the data clustering unit 140 may calculate the similarity between phoneme-separated terms and, when the similarity reaches a certain value, cluster the terms under recommended terms in order of the recommended terms' priority. In other words, the data clustering unit 140 may cluster original terms whose similarity values are greater than or equal to a preset threshold. The threshold may be changed at the user's convenience. That is, the threshold may be changed based on threshold correction information included in the term clustering performance information received by the user input receiving unit 160.

For example, referring to FIG. 7, the data clustering unit 140 may perform a similarity calculation for each of a plurality of terms. For example, the data clustering unit 140 may perform a similarity calculation between the first original term (unspecified pneumonia) and the second original term (backbone sprain). The data clustering unit 140 may also perform a similarity calculation between the first original term (unspecified pneumonia) and the third original term (acute hepatitis). In this way, the data clustering unit 140 may perform a similarity calculation between the first original term and each of the other original terms, up to the nth, included in the column. The similarity calculation performed by the data clustering unit 140 may then be used by the recommended term determination unit 130 to determine the recommended term.

According to one embodiment of the present application, the data clustering unit 140 may separate the terms, prioritized as recommended terms by the recommended term determination unit 130, into phoneme units and cluster them. An original term may be clustered when its similarity, calculated from the edit distance, reaches a certain level.
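One possible reading of threshold-based clustering in priority order is sketched below. The greedy first-match assignment is an assumption, as are the function names. To keep the sketch self-contained, `difflib.SequenceMatcher.ratio()` from the Python standard library stands in for the edit-distance-based score; it is a different measure, and for Korean input the terms would first be separated into phonemes.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def similarity(a: str, b: str) -> float:
    """Stand-in score from 0 to 100 (not the source's edit-distance measure)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def cluster_terms(terms_by_priority: List[str], threshold: float,
                  score: Callable[[str, str], float] = similarity
                  ) -> Dict[str, List[str]]:
    """Greedy clustering: terms are visited from highest to lowest
    priority; each joins the first existing cluster whose representative
    it matches at or above the threshold, or else starts a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for term in terms_by_priority:
        for representative in clusters:
            if score(representative, term) >= threshold:
                clusters[representative].append(term)
                break
        else:
            clusters[term] = [term]
    return clusters


result = cluster_terms(["pneumonia", "pnuemonia", "hepatitis"], threshold=80)
print(result)
# {'pneumonia': ['pneumonia', 'pnuemonia'], 'hepatitis': ['hepatitis']}
```

Raising the threshold yields more, smaller clusters; lowering it merges more variants under each representative, which is why the threshold is left adjustable by the user.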

Referring to FIG. 8, the recommended term determining unit 130 may determine the recommended term among the plurality of original terms based on the recommendation score and the clustering result.

The recommended term determination unit 130 may determine a recommended term among the plurality of original terms based on the first clustering result 11, which includes the first original term (1), the second original term (2), the third original term (3), the fourth original term (4), and so on. Specifically, it may determine the recommended term based on the recommendation scores of the first original term (1) through the seventh original term (7) included in the first clustering result 11, selecting the original term with the highest recommendation score as the recommended term. The first clustering result 11 may be the result of the data clustering unit 140 clustering original terms whose similarity is greater than or equal to a predetermined threshold.

For example, the recommendation score of the first original term ('unspecified pneumonia') may be 377, that of the second original term (a variant of 'unspecified pneumonia') may be 323, that of the third original term (another variant of 'unspecified pneumonia') may be 323, and that of the fourth original term ('unspecified bacterial pneumonia') may be 310. The recommended term determination unit 130 may determine the first original term, which has the highest recommendation score among the first to fourth original terms, as the recommended term. The similarity value may be, for example, the similarity between the first original term, determined as the recommended term, and the second original term. In other words, the recommended term determination unit 130 selects the original term with the highest recommendation score among the plurality of original terms included in the first clustering result 11 as the recommended term, and the similarity is the similarity value between the recommended term and each original term.
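Selecting the recommended term within one cluster can be sketched as follows. The labels "variant A" to "variant D" are hypothetical placeholders for the four original terms, the scores are those quoted above, and the similarity function is supplied by the caller (a trivial exact-match function is used here only to make the sketch runnable).

```python
from typing import Callable, Dict, List, Tuple


def cluster_result(scores: Dict[str, float],
                   similarity: Callable[[str, str], float]
                   ) -> List[Tuple[str, str, float]]:
    """Return (recommended term, original term, similarity) rows.

    The recommended term is the cluster member with the highest
    recommendation score; on a tie, the first one listed wins.
    """
    recommended = max(scores, key=scores.get)
    return [(recommended, original, similarity(recommended, original))
            for original in scores]


scores = {"variant A": 377, "variant B": 323, "variant C": 323, "variant D": 310}
rows = cluster_result(scores, similarity=lambda a, b: 100.0 if a == b else 0.0)
print(rows[0])  # ('variant A', 'variant A', 100.0)
```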

For example, referring to FIG. 8, the data clustered under the recommended term 'unspecified pneumonia' are variant spellings of that term. The original terms clustered under the recommended term are the same as 'unspecified pneumonia' except for spacing and particles, which were written differently depending on the user who entered each original term.

In addition, the terms (data) clustered under the recommended term 'backbone sprain' (a sprain of the neck and lumbar spine) include variants such as 'neck back sprain', 'sprain of the neck and back bones', 'sprain of the neck/waist bones', and 'neck sprain | back sprain'. It can thus be seen that terms written in different forms, through the use of special symbols, spaces, particles, and conjunctions, are clustered under the single recommended term.

For example, referring to FIG. 8, the data result unit 150 may provide the clustering result of the data clustering unit 140. The clustering result may include recommended terms, original terms, and similarity values. Here, an original term may be a term included in the column. In other words, an original term may be data stored as an initial value in the database 110. Illustratively, a disease name may correspond to an original term. The recommended term may be a term determined among a plurality of original terms based on the recommended score and the clustering result. The similarity may be the similarity value between the recommended term and an original term, that is, the similarity calculated by the data clustering unit 140 between original terms separated into phonemes. The data result unit 150 may provide the user with the original terms clustered under each recommended term, together with the similarity between the recommended term and each original term expressed as a percentage.

For example, the similarity value between the recommended term 'unspecified pneumonia' and the first original term 'unspecified pneumonia' in the first clustering result 11 may be 100. In addition, the similarity value between the recommended term 'unspecified pneumonia' and the second original term 'unspecified pneumonia' may be 94, and the similarity value between the recommended term 'unspecified pneumonia' and the third original term 'unspecified pneumonia' may be 100. With an existing clustering method, the recommended term and the third original term, which are the same term apart from spacing, may fail to be clustered together. Because the data clustering unit 140 separates the terms into phonemes before performing the similarity calculation, this failure to cluster due to spacing can be avoided.
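The effect of comparing terms at the phoneme level can be illustrated with a short sketch. The following is only an assumption-laden example, not the algorithm of the data clustering unit 140: Unicode NFD normalization is used to split Hangul syllables into initial, medial, and final jamo, whitespace is dropped before comparison, and `difflib.SequenceMatcher` stands in for the similarity calculation, which the text does not specify. The function names and the Korean strings are hypothetical and chosen only for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def to_jamo(term):
    """Remove whitespace and decompose Hangul syllables into jamo (NFD)."""
    no_space = "".join(term.split())
    return unicodedata.normalize("NFD", no_space)


def similarity(a, b):
    """Similarity of two terms on a 0-100 scale, computed on jamo sequences.

    SequenceMatcher is a stand-in; the actual similarity measure is an assumption.
    """
    return 100.0 * SequenceMatcher(None, to_jamo(a), to_jamo(b)).ratio()


def cluster_under(recommended, terms, threshold=90.0):
    """Keep the terms whose similarity to the recommended term meets the threshold."""
    return [t for t in terms if similarity(recommended, t) >= threshold]


# Hypothetical variants of one disease name, differing in spacing and a particle.
variants = ["상세불명의 폐렴", "상세불명의폐렴", "상세불명 폐렴"]
clustered = cluster_under("상세불명의 폐렴", variants)
```

Under these assumptions, a variant that differs only in spacing scores 100, and a variant that omits a particle scores slightly lower while still passing a threshold of 90.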

According to one embodiment of the present application, the data result unit 150 may check, store, and modify the results of the data clustering unit 140. The clustering result that can be checked in the data result unit 150 is the result of clustering the data included in a column item of the selected data set. The clustering result shows the original data contained in the column item of the data set, the recommended term, and the similarity between the recommended term and the original data. The user can check the clustering result in the data result unit 150 and correct the recommended term.

According to one embodiment of the present application, the data result unit 150 may modify the recommended term based on the term (data) clustering result. The data result unit 150 may modify the recommended term in response to a modification request received from the user terminal. In other words, the recommended term is the term the user wants to use as the standard, and the user may modify it for convenience.

According to one embodiment of the present application, the data result unit 150 may store the term (data) clustering result. The clustering result for the target data can be stored in a form desired by the user, for example in the database 110. At this time, rather than overwriting the original terms with the recommended terms, the recommended terms may be stored in a new column (a column containing the recommended terms), so that the user can check the clustering result again later.
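Storing the recommended terms in a new column, rather than overwriting the original terms, can be sketched as follows. This is a minimal sketch on in-memory rows; the function name, the column names `disease_name` and `recommended_term`, and the sample rows are hypothetical, and a real embodiment would write to the database 110.

```python
def with_recommended_column(rows, column, recommended_for,
                            new_column="recommended_term"):
    """Return new rows that keep `column` as entered and add `new_column`.

    Terms without a mapping keep their original value as the recommended term.
    """
    return [
        {**row, new_column: recommended_for.get(row[column], row[column])}
        for row in rows
    ]


# Hypothetical sample data: two spellings of the same disease name.
rows = [
    {"id": 1, "disease_name": "unspecifiedpneumonia"},
    {"id": 2, "disease_name": "unspecified pneumonia"},
]
recommended_for = {"unspecifiedpneumonia": "unspecified pneumonia"}
stored = with_recommended_column(rows, "disease_name", recommended_for)
```

Because the original column is left untouched, the mapping from each original term to its recommended term remains available for later review.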

According to an embodiment of the present application, the user input receiving unit 160 may receive term clustering performance information from a user terminal (not shown). The term clustering performance information may include user input information for determining a first data set on which to perform term clustering among the data sets included in the database. In addition, the term clustering performance information may include user input information for determining the part of speech to be used when separating the morphemes of the original terms included in the preprocessed data. In addition, the term clustering performance information may include user input information for setting the threshold applied to the similarity between original terms separated into phonemes.
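The term clustering performance information listed above could be represented as a small record. The following is a hypothetical sketch: the class name, the field names, the default values, and the 0 to 100 threshold scale are assumptions made for illustration and are not taken from the present application.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteringRequest:
    """Hypothetical container for the user inputs described above."""
    dataset: str                 # first data set on which to perform term clustering
    column: str                  # column item holding the original terms
    parts_of_speech: list = field(default_factory=lambda: ["noun"])
    threshold: float = 90.0      # similarity cut-off, assumed to be on a 0-100 scale

    def __post_init__(self):
        # Reject thresholds outside the assumed percentage scale.
        if not 0.0 <= self.threshold <= 100.0:
            raise ValueError("threshold must be between 0 and 100")


request = ClusteringRequest(dataset="diagnoses", column="disease_name")
```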

According to an embodiment of the present application, the user input receiving unit 160 may provide a term clustering menu to a user terminal (not shown). For example, a user terminal (not shown) may download and install an application program provided by the term clustering device 10, and the term clustering menu may be provided through the installed application.

The user input receiving unit 160 may transmit and receive data, contents, and various communication signals to and from a user terminal (not shown) through a network, and may include any type of server, terminal, or device having data storage and processing functions.

A user terminal (not shown) is a device that interworks with the user input receiving unit 160 through a network. It may be, for example, any kind of wireless communication device, such as a smartphone, a smart pad, a tablet PC, a wearable device, a PDA (Personal Digital Assistant), or a PCS (Personal Communication System), GSM (Global System for Mobile Communication), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (Wideband Code Division Multiple Access), or WiBro (Wireless Broadband Internet) terminal, or a fixed terminal such as a desktop computer or a smart TV.

Examples of networks for sharing information between the user input receiving unit 160 and a user terminal (not shown) include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5G network, a WiMAX (Worldwide Interoperability for Microwave Access) network, the wired and wireless Internet, a LAN (Local Area Network), a wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, a Wi-Fi network, an NFC (Near Field Communication) network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

Hereinafter, based on the details described above, the operation flow of the present application will be briefly described.

The term clustering method of unstructured text data for analyzing big data shown in FIG. 9 may be performed by the term clustering device 10 described above. Therefore, even if omitted, the description of the term clustering device 10 may be equally applied to the description of the term clustering method of unstructured text data for big data analysis.

In step S901, the term clustering apparatus 10 may select data from a data set included in the database and perform preprocessing.

In step S902, the term clustering device 10 may separate the morphemes of the original terms included in the pre-processed data, and calculate a recommendation score of the original terms.

In step S903, the term clustering apparatus 10 may separate the original terms into phonemes, perform similarity calculation between the original terms separated into phonemes, and cluster the original terms whose similarity calculation value is greater than or equal to a preset threshold.

In step S904, the term clustering apparatus 10 may determine a recommended term among a plurality of original terms based on the recommendation score and the clustering result.

In the above description, steps S901 to S904 may be further divided into additional steps or combined into fewer steps, depending on the embodiment of the present application. Also, some steps may be omitted if necessary, and the order of the steps may be changed.
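Steps S901 to S904 can be sketched end to end as follows. This is a simplified illustration under several assumptions, not the method of the present application: whitespace tokens stand in for morphemes because no morphological analyzer is used, the score is one possible reading of a frequency weighted by each token's share of the term length, `difflib.SequenceMatcher` on NFD-decomposed strings stands in for the similarity calculation, and the clustering is a simple greedy grouping. All function names and sample terms are hypothetical.

```python
import unicodedata
from collections import Counter
from difflib import SequenceMatcher


def preprocess(values):
    """S901: drop empty entries and exact duplicates, keeping first-seen order."""
    seen, terms = set(), []
    for value in values:
        term = (value or "").strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


def recommended_scores(terms):
    """S902: score each term from token frequency weighted by length share.

    Assumption: whitespace tokens stand in for morphemes.
    """
    freq = Counter(token for term in terms for token in term.split())
    scores = {}
    for term in terms:
        tokens = term.split()
        total = sum(len(token) for token in tokens)
        scores[term] = sum(freq[t] * len(t) / total for t in tokens)
    return scores


def similarity(a, b):
    """Similarity (0-100) on whitespace-free, jamo-decomposed strings."""
    ja = unicodedata.normalize("NFD", "".join(a.split()))
    jb = unicodedata.normalize("NFD", "".join(b.split()))
    return 100.0 * SequenceMatcher(None, ja, jb).ratio()


def cluster(terms, threshold):
    """S903: greedy grouping; a term joins the first cluster whose seed it matches."""
    clusters = []
    for term in terms:
        for members in clusters:
            if similarity(members[0], term) >= threshold:
                members.append(term)
                break
        else:
            clusters.append([term])
    return clusters


def run_pipeline(values, threshold=90.0):
    """S901 to S904: map each recommended term to the original terms in its cluster."""
    terms = preprocess(values)
    scores = recommended_scores(terms)
    # S904: within each cluster, the highest-scoring term becomes the recommended term.
    return {max(c, key=scores.get): c for c in cluster(terms, threshold)}
```

For example, calling `run_pipeline` on a column containing 'unspecified pneumonia', 'unspecifiedpneumonia', and 'ankle fracture' groups the first two terms under 'unspecified pneumonia' and leaves 'ankle fracture' in its own cluster.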

The term clustering method of unstructured text data for big data analysis according to an embodiment of the present application may be implemented in the form of program instructions that can be executed through various computer means and may be recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, or the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code produced by a compiler as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware device described above may be configured to operate as one or more software modules to perform the operation of the present invention, and vice versa.

In addition, the method for clustering terms of unstructured text data for big data analysis described above may also be implemented in the form of a computer program or application that is stored in a recording medium and executed by a computer.

The foregoing description of the present application is for illustration, and a person having ordinary knowledge in the technical field to which the present application belongs will understand that it is possible to easily change to other specific forms without changing the technical spirit or essential characteristics of the present application. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive. For example, each component described as a single type may be implemented in a distributed manner, and similarly, components described as distributed may be implemented in a combined form.

The scope of the present application is indicated by the claims below, rather than the detailed description, and it should be interpreted that all modifications or variations derived from the meaning and scope of the claims and equivalent concepts thereof are included in the scope of the present application.

Claims

An apparatus for clustering terms of unstructured text data for big data analysis, comprising:

a database including a data set;

a data preprocessing unit for selecting data from the data set included in the database and performing preprocessing;

a recommended term determining unit for separating morphemes of original terms included in the preprocessed data and calculating a recommended score of each original term; and

a data clustering unit for separating the original terms into phonemes, performing similarity calculation between the original terms separated into phonemes, and clustering original terms whose similarity calculation value is greater than or equal to a preset threshold,

wherein the recommended term determining unit determines a recommended term among the plurality of original terms based on the recommended scores and a result of the clustering.

The apparatus of claim 1,

wherein the data preprocessing unit determines a first data set on which to perform term clustering among the data sets included in the database, selects a first column on which to perform term clustering among a plurality of column items of the first data set, and preprocesses the data of the selected column item.

The apparatus of claim 1,

wherein the data preprocessing unit performs preprocessing that removes, from the selected column, duplicate terms and data that does not contain a term.

The apparatus of claim 1,

wherein the recommended term determining unit calculates the recommended score using a value quantifying the extraction frequency of the separated morphemes and a weight derived from the separated morphemes,

and wherein the weight is the proportion of the original term occupied by each morpheme.

The apparatus of claim 4,

wherein the recommended score is calculated, based on the value quantifying the extraction frequency of the separated morphemes, using the sum of the extraction frequency of the separated morphemes and the results of dividing the length of each of the plurality of morphemes separated from a first original term by the total length of the first original term.

The apparatus of claim 1,

wherein, when the original term is in Hangul, the data clustering unit separates the original term into phonemes according to Hangul jamo, namely choseong (initial), jungseong (medial), and jongseong (final), and calculates the similarity using an artificial intelligence-based algorithm.

The apparatus of claim 1,

further comprising a data result unit for providing a clustering result of the data clustering unit,

wherein the clustering result includes a recommended term, an original term, and a similarity value.

The apparatus of claim 1,

further comprising a user input receiving unit for receiving term clustering performance information from a user terminal.

The apparatus of claim 8,

wherein the recommended term determining unit, when separating morphemes, performs morpheme separation on the preprocessed data based on part-of-speech determination information included in the term clustering performance information.

A method for clustering terms of unstructured text data for big data analysis, comprising:

selecting data from a data set included in a database and performing preprocessing;

separating morphemes of original terms included in the preprocessed data and calculating a recommended score of each original term;

separating the original terms into phonemes, performing similarity calculation between the original terms separated into phonemes, and clustering original terms whose similarity calculation value is greater than or equal to a preset threshold; and

determining a recommended term among the plurality of original terms based on the recommended scores and a result of the clustering.