WO2020111395A1 - Device and method for term clustering of unstructured text data for big data analysis - Google Patents

Device and method for term clustering of unstructured text data for big data analysis

Info

Publication number
WO2020111395A1
Authority
WO
WIPO (PCT)
Prior art keywords
term
data
clustering
original
recommended
Prior art date
Application number
PCT/KR2019/002778
Other languages
French (fr)
Korean (ko)
Inventor
황덕열
공성원
김세경
Original Assignee
(주) 위세아이텍
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by (주) 위세아이텍
Publication of WO2020111395A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/048 Fuzzy inferencing

Definitions

  • The present invention relates to an apparatus and method for clustering terms of unstructured text data for big data analysis.
  • In big data analysis, 70% to 80% of the total effort is spent on data preprocessing. Big data analysis is growing explosively, but data preprocessing technology is developing more slowly than analysis technology, and the need for automated data preprocessing technology is emerging accordingly.
  • The Fuzzy Matching algorithm is the most widely used algorithm for calculating the similarity between items of text data. It calculates the similarity between data using a value derived from the edit distance (Levenshtein distance).
  • The Fuzzy Matching algorithm simply calculates the similarity between two items of data. By applying it, data having at least a certain similarity can be clustered. However, because the algorithm was developed for English, when applied to Korean it calculates similarity on the basis of syllables rather than phonemes.
  • Morpheme analysis is the most important technique in natural language processing. It separates words or sentences into morphemes, the smallest meaningful units of words, and determines the part of speech of each. By checking morpheme frequencies in a data set, the key morphemes of the data set can be identified.
  • The present application is intended to solve the above-described problems of the prior art. To overcome the difficulty of preprocessing unstructured text in big data, it aims to provide an apparatus and method for clustering terms of unstructured text data for big data analysis that facilitate big data analysis by clustering similar terms in the data.
  • The present application also aims to provide an apparatus and method for clustering terms of unstructured text data for big data analysis that recommend representative terms when clustering the terms in a data set, thereby reducing the time of the unstructured data preprocessing that the user would otherwise perform manually.
  • The present application also aims to provide an apparatus and method for clustering terms of unstructured text data for big data analysis that apply morpheme analysis to automatically recommend a representative term to the user, thereby reducing the time the user needs to select a representative term.
  • A further object of the present invention is to provide an apparatus and method for clustering terms of unstructured text data that help standardize unstructured data by correcting human errors such as typos.
  • A term clustering apparatus of unstructured text data for big data analysis includes: a database including a data set; a data pre-processing unit that selects data from the data set included in the database and performs pre-processing; a recommended term determining unit that separates the morphemes of the original terms included in the pre-processed data and calculates a recommendation score for each original term; and a data clustering unit that separates the phonemes of the original terms, performs a similarity calculation between the phoneme-separated original terms, and clusters the original terms whose similarity value is greater than or equal to a preset threshold. The recommended term determining unit may determine the recommended term among the original terms based on the recommendation score and the clustering result.
  • The pre-processing unit determines a first data set on which to perform term clustering among the data sets included in the database, selects a first column on which to perform term clustering from among the plurality of column items of the first data set, and may pre-process the data of the selected column item.
  • The pre-processing unit may perform pre-processing to remove duplicate terms in the selected column and data that does not contain a term.
  • The recommended term determining unit calculates a recommendation score using a value obtained by quantifying the extraction frequency of the separated morphemes and a weight derived from the separated morphemes. The weight may be the proportion of the original term that the morpheme occupies.
  • The recommendation score according to an embodiment of the present application may be calculated as the sum, over the plurality of morphemes separated from a first original term, of the quantified extraction frequency of each morpheme multiplied by the length of that morpheme divided by the total length of the first original term.
  • The data clustering unit may separate each term into phonemes according to the Hangul alphabet, namely the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and calculate similarity using an artificial intelligence-based algorithm.
  • The term clustering apparatus further includes a data result unit that provides the cluster result of the data clustering unit, and the cluster result may include a recommended term, an original term, and a similarity value.
  • The term clustering apparatus may further include a user input receiving unit configured to receive term clustering performance information from a user terminal.
  • When separating morphemes, the recommended term determining unit may perform morpheme separation on the pre-processed data based on the part-of-speech determination information included in the term clustering performance information.
  • The term clustering method of unstructured text data for big data analysis includes: selecting data from a data set included in a database and performing pre-processing; separating the morphemes of the original terms included in the pre-processed data and calculating a recommendation score for each original term; separating the phonemes of the original terms and performing a similarity calculation between the phoneme-separated original terms; and clustering the original terms whose similarity value is greater than or equal to a preset threshold.
  • The method may include determining a recommended term among a plurality of original terms based on the recommendation score and the clustering result.
  • Term clustering may be performed so as to provide the user with recommended terms for which a priority has been set, thereby recommending and clustering terms with high precision.
  • FIG. 1 is a schematic configuration diagram of a term clustering device according to an embodiment of the present application.
  • FIG. 2 is a diagram schematically showing a part of a data item to perform clustering of a term clustering apparatus according to an embodiment of the present application.
  • FIG. 3 is a view for explaining the morpheme separation of the term clustering device according to an embodiment of the present application.
  • FIG. 4 is a diagram exemplarily showing results of ranking in the reverse order of the frequency of morphemes in the term clustering apparatus according to an embodiment of the present application.
  • FIG. 5 is a diagram exemplarily showing a result of calculating a recommendation score of a term in the term clustering device according to an embodiment of the present application.
  • FIG. 6 is a diagram schematically illustrating a ranking of recommended terms according to calculation of a recommended score of terms in a term clustering apparatus according to an embodiment of the present application.
  • FIG. 7 is a view illustrating the phoneme separation of recommended terms in the term clustering apparatus according to an embodiment of the present application.
  • FIG. 8 is a view showing results of the term clustering in the term clustering device according to an embodiment of the present application by way of example.
  • FIG. 9 is an operation flowchart of a method for clustering terms of unstructured text data for big data analysis according to an embodiment of the present application.
  • FIG. 1 is a schematic configuration diagram of a term clustering device according to an embodiment of the present application.
  • the term clustering apparatus 100 may include a database 110, a data preprocessing unit 120, a recommended term determining unit 130, a data clustering unit 140, a data result unit 150, and a user input receiving unit 160.
  • the data term clustering apparatus 100 may select a single column item from the data set, and first select a recommended term through weight calculation according to morpheme analysis within the selected column.
  • the data term clustering apparatus 100 may cluster the original term based on the recommended term and similarity calculation.
  • the similarity calculation may include a pre-processing process for separating syllables into phonemes.
  • the user can cluster the original terms by entering an arbitrary recommended term.
  • the data term clustering apparatus 100 may cluster the original terms in a column of a selected data set using an automated term clustering algorithm.
  • by providing recommended terms through a weight calculation based on morpheme analysis, the data term clustering apparatus 100 may provide a term clustering method for unstructured text data for big data analysis that takes user convenience into account.
  • the database 110 may include a data set used for term clustering.
  • the database 110 may include unstructured data.
  • Unstructured data may refer to information that does not have a predefined data model or is not organized in a predefined manner.
  • Unlike numeric data, which has a fixed standard or format, unstructured data has varying shapes and structures, such as pictures, images, and documents.
  • FIG. 2 is a diagram schematically showing a part of a data item to perform clustering of a term clustering apparatus according to an embodiment of the present application.
  • a data set included in the database 110 may include two or more column items.
  • Column items included in the data set can be divided into a representative key and general columns.
  • the representative key of the data set of FIG. 2 may be “patient ID”, and the general column may be “disease name”.
  • the general column 'disease name' may consist of unstructured text data.
  • Unstructured text data may include terms.
  • the data pre-processing unit 120 may select data from a data set included in the database 110 and perform pre-processing.
  • the data preprocessing unit 120 may select, from among the plurality of columns of the data set stored in the database 110, a column on which to perform data clustering.
  • the data preprocessing unit 120 determines a first data set on which to perform term clustering among the data sets included in the database 110, selects a first column on which to perform term clustering from among the plurality of column items of the first data set, and may pre-process the data of the selected column item. For example, referring to FIG. 2, the data preprocessing unit 120 may determine the first data set among the plurality of data sets included in the database 110, select the 'disease name' column of that data set, and pre-process the data of the selected 'disease name' column.
  • the data pre-processing unit 120 may perform pre-processing to remove duplicate terms in the selected column and data that does not contain a term. In other words, the data pre-processing unit 120 may remove duplicate data and null values (data not including a term) from the selected column.
  • a null value, which corresponds to a blank, is data for which term clustering is unnecessary, so the data preprocessing unit 120 may remove null values.
  • the user can replace a null value (data that does not include a term) with another term as needed.
  • the user input receiving unit 160 may receive term clustering information from a user terminal.
  • the term clustering information may include alternative terms for replacing data that does not contain terms with other terms.
  • the data preprocessing unit 120 may insert the alternative term provided from the user input receiving unit 160 into data that does not contain a term.
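  • The pre-processing described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the application: the function name `preprocess_column`, the `alternative_term` parameter, and the choice to strip surrounding whitespace are all assumptions made for the example.

```python
from typing import Iterable, List, Optional


def preprocess_column(values: Iterable[Optional[str]],
                      alternative_term: Optional[str] = None) -> List[str]:
    """Remove duplicates and null values from one selected column.

    If alternative_term is given, null values are replaced by it
    instead of being dropped (hypothetical parameter name).
    """
    seen = set()
    result = []
    for value in values:
        # Treat None, empty strings, and blank strings as null values.
        term = (value or "").strip()  # stripping is an assumption
        if not term:
            if alternative_term is None:
                continue
            term = alternative_term
        # Keep only the first occurrence of each term.
        if term not in seen:
            seen.add(term)
            result.append(term)
    return result


column = ["acute hepatitis", None, "acute hepatitis", "", "colon polyp removal"]
print(preprocess_column(column))
print(preprocess_column(column, alternative_term="unknown"))
```

  • The first call drops the null entries, while the second substitutes the user-supplied alternative term for them before duplicates are removed.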
  • FIG. 3 is a view for explaining the morpheme separation of the term clustering device according to an embodiment of the present application.
  • FIG. 4 is a diagram exemplarily showing the results of ranking in the reverse order of the frequency of morphemes in the term clustering device according to an embodiment of the present application.
  • FIG. 5 is a diagram illustrating a result of calculating a recommendation score of a term in the term clustering device according to an embodiment of the present application.
  • the recommended term determining unit 130 may separate the morphemes of the original terms included in the pre-processed data. Morpheme separation divides a sentence or text into its smallest units and automatically determines the part of speech of each morpheme. Applied to an original term, it divides the term into the smallest units that carry meaning. Morphological analysis is the most basic technique in natural language processing.
  • the recommended term determining unit 130 may divide the “colon polyp removal” into “colon”, “polyp”, and “removal”.
  • the recommended term determining unit 130 may divide 'unspecified pneumonia' into 'detailed', 'unknown', and 'pneumonia'.
  • the recommended term determining unit 130 may perform morpheme separation based on a specific part of speech.
  • the separated morphemes may be used to prioritize recommended terms using weights.
  • the recommended term determining unit 130 may rank the separated morphemes by frequency (Rank). This lets the user check which morphemes are used most frequently in the column (data) the user selected. For example, the recommended term determining unit 130 may sort the separated morphemes by frequency and list the resulting ranks as shown in FIG. 4.
  • the recommended term determination unit 130 may calculate the recommended score of the original term using the separated morpheme.
  • the recommendation term determining unit 130 may calculate a recommendation score using a value obtained by quantifying the extraction frequency of the separated morphemes and a weight extracted based on the morpheme.
  • the value obtained by quantifying the extraction frequency of the separated morphemes may mean a value of Rank shown in FIG. 4.
  • the weight may be the proportion of the original term that the morpheme occupies.
  • the recommended term determining unit 130 may assign rank values in reverse order of frequency ranking, so that a morpheme with a higher frequency (a higher ranking, such as 1st place) receives a larger value, and may use as the weight the proportion of each term that each morpheme occupies.
  • the recommended term determining unit 130 may quantify the ranking of the separated morphemes in reverse order, so that a morpheme with a higher frequency receives a higher morpheme score.
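  • One way to picture the reverse-order ranking is the Python sketch below. It assumes the morphemes have already been separated by an external morphological analyzer, and the function name `rank_morphemes` and the alphabetical tie-breaking rule are illustrative assumptions, since the application does not state how ties are handled.

```python
from collections import Counter
from typing import Dict, List


def rank_morphemes(terms_morphemes: List[List[str]]) -> Dict[str, int]:
    """Map each morpheme to a reverse-order frequency rank.

    The rarest morpheme gets 1 and the most frequent gets the largest
    value, so a higher frequency yields a higher morpheme score.
    """
    counts = Counter(m for morphemes in terms_morphemes for m in morphemes)
    # Ascending by frequency; ties broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda m: (counts[m], m))
    return {morpheme: i + 1 for i, morpheme in enumerate(ordered)}


# Pre-separated morphemes for five hypothetical original terms.
separated = [
    ["colon", "polyp", "removal"],
    ["acute", "hepatitis"],
    ["acute", "pneumonia"],
    ["pneumonia"],
    ["unknown", "pneumonia"],
]
ranks = rank_morphemes(separated)
print(ranks["pneumonia"], ranks["acute"])  # 7 6
```

  • Here 'pneumonia' appears three times and therefore receives the largest value among the seven distinct morphemes.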
  • the weight of a morpheme may be calculated as the length of that morpheme divided by the length of the term containing it.
  • the recommended term determining unit 130 may calculate the recommendation score of a term by multiplying the weight of each morpheme by its morpheme score and summing the products over the whole term.
  • that is, the recommendation score may be calculated as the sum of the results obtained by multiplying the quantified extraction frequency of each separated morpheme by the length of that morpheme divided by the total length of the first original term. For example, when the first original term is separated into three morphemes, the recommendation score of the term may be calculated using the sum of the result of dividing the length of the first morpheme by the total length of the first original term, the result of dividing the length of the second morpheme by the total length of the first original term, and the result of dividing the length of the third morpheme by the total length of the first original term, each multiplied by the quantified frequency of the corresponding morpheme.
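  • the recommendation score may be expressed as [Equation 1].
  • [Equation 1] $$\mathrm{score}(t) = \sum_{i=1}^{n} \mathrm{rank}_i \times \frac{\mathrm{len}(m_i)}{\mathrm{len}(t)}$$
  • here, n is the number of morphemes separated from the term t, rank_i is the frequency rank (in reverse order) of the i-th morpheme m_i, and len(·) is the length in characters.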
  • for example, the reverse-order frequency ranks of 'detailed', 'unknown', and 'pneumonia' are 402, 399, and 330, respectively. The total length of the original term 'unspecified pneumonia' is 7 characters including the space, and each separated morpheme is 2 characters long (lengths are counted in the characters of the original Korean term). The weight of each morpheme is therefore its length, 2, divided by the total length, 7. As a result, the recommendation score of 'unspecified pneumonia' is 323 points, which is the sum of 402, 399, and 330, each multiplied by 2/7.
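  • The worked example above can be checked with a short Python sketch. The function name `recommendation_score` is an assumption, and because the original Korean strings are not reproduced here, the placeholder term "AABB CC" (7 characters including the space) and the placeholder morphemes "AA", "BB", and "CC" (2 characters each) stand in for them.

```python
from typing import Dict, List


def recommendation_score(term: str, morphemes: List[str],
                         rank: Dict[str, int]) -> float:
    """Sum of rank * (morpheme length / term length) over the morphemes."""
    total_length = len(term)  # spaces are counted, as in the example
    return sum(rank[m] * len(m) / total_length for m in morphemes)


# Placeholder strings with the same lengths as in the worked example.
term = "AABB CC"
morphemes = ["AA", "BB", "CC"]
rank = {"AA": 402, "BB": 399, "CC": 330}

score = recommendation_score(term, morphemes, rank)
print(round(score))  # 323
```

  • The unrounded value is 2262/7, about 323.14, which agrees with the 323 points given in the example.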
  • the recommended term determining unit 130 calculates the frequency in consideration of all parts of speech during morpheme analysis, but is not limited thereto.
  • the recommended term determining unit 130 may perform morpheme analysis after determining the part of speech based on the part-of-speech determination information included in the term clustering performance information provided through the user input receiving unit 160.
  • when separating morphemes, the recommended term determining unit 130 may perform morpheme separation on the pre-processed data based on the part-of-speech determination information included in the term clustering performance information.
  • FIG. 6 is a diagram schematically illustrating a ranking of recommended terms according to calculation of a recommended score of terms in a term clustering apparatus according to an embodiment of the present application.
  • the recommended term determining unit 130 may apply the recommendation score calculation described above and sort the terms so that those with higher recommendation scores come first.
  • the recommended term determining unit 130 may select the recommended term by calculating the recommendation score from the frequency and weight of the separated morphemes. In other words, it may determine the priority of the recommended terms using the ranked morpheme frequencies and the weights derived from the morpheme lengths.
  • FIG. 7 is a view illustrating the phoneme separation of recommended terms in the term clustering apparatus according to an embodiment of the present application.
  • the data clustering unit 140 may separate the phonemes of the original terms and perform a similarity calculation between the phoneme-separated original terms.
  • the data clustering unit 140 may separate each term into phonemes according to the Hangul alphabet, namely the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and calculate similarity using an artificial intelligence-based algorithm.
  • the data clustering unit 140 may perform the similarity calculation between original terms using the Fuzzy Data Matching algorithm, but is not limited thereto.
  • the Fuzzy Data Matching algorithm performs matching between data using a value calculated from the edit distance (Levenshtein distance).
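  • An edit-distance-based similarity of this kind can be sketched in Python as follows. The application does not give the exact formula that converts the edit distance into a similarity value, so the normalization by the longer string's length and the 0 to 100 scale are assumptions, as are the function names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity as a percentage (normalization is an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("pneumonia", "pneumonia"))  # 100.0
```

  • Identical strings score 100, and each additional edit lowers the score in proportion to the length of the longer string.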
  • the data clustering unit 140 may divide terms consisting of syllables into initial, medial, and final phonemes according to the Hangul alphabet. This is because the Fuzzy Data Matching algorithm, an artificial intelligence-based algorithm, calculates similarity based on the form of each character. Unlike English, for which the algorithm was designed, Korean combines letters of the Hangul alphabet, which correspond to the letters of the English alphabet, into syllable blocks. The data clustering unit 140 can therefore break each Hangul syllable down into its individual letters, as with an alphabet, and calculate similarity on the separated letters.
  • for example, in the case of 'detailed unknown pneumonia', the data clustering unit 140 may separate the term into its individual Hangul letters (jamo) and then calculate similarity. In the case of English, this process can be omitted.
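  • The separation of Hangul syllables into initial, medial, and final letters can be illustrated with Python's standard `unicodedata` module. Using Unicode NFD normalization for this step is an assumption made for the sketch, not something the application specifies, and the function name `to_jamo` and the sample word are hypothetical.

```python
import unicodedata


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining jamo.

    Unicode canonical decomposition (NFD) splits each Hangul syllable
    into its initial, medial, and optional final letter. Plain ASCII
    text is left unchanged.
    """
    return unicodedata.normalize("NFD", text)


word = "한글"  # sample word, two syllables
jamo = to_jamo(word)
print(len(word), len(jamo))  # 2 6
print([hex(ord(ch)) for ch in to_jamo("한")])  # ['0x1112', '0x1161', '0x11ab']
print(to_jamo("pneumonia") == "pneumonia")  # True
```

  • After decomposition, an edit distance computed on the result counts differences between individual letters instead of whole syllable blocks, which is the effect the phoneme separation is meant to achieve.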
  • the data clustering unit 140 may calculate and group the similarity between unstructured data selected by the user.
  • the data clustering unit 140 may calculate the similarity between the phoneme-separated terms and, when the similarity exceeds a certain value, cluster the terms under the recommended terms in order of recommended-term priority.
  • the data clustering unit 140 may cluster original terms whose similarity calculation values are greater than or equal to a preset threshold.
  • the threshold may be modified and changed according to the user's convenience. In other words, the threshold value may be changed based on the threshold correction information included in the term clustering performance information received by the user input receiving unit 160.
  • the data clustering unit 140 may perform a similarity calculation for each of a plurality of terms. For example, the data clustering unit 140 may perform a similarity calculation between the first original term (unspecified pneumonia) and the second original term (backbone sprain), and between the first original term (unspecified pneumonia) and the third original term (acute hepatitis). The data clustering unit 140 may perform the similarity calculation between the first original term and each of the other original terms, up to the n-th, included in the column. The similarity calculation performed by the data clustering unit 140 may then be used by the recommended term determining unit 130 to determine the recommended term.
  • the data clustering unit 140 may separate the data, prioritized as recommended terms by the recommended term determining unit 130, into phoneme units and cluster it.
  • clustering may group original terms whose edit-distance-based similarity exceeds a certain value.
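  • Putting these steps together, one possible reading of the clustering procedure is the greedy Python sketch below: terms are visited in descending order of recommendation score, each unassigned term becomes a representative, and every remaining term whose similarity meets the threshold joins its cluster. The greedy strategy, the removal of whitespace before comparison, the similarity normalization, and all names and sample scores are assumptions for illustration only.

```python
import unicodedata
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Percentage similarity on jamo-decomposed, whitespace-free text."""
    # Dropping whitespace is an assumption, motivated by spacing variants.
    a = unicodedata.normalize("NFD", "".join(a.split()))
    b = unicodedata.normalize("NFD", "".join(b.split()))
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def cluster_terms(scores: Dict[str, float],
                  threshold: float) -> Dict[str, List[Tuple[str, float]]]:
    """Greedy clustering in descending order of recommendation score."""
    ordered = sorted(scores, key=lambda term: -scores[term])
    assigned = set()
    clusters: Dict[str, List[Tuple[str, float]]] = {}
    for representative in ordered:
        if representative in assigned:
            continue
        members = []
        for term in ordered:
            if term in assigned:
                continue
            value = similarity(representative, term)
            if value >= threshold:
                members.append((term, value))
                assigned.add(term)
        clusters[representative] = members
    return clusters


# Hypothetical recommendation scores for four original terms.
scores = {"neck sprain": 50, "necksprain": 40,
          "neck-sprain": 30, "acute hepatitis": 45}
for recommended, members in cluster_terms(scores, threshold=80).items():
    print(recommended, [term for term, _ in members])
```

  • With these sample values, the three spelling variants of 'neck sprain' fall into one cluster under the highest-scoring variant, and 'acute hepatitis' forms its own cluster.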
  • FIG. 8 is a view showing results of the term clustering in the term clustering device according to an embodiment of the present application by way of example.
  • the recommended term determining unit 130 may determine the recommended term among the plurality of original terms based on the recommendation score and the clustering result.
  • for example, the recommended term determining unit 130 may determine a recommended term among the plurality of original terms based on the first clustering result 11, which includes the first original term (1), the second original term (2), the third original term (3), the fourth original term (4), and so on. Specifically, the recommended term may be determined based on the recommendation scores of the first original term (1), the second original term (2), the third original term (3), the fourth original term (4), the fifth original term (5), the sixth original term (6), and the seventh original term (7) included in the first clustering result 11.
  • the recommended term determining unit 130 may select the original term having the highest recommended score among the original terms included in the first clustering result 11 as the recommended term.
  • the first clustering result 11 may be a result of clustering the original terms above a predetermined threshold in the data clustering unit 140.
  • for example, the recommendation score of the first original term may be 377, that of the second original term 323, that of the third original term 323, and that of the fourth original term 310.
  • in this case, the recommended term determining unit 130 may determine the first original term ('unspecified pneumonia'), which has the highest recommendation score among the first to fourth original terms, as the recommended term.
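  • Selecting the recommended term within one cluster then reduces to taking the member with the highest recommendation score, as in the Python sketch below. The scores are those of the example above, while the function name `pick_recommended`, the placeholder term labels, and the rule that ties go to the earlier entry are assumptions.

```python
from typing import Dict


def pick_recommended(cluster_scores: Dict[str, float]) -> str:
    """Return the term with the highest recommendation score.

    max() returns the first maximal entry, so ties are resolved in
    favor of the earlier term (an assumption).
    """
    return max(cluster_scores, key=cluster_scores.get)


cluster_scores = {
    "original term 1": 377,
    "original term 2": 323,
    "original term 3": 323,
    "original term 4": 310,
}
print(pick_recommended(cluster_scores))  # original term 1
```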
  • the similarity value may be the similarity between the first original term ('unspecified pneumonia'), which was determined as the recommended term, and the second original term.
  • the recommended term determining unit 130 selects the original term having the highest recommendation score among the plurality of original terms included in the first clustering result 11 as the recommended term, and the similarity may be the similarity value between the recommended term and each original term.
  • as another example, the terms (data) clustered under the recommended term 'backbone sprain' are 'neck back sprain', 'sprain back sprain', 'neck sprain', 'back sprain', and 'neck sprain'.
  • it can be seen that terms written in different forms because of the special symbols, spaces, particles, and conjunctions they contain are clustered under the recommended term 'neck lumbar sprain'.
  • the data result unit 150 may provide the cluster result of the data clustering unit 140.
  • Cluster results may include recommended terms, original terms, and similarity values.
  • the original term may be a term included in the column.
  • the original term may be data of an initial value included in the database 110.
  • the disease name may correspond to the original term.
  • the recommended term may be a term determined among a plurality of original terms based on a recommendation score and a clustering result.
  • the similarity may be a similarity value between the recommended term and the original term.
  • the similarity may be the similarity value between original terms, calculated after separation into phonemes by the data clustering unit 140.
  • the data result unit 150 may provide the user with the original terms clustered under each recommended term, together with the similarity between the recommended term and each original term expressed as a percentage.
  • the similarity value between the recommended term 'detailed unknown pneumonia' and the first original term 'detailed unknown pneumonia' in the first clustering result 11 may be 100.
  • the similarity value between the recommended term 'detailed pneumonia' and the second original term 'detailed pneumonia' in the first clustering result 11 may be 94.
  • the similarity value between the recommended term 'detailed pneumonia' and the third original term 'detailed pneumonia' in the first clustering result 11 may be 100. The recommended term and the third original term are the same term apart from spacing, but with an existing clustering method they might not be clustered together. Because the data clustering unit 140 separates the terms into phonemes before performing the similarity calculation, the problem of terms not being clustered because of spacing can be solved.
  • the data result unit 150 may check, store, and modify the results of the data clustering unit 140.
  • the clustering result that can be checked in the data result unit 150 is the result of clustering the data included in a column item of the selected data set.
  • the clustering result shows the original data contained in the column item of the data set, the recommended term, and the similarity between the recommended term and the original data. The user can check the clustering result in the data result unit 150 and correct the recommended term.
  • the data result unit 150 may modify the recommended term based on the data term clustering result.
  • the data result unit 150 may modify the recommended term in response to a modification request provided from the user terminal.
  • the recommended term is a term that the user wants to use as the standard, and the user can modify it for convenience.
  • the data result unit 150 may store a term (data) clustering result.
  • the term (data) clustering result can be stored for the target data in a form desired by the user, such as in the database 110.
  • instead of overwriting the original term with the recommended term, the user can store the recommended term in a new column (a column containing the recommended term) and thereby recheck the cluster result.
  • the user input receiving unit 160 may receive term clustering performance information from a user terminal (not shown).
  • the term clustering performance information may include user input information for determining a first data set to perform term clustering among data sets included in the database.
  • the term clustering performance information may include user input information for determining the part of speech when separating the morphemes of the original terms included in the pre-processed data.
  • the term clustering performance information may include user input information for setting a threshold for distinguishing similarity between each original term separated by phonology.
  • the user input receiving unit 160 may provide a term clustering menu to a user terminal (not shown).
  • a user terminal (not shown) may download and install an application program provided by the term clustering device 100, and the term clustering menu may be provided through the installed application.
  • the user input receiving unit 160 may transmit and receive data, content, and various communication signals to and from a user terminal (not shown) through a network, and may include any type of server, terminal, or device having data storage and processing functions.
  • a user terminal is a device interworking with the user input receiving unit 160 through a network, for example, a smart phone, a smart pad, a tablet PC, a wearable device, and a PCS (Personal Communication System). ), Global System for Mobile Communication (GSM), Personal Digital Cellular (PDC), Personal Handyphone System (PHS), Personal Digital Assistant (PDA), International Mobile Telecommunication (IMT)-2000, Code Division Multiple Access (CDMA)-2000 , W-CDMA (W-Code Division Multiple Access), Wibro (Wireless Broadband Internet), and all kinds of wireless communication devices and desktop computers, fixed terminals such as smart TVs.
  • examples of networks for sharing information between the user input receiving unit 160 and a user terminal may include, but are not limited to, a 3rd Generation Partnership Project (3GPP) network, a Long Term Evolution (LTE) network, a 5G network, a Worldwide Interoperability for Microwave Access (WiMAX) network, the wired and wireless Internet, a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), a Personal Area Network (PAN), a Bluetooth network, a Wi-Fi network, a Near Field Communication (NFC) network, a satellite broadcasting network, an analog broadcasting network, and a Digital Multimedia Broadcasting (DMB) network.
  • FIG. 9 is an operation flowchart of a method for clustering terms of unstructured text data for big data analysis according to an embodiment of the present application.
  • the method for term clustering of unstructured text data for big data analysis shown in FIG. 9 may be performed by the term clustering device 100 described above. Therefore, even where omitted below, the description given for the term clustering device 100 applies equally to the description of the method.
  • in step S901, the term clustering device 100 may select data from a data set included in the database and perform preprocessing.
  • in step S902, the term clustering device 100 may separate the morphemes of the original terms included in the preprocessed data and calculate a recommendation score for each original term.
  • in step S903, the term clustering device 100 may separate the original terms into phonemes, perform a similarity calculation between the phoneme-separated original terms, and cluster the original terms whose similarity value is greater than or equal to a preset threshold.
  • in step S904, the term clustering device 100 may determine a recommended term from among the plurality of original terms based on the recommendation scores and the clustering result.
  • steps S901 to S904 may be divided into additional steps or combined into fewer steps, depending on the embodiment. Some steps may also be omitted as necessary, and the order of the steps may be changed.
  • the term clustering method of unstructured text data for big data analysis may be implemented in the form of program instructions that can be performed through various computer means and may be recorded in a computer readable medium.
  • the computer-readable medium may include program instructions, data files, data structures, or the like alone or in combination.
  • the program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • examples of program instructions include machine language code produced by a compiler as well as high-level language code that can be executed by a computer using an interpreter or the like.
  • the hardware device described above may be configured to operate as one or more software modules to perform the operation of the present invention, and vice versa.
  • the method for term clustering of unstructured text data for big data analysis described above may also be implemented in the form of a computer program or application that is stored in a recording medium and executed by a computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Automation & Control Theory (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a device for term clustering of unstructured text data for big data analysis, and the device for term clustering of unstructured text data for big data analysis may comprise: a database including a data set; a data preprocessing unit for selecting and preprocessing data from a data set included in the database; a recommended term determination unit for separating morphemes of original terms included in preprocessed data, and calculating recommended scores of the original terms; and a data clustering unit for separating phonemes of original terms, calculating the similarity between the respective original terms separated into phonemes, and clustering an original term having a similarity calculation value equal to or greater than a preconfigured threshold value, wherein the recommended term determination unit determines a recommended term among the plurality of original terms on the basis of the recommended scores and a result of the clustering.

Description

Apparatus and method for term clustering of unstructured text data for big data analysis
The present application relates to an apparatus and method for term clustering of unstructured text data for big data analysis.
In big data analysis, 70% to 80% of the total effort goes into data preprocessing. Big data analysis is growing explosively, but data preprocessing technology has not advanced as quickly as big data analysis technology, so the need for automated data preprocessing technology is emerging.
As public information is opened up, the demand for unstructured data analysis is growing and the importance of preprocessing unstructured data is being emphasized. Nevertheless, most preprocessing is still done manually.
The algorithm most widely used to calculate the similarity between items of text data is the fuzzy matching algorithm. It calculates similarity from a value computed on the basis of the edit distance (Levenshtein distance).
The fuzzy matching algorithm only calculates the similarity between two data items. By applying it, data items within a data set that share a certain degree of similarity can be clustered. However, because the fuzzy matching algorithm was developed for English, applying it to Korean has the problem that similarity is calculated on the basis of syllables rather than phonemes.
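The syllable-level limitation described above can be illustrated with a minimal sketch. The function names and the normalisation of the distance into a 0 to 1 similarity are assumptions made for illustration, and 폐염 is a hypothetical misspelling of 폐렴 ('pneumonia'); the source does not specify a particular implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; dividing by the longer length is an assumed normalisation."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Each Hangul syllable is one character, so a single wrong consonant
# counts as a whole-syllable substitution.
print(levenshtein("폐렴", "폐염"), similarity("폐렴", "폐염"))
```

Here the two strings differ in one consonant only, yet half of the two-syllable term is counted as different.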
Morphological analysis is a core technique in natural language processing: it separates words or sentences into morphemes, the smallest meaningful units of language, and determines the part of speech of each separated morpheme. By checking the frequencies obtained through morphological analysis of a data set, the key morphemes in that data set can be identified.
The background art of the present application is disclosed in Korean Patent Application Publication No. 10-2016-0075974.
The present application is intended to solve the problems of the prior art described above. It aims to provide an apparatus and method for term clustering of unstructured text data for big data analysis that overcome the difficulty of preprocessing unstructured big data text by clustering similar terms in the data, thereby facilitating big data analysis.
The present application also aims to provide an apparatus and method for term clustering of unstructured text data for big data analysis that recommend a representative term when clustering the terms in a data set, thereby reducing the time the user must spend on manual preprocessing of unstructured data.
The present application also aims to provide an apparatus and method for term clustering of unstructured text data for big data analysis that apply morphological analysis to recommend a representative term to the user automatically, thereby reducing the time the user spends selecting a representative word.
The present application also aims to provide an apparatus and method for term clustering of unstructured text data that address the problem of the edit distance of Korean data being calculated in syllable units during similarity calculation. Clustering is performed by separating each syllable into phonemes before the calculation, which corrects human errors such as user typos and helps standardize unstructured data.
However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.
As a technical means for achieving the above objects, a device for term clustering of unstructured text data for big data analysis according to an embodiment of the present application may include: a database including a data set; a data preprocessing unit that selects data from the data set included in the database and performs preprocessing; a recommended term determination unit that separates the morphemes of the original terms included in the preprocessed data and calculates a recommendation score for each original term; and a data clustering unit that separates the original terms into phonemes, performs a similarity calculation between the phoneme-separated original terms, and clusters the original terms whose similarity value is greater than or equal to a preset threshold, wherein the recommended term determination unit determines a recommended term from among the plurality of original terms based on the recommendation scores and the clustering result.
The preprocessing unit according to an embodiment of the present application may determine, among the data sets included in the database, a first data set on which term clustering is to be performed, select, from among the plurality of column items of the first data set, a first column on which term clustering is to be performed, and preprocess the data of the selected column item.
The preprocessing unit according to an embodiment of the present application may perform preprocessing that removes duplicate terms and data containing no term from the selected column.
The recommended term determination unit according to an embodiment of the present application may calculate the recommendation score using a value that quantifies the extraction frequency of the separated morphemes and a weight derived from the separated morphemes, where the weight may be the proportion of the original term that the morpheme accounts for.
The recommendation score according to an embodiment of the present application may be calculated using the extraction frequency of each separated morpheme, based on the value quantifying that frequency, together with the sum of the results of dividing the length of each of the plurality of morphemes separated from a first original term by the total length of the first original term.
When the data is in Hangul, the data clustering unit according to an embodiment of the present application may separate it into phonemes according to the Hangul jamo, namely the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and may calculate similarity using an artificial intelligence-based algorithm.
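The separation into initial, medial, and final jamo can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). The function names are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in similarity measure; the source does not say which algorithm the data clustering unit uses.

```python
from difflib import SequenceMatcher

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, "" = none


def decompose(text: str) -> str:
    """Split each precomposed Hangul syllable into jamo; other characters pass through."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            index = code - 0xAC00
            out.append(CHOSEONG[index // 588])          # 588 = 21 * 28
            out.append(JUNGSEONG[(index % 588) // 28])
            out.append(JONGSEONG[index % 28])
        else:
            out.append(ch)
    return "".join(out)


def jamo_similarity(a: str, b: str) -> float:
    """Similarity computed on jamo sequences (stand-in measure, not the patent's)."""
    return SequenceMatcher(None, decompose(a), decompose(b)).ratio()


print(decompose("폐렴"))
print(jamo_similarity("폐렴", "폐염"))
```

With this decomposition, a one-consonant typo changes one element out of five rather than one syllable out of two, so the similarity value is higher than at syllable level.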
The term clustering device according to an embodiment of the present application may further include a data result unit that provides the clustering result of the data clustering unit, wherein the clustering result may include the recommended term, the original term, and the similarity value.
The term clustering device according to an embodiment of the present application may further include a user input receiving unit that receives term clustering performance information from a user terminal.
When separating morphemes, the recommended term determination unit according to an embodiment of the present application may perform morpheme separation on the preprocessed data based on the part-of-speech determination information included in the term clustering performance information.
A method for term clustering of unstructured text data for big data analysis according to an embodiment of the present application may include: selecting data from a data set included in a database and performing preprocessing; separating the morphemes of the original terms included in the preprocessed data and calculating a recommendation score for each original term; separating the original terms into phonemes, performing a similarity calculation between the phoneme-separated original terms, and clustering the original terms whose similarity value is greater than or equal to a preset threshold; and determining a recommended term from among the plurality of original terms based on the recommendation scores and the clustering result.
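The clustering and recommendation steps can be sketched as follows. This is one possible reading, not the method as claimed: the greedy assignment strategy, the function names, the default threshold of 0.8, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions, and the recommendation scores are taken as given inputs.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def ratio(a: str, b: str) -> float:
    """Stand-in similarity measure in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_terms(terms: List[str],
                  scores: Dict[str, float],
                  threshold: float = 0.8,
                  sim: Callable[[str, str], float] = ratio) -> Dict[str, List[str]]:
    """Greedy clustering: visit terms from highest to lowest recommendation score.

    A term joins the first existing cluster whose recommended term is at least
    `threshold` similar; otherwise it becomes the recommended term of a new cluster.
    """
    ordered = sorted(terms, key=lambda t: scores.get(t, 0.0), reverse=True)
    clusters: Dict[str, List[str]] = {}
    for term in ordered:
        for recommended in clusters:
            if sim(recommended, term) >= threshold:
                clusters[recommended].append(term)
                break
        else:
            clusters[term] = [term]
    return clusters


terms = ["pneumonia", "pnuemonia", "colon polyp"]
scores = {"pneumonia": 3.0, "pnuemonia": 1.0, "colon polyp": 2.0}
print(cluster_terms(terms, scores))
```

Because the highest-scoring term in each cluster is visited first, it doubles as the recommended term for that cluster; a phoneme-level similarity function can be passed as `sim` for Hangul data.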
The above-described means for solving the problems are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.
According to the above-described means, an item is selected from a data set in the database, recommended terms are selected through morphological analysis, syllables are converted into phoneme units, and term clustering is performed, so that similar terms in the data set can be clustered.
According to the above-described means, term clustering is performed after morphological analysis and prioritized recommended terms are provided to the user, so that terms can be recommended and clustered with high precision.
According to the above-described means, terms are clustered by calculating similarity in phoneme units, so that notation errors such as typos can also be replaced with the recommended term in the data set.
According to the above-described means, notation errors and differently expressed terms in the data are clustered under a recommended term, so that unstructured big data can be classified more precisely.
However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.
FIG. 1 is a schematic configuration diagram of a term clustering device according to an embodiment of the present application.
FIG. 2 is a diagram schematically showing part of the data items on which the term clustering device according to an embodiment of the present application performs clustering.
FIG. 3 is a diagram for explaining morpheme separation in the term clustering device according to an embodiment of the present application.
FIG. 4 is a diagram showing an example of the result of ranking morpheme frequencies in reverse order in the term clustering device according to an embodiment of the present application.
FIG. 5 is a diagram showing an example of the result of calculating the recommendation scores of terms in the term clustering device according to an embodiment of the present application.
FIG. 6 is a diagram schematically illustrating the ranking of recommended terms according to the calculated recommendation scores in the term clustering device according to an embodiment of the present application.
FIG. 7 is a diagram showing an example of the separation of recommended terms into phonemes in the term clustering device according to an embodiment of the present application.
FIG. 8 is a diagram showing an example of a term clustering result in the term clustering device according to an embodiment of the present application.
FIG. 9 is an operation flowchart of a method for term clustering of unstructured text data for big data analysis according to an embodiment of the present application.
Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present application pertains can readily practice them. However, the present application may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present application clearly, and similar parts are given similar reference numerals throughout the specification.
Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element in between.
Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where another member is present between the two members.
Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.
FIG. 1 is a schematic configuration diagram of a term clustering device according to an embodiment of the present application.
Referring to FIG. 1, the term clustering device 100 may include a database 110, a data preprocessing unit 120, a recommended term determination unit 130, a data clustering unit 140, a data result unit 150, and a user input receiving unit 160.
According to an embodiment of the present application, the data term clustering device 100 may select a single column item from a data set and first select recommended terms within the selected column through a weight calculation based on morphological analysis. The data term clustering device 100 may cluster the original terms based on the recommended terms and a similarity calculation. The similarity calculation may include a preprocessing step that separates syllables into phonemes. In addition, if a recommended term does not represent the clustered original terms, the user may enter an arbitrary recommended term to cluster the original terms.
In addition, the data term clustering device 100 may cluster the original terms in a column of the selected data set using an automated term clustering algorithm. The data term clustering device 100 may also provide recommended terms using a weight calculation based on morphological analysis, thereby providing a user-friendly method for term clustering of unstructured text data for big data analysis.
The database 110 may include a data set used for term clustering. The database 110 may include unstructured data. Unstructured data (also called non-structured data) may refer to information that has no predefined data model or is not organized in a predefined manner. Unlike numeric data, which has a fixed format or form, unstructured data may refer to data such as pictures, videos, and documents whose form and structure vary.
FIG. 2 is a diagram schematically showing part of the data items on which the term clustering device according to an embodiment of the present application performs clustering.
Referring to FIG. 2, a data set included in the database 110 may include two or more column items. The column items included in the data set may be divided into a representative key and general columns. For example, in the data set of FIG. 2, the representative key may be 'patient ID' (환자 ID) and the general column may be 'disease name' (병명). The general column 'disease name' may consist of unstructured text data. Unstructured text data may include terms.
The data preprocessing unit 120 may select data from a data set included in the database 110 and perform preprocessing. The data preprocessing unit 120 may select and determine, from among the plurality of columns of the data set stored in the database 110, the column on which data clustering is to be performed.
In other words, the data preprocessing unit 120 may determine, among the data sets included in the database 110, a first data set on which term clustering is to be performed, select, from among the plurality of column items of the first data set, a first column on which term clustering is to be performed, and preprocess the data of the selected column item. For example, referring to FIG. 2, the data preprocessing unit 120 may determine the first data set among the plurality of data sets included in the database 110, select the 'disease name' column of the first data set, and preprocess the data of the selected 'disease name' column item.
The data preprocessing unit 120 may perform preprocessing that removes duplicate terms and data containing no term from the selected column. In other words, the data preprocessing unit 120 may perform a preprocessing step that deduplicates the data of the determined column and removes null values (data containing no term).
According to an embodiment of the present application, terms whose forms match exactly do not need clustering, and null values corresponding to blanks (data containing no term) likewise do not need term clustering, so the data preprocessing unit 120 may remove null values. The user may also replace null values (data containing no term) with another term as needed.
For example, the user input receiving unit 160 may receive term clustering information from the user terminal. The term clustering information may include a replacement term for replacing data that contains no term. In other words, the data preprocessing unit 120 may enter the replacement term provided by the user input receiving unit 160 into data that contains no term.
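The preprocessing of the selected column can be sketched as below. The function name and the `null_replacement` parameter are hypothetical, and treating whitespace-only strings as null values is an assumption; the source only states that duplicates and data containing no term are removed, or that nulls may be replaced by a user-supplied term.

```python
from typing import Iterable, List, Optional


def preprocess_column(values: Iterable[Optional[str]],
                      null_replacement: Optional[str] = None) -> List[str]:
    """Remove duplicates and null values from one column, keeping first-seen order.

    If `null_replacement` is given, null values are replaced by it instead of
    being dropped (assumed behaviour for the user-supplied replacement term).
    """
    seen = set()
    cleaned = []
    for value in values:
        term = value.strip() if value is not None else ""
        if not term:
            if null_replacement is None:
                continue            # drop data that contains no term
            term = null_replacement
        if term in seen:
            continue                # exact duplicates need no clustering
        seen.add(term)
        cleaned.append(term)
    return cleaned


column = ["폐렴", None, "폐렴", "  ", "결장 폴립 제거술"]
print(preprocess_column(column))
print(preprocess_column(column, null_replacement="미상"))
```

The second call uses 미상 ('unknown') purely as an illustrative replacement term.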
FIG. 3 is a diagram for explaining morpheme separation in the term clustering device according to an embodiment of the present application, FIG. 4 is a diagram showing an example of the result of ranking morpheme frequencies in reverse order in the term clustering device according to an embodiment of the present application, and FIG. 5 is a diagram showing an example of the result of calculating the recommendation scores of terms in the term clustering device according to an embodiment of the present application.
Referring to FIG. 3, the recommended term determination unit 130 may separate the morphemes of the original terms included in the preprocessed data. Morpheme separation may mean separating a sentence or text into its smallest units and automatically determining the part of speech of each morpheme. Morpheme separation may also mean separating an original term into morphemes, the smallest units that carry meaning. Morphological analysis is the most fundamental technique in natural language processing.
For example, referring to FIG. 3, the recommended term determination unit 130 may separate 'colon polyp removal' (결장 폴립 제거술) into 'colon' (결장), 'polyp' (폴립), and 'removal' (제거술). The recommended term determination unit 130 may also separate 'unspecified pneumonia' (상세불명 폐렴) into 'detail' (상세), 'unknown' (불명), and 'pneumonia' (폐렴). The recommended term determination unit 130 may perform morpheme separation based on a specific part of speech.
본원의 일 실시예에 따르면, 분리된 형태소는 가중치를 이용한 추천 용어의 우선순위 선정에 사용될 수 있다. 추천 용어 결정부(130)는 분리된 형태소의 빈도(Rank)를 정렬하여 순위화할 수 있다. 이를 통해 사용자는 사용자가 선택한 칼럼(데이터)에서 가장 많이 사용한 형태소를 확인할 수 있다. 예시적으로 추천 용어 결정부(130)는 분리된 형태소의 빈도(Rank)를 정렬하여 순위화한 결과를 도 4와 같이 정리할 수 있다. According to an embodiment of the present application, the separated morphemes may be used to prioritize recommended terms using weights. The recommended term determining unit 130 may rank the frequencies of the separated morphemes (Rank). Through this, the user can check the most frequently used morpheme in the column (data) selected by the user. For example, the recommended term determining unit 130 may arrange the results of ranking by sorting the frequency (Rank) of the separated morphemes as shown in FIG. 4.
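The frequency ranking described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the input is assumed to be already split into morphemes, the function name `rank_morphemes` and the sample data are hypothetical, and the tie-breaking rule (alphabetical) is an assumption because the text does not say how equal frequencies are ordered.

```python
from collections import Counter


def rank_morphemes(split_terms):
    """Count morphemes and give each a reverse-order rank value.

    The rarest morpheme gets 1 and the most frequent gets the largest
    value, so that frequent morphemes contribute more to a score.
    """
    counts = Counter(m for morphemes in split_terms for m in morphemes)
    # Ascending by frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda m: (counts[m], m))
    rank_value = {m: position + 1 for position, m in enumerate(ordered)}
    return rank_value, counts


# Hypothetical column, already separated into morphemes.
split_terms = [
    ["상세", "불명", "폐렴"],
    ["상세", "불명", "폐렴"],
    ["상세", "불명", "세균", "폐렴"],
    ["급성", "간염"],
    ["상세", "불명", "간염"],
]
rank_value, counts = rank_morphemes(split_terms)
for morpheme, value in sorted(rank_value.items(), key=lambda kv: -kv[1]):
    print(morpheme, counts[morpheme], value)
```

In a real pipeline the split would come from a morphological analyzer; that step is left out here so the ranking logic stays visible.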
The recommended term determination unit 130 may calculate a recommendation score for each original term from its separated morphemes. The score may be calculated from a numeric value derived from each morpheme's extraction frequency and a weight derived from the morpheme. For example, the numeric value derived from the extraction frequency may be the Rank (frequency) value shown in FIG. 4, and the weight may be the proportion of the original term that the morpheme occupies. As in FIG. 4, the recommended term determination unit 130 may rank the morphemes by frequency so that the most frequent morpheme takes the top rank (rank 1), and may use as the weight the share of the whole term that each morpheme accounts for. In other words, the ranks are converted to numeric values in reverse order, so that more frequent morphemes receive higher morpheme scores. The weight of a morpheme may be calculated as the ratio of the morpheme's length to the length of the term that contains it. The recommended term determination unit 130 may then calculate the recommendation score of the term by multiplying each morpheme's score by its weight and summing the products over the whole term.
The recommendation score may be calculated from the extraction frequency of each separated morpheme, expressed as a numeric value, and from the sum of the results of dividing the length of each morpheme separated from a first original term by the total length of the first original term. For example, when the first original term contains first to third morphemes, the recommendation score of the term may be calculated using the sum of the results of dividing the length of the first morpheme, the length of the second morpheme, and the length of the third morpheme, each by the total length of the first original term.
For example, the recommendation score may be expressed as [Equation 1].
[Equation 1]
Figure PCTKR2019002778-appb-I000001
Here, n is the number of morphemes separated from the term, and rank is the frequency rank (in reverse order).
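Equation 1 is available in this text only as an image reference. Read against the surrounding description (each morpheme's rank value multiplied by its length share, summed over the term), it is presumably of the following form. The symbols t (the original term), m_i (its i-th morpheme), and len (length in characters) are notation introduced here for illustration and are not taken from the source.

```latex
\mathrm{score}(t) \;=\; \sum_{i=1}^{n} \mathrm{rank}(m_i) \cdot \frac{\mathrm{len}(m_i)}{\mathrm{len}(t)}
```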
For example, referring to FIGS. 4 and 5, the ranked frequency values of '상세', '불명', and '폐렴' are 402, 399, and 330, respectively. The total length of '상세불명 폐렴' is 7 characters including the space, and each separated morpheme is 2 characters long. The weight of each morpheme is therefore its length, 2, divided by the total length, 7. As a result, the recommendation score of '상세불명 폐렴' is 402, 399, and 330 each multiplied by 2/7 and then summed, that is, (402 + 399 + 330) × 2/7 ≈ 323.1, or 323 points.
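The arithmetic of this worked example can be checked with a short sketch. The rank values 402, 399, and 330 come from the text; the function name `recommendation_score` is hypothetical, and rounding to the nearest integer is an assumption (the text reports only whole-number scores).

```python
def recommendation_score(term, morphemes, rank_value):
    """Sum of rank value times length share for each morpheme of the term."""
    total_length = len(term)  # spaces count toward the length, as in the example
    return sum(rank_value[m] * len(m) / total_length for m in morphemes)


rank_value = {"상세": 402, "불명": 399, "폐렴": 330}
morphemes = ["상세", "불명", "폐렴"]

with_space = recommendation_score("상세불명 폐렴", morphemes, rank_value)
without_space = recommendation_score("상세불명폐렴", morphemes, rank_value)
print(round(with_space), round(without_space))  # prints "323 377"
```

The variant without the space is one character shorter, so each morpheme carries a weight of 2/6 instead of 2/7 and the score rises to 377, which matches the score the description later gives for '상세불명폐렴'.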
According to an embodiment of the present application, the recommended term determination unit 130 calculates frequencies with all parts of speech taken into account during morphological analysis, but is not limited thereto. For example, the recommended term determination unit 130 may select the parts of speech to use based on part-of-speech selection information included in the term clustering execution information received through the user input receiving unit 160, and perform morphological analysis accordingly. In other words, the recommended term determination unit 130 may perform morpheme separation on the preprocessed data based on the part-of-speech selection information included in the term clustering execution information.
FIG. 6 is a diagram schematically illustrating the ranking of recommended terms according to the calculated recommendation scores in the term clustering device according to an embodiment of the present application.
Referring to FIG. 6, the recommended term determination unit 130 may apply the recommendation score calculation described above and sort the terms so that those with higher recommendation scores come first.
According to an embodiment of the present application, the recommended term determination unit 130 may select a recommended term by calculating recommendation scores from the frequencies and weights of the separated morphemes. In addition, the recommended term determination unit 130 may determine the priority of recommended terms using the weights and the ranked frequencies. Equivalently, it may determine the priority using the ranked morphemes and weights derived from morpheme lengths.
FIG. 7 is a diagram showing an example of the phonemic decomposition of recommended terms in the term clustering device according to an embodiment of the present application.
Referring to FIG. 7, the data clustering unit 140 may decompose the original terms into phonemes and compute the similarity between the decomposed original terms. When an original term is written in Hangul, the data clustering unit 140 may decompose it into phonemes according to the Hangul jamo, that is, initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and compute the similarity using an artificial-intelligence-based algorithm. The data clustering unit 140 may compute the similarity between original terms using a Fuzzy Data Matching algorithm, but is not limited thereto. The Fuzzy Data Matching algorithm matches data using a result computed from the edit distance (Levenshtein distance).
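For reference, the edit distance on which this matching relies can be sketched with the classic dynamic-programming recurrence. The distance function below is the standard Levenshtein algorithm; the `similarity` helper, which turns the distance into a percentage by normalizing with the longer string's length, is an assumption, since the text does not state how the distance is converted into a similarity value.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in percent; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # prints 3
print(similarity("강", "공"))             # prints 0.0
```

At the syllable level '강' and '공' share nothing, which is the motivation the description gives for decomposing Hangul into jamo before comparing.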
The data clustering unit 140 may decompose terms, which are made up of syllables, into phonemes according to the Hangul jamo: initial, medial, and final. The reason is that the Fuzzy Data Matching algorithm, one example of an artificial-intelligence-based algorithm, computes similarity from the written form of the words alone. Unlike English, the language on which the algorithm is based, Korean forms each written character by combining Hangul jamo, which correspond to the letters of the English alphabet. The data clustering unit 140 can therefore unpack Hangul syllables into their jamo, as if they were alphabet letters, before computing similarity. For example, without phonemic decomposition '강' and '공' are entirely different characters, but after decomposition 'ㄱㅏㅇ' and 'ㄱㅗㅇ' are similar strings that differ only in the medial vowel. For this reason, the data clustering unit 140 may decompose an original term into jamo-level phonemes when it is written in Hangul.
As an example, referring to FIG. 7, the data clustering unit 140 may decompose '상세불명폐렴' into 'ㅅㅏㅇㅅㅔㅂㅜㄹㅁㅕㅇ ㅍㅖㄹㅕㅁ' and then compute the similarity. For English text, this step may be omitted.
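One common way to perform this decomposition uses the arithmetic layout of precomposed Hangul syllables in Unicode (U+AC00 to U+D7A3, ordered as 19 initials × 21 medials × 28 finals). The sketch below is an illustration under that assumption; the text does not say how the device implements the step, and the name `decompose` is hypothetical.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, index 0 = none

HANGUL_BASE = 0xAC00       # code point of the first precomposed syllable
SYLLABLE_COUNT = 11172     # 19 * 21 * 28


def decompose(text):
    """Replace each precomposed Hangul syllable with its jamo; keep other characters."""
    pieces = []
    for char in text:
        index = ord(char) - HANGUL_BASE
        if 0 <= index < SYLLABLE_COUNT:
            pieces.append(CHOSEONG[index // 588])           # 588 = 21 * 28
            pieces.append(JUNGSEONG[(index % 588) // 28])
            pieces.append(JONGSEONG[index % 28])
        else:
            pieces.append(char)  # spaces, Latin letters, symbols pass through
    return "".join(pieces)


print(decompose("강"), decompose("공"))  # prints "ㄱㅏㅇ ㄱㅗㅇ"
```

Non-Hangul characters are passed through unchanged, which is consistent with the remark that the step can be skipped for English text.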
According to an embodiment of the present application, the data clustering unit 140 may compute the similarity between the unstructured data items selected by the user and cluster them. The data clustering unit 140 may compute the similarity between the phoneme-decomposed terms and, when the similarity exceeds a given value, assign the terms to recommended-term groups according to the priority of the recommended terms. In other words, the data clustering unit 140 may cluster original terms whose similarity value is greater than or equal to a preset threshold. The threshold on the similarity value may be modified to suit the user. That is, the threshold may be changed based on threshold modification information included in the term clustering execution information received by the user input receiving unit 160.
For example, referring to FIG. 7, the data clustering unit 140 may compute the similarity for each of a plurality of terms. For example, the data clustering unit 140 may compute the similarity between a first original term ('상세불명폐렴', unspecified pneumonia) and a second original term ('목뼈허리뼈염좌', cervical and lumbar sprain). It may also compute the similarity between the first original term ('상세불명폐렴') and a third original term ('급성간염', acute hepatitis). The data clustering unit 140 may compute the similarity for each of the first to n-th original terms included in the column. The similarities computed by the data clustering unit 140 may later be used by the recommended term determination unit 130 to determine the recommended term.
According to an embodiment of the present application, the data clustering unit 140 may decompose the data prioritized as recommended terms by the recommended term determination unit 130 into phonemes and cluster them. Clustering uses edit-distance-based similarity as the criterion: original terms are clustered together when their similarity exceeds a given value.
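The threshold-based grouping can be sketched as a greedy pass over the terms in priority order. Several things here are assumptions rather than details from the text: the greedy strategy itself, the comparison against each group's first (highest-priority) member, the use of "greater than or equal to" for the threshold, and the default similarity function, which uses Python's `difflib.SequenceMatcher` as a stand-in instead of the edit-distance measure the description names. Jamo decomposition is omitted to keep the example short.

```python
from difflib import SequenceMatcher


def difflib_similarity(a, b):
    """Stand-in similarity in percent (difflib ratio, not edit distance)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def cluster_terms(terms_by_priority, threshold, similarity=difflib_similarity):
    """Group terms under the first earlier term they are similar enough to.

    terms_by_priority must be sorted with the highest recommendation
    score first, so each group's key is its highest-priority member.
    """
    clusters = {}  # representative term -> list of member terms
    for term in terms_by_priority:
        for representative in clusters:
            if similarity(representative, term) >= threshold:
                clusters[representative].append(term)
                break
        else:
            clusters[term] = [term]  # no close group found: start a new one
    return clusters


terms = ["상세불명폐렴", "상세불명의폐렴", "상세불명 폐렴", "급성간염"]
clusters = cluster_terms(terms, threshold=90)
for representative, members in clusters.items():
    print(representative, members)
```

Any function that returns a percentage, such as an edit-distance similarity computed on jamo strings, can be passed as `similarity` without changing the grouping logic.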
FIG. 8 is a diagram showing an example term clustering result in the term clustering device according to an embodiment of the present application.
Referring to FIG. 8, the recommended term determination unit 130 may determine a recommended term from among a plurality of original terms based on the recommendation scores and the clustering result.
The recommended term determination unit 130 may determine a recommended term from among a plurality of original terms based on a first clustering result 11 that includes a first original term (1), a second original term (2), a third original term (3), a fourth original term (4), and so on. The recommended term determination unit 130 may determine the recommended term based on the recommendation scores of the first original term (1), the second original term (2), the third original term (3), the fourth original term (4), the fifth original term (5), the sixth original term (6), and the seventh original term (7) included in the first clustering result 11. The recommended term determination unit 130 may select, as the recommended term, the original term with the highest recommendation score among those in the first clustering result 11. The first clustering result 11 may be the result of the data clustering unit 140 clustering original terms whose similarity is at or above the preset threshold.
For example, the recommendation score of the first original term ('상세불명폐렴') may be 377, that of the second original term ('상세불명의폐렴') may be 323, that of the third original term ('상세불명 폐렴') may be 323, and that of the fourth original term ('상세불명세균폐렴') may be 310. The recommended term determination unit 130 may determine the first original term ('상세불명폐렴'), which has the highest recommendation score among the first to fourth original terms, as the recommended term. The similarity value may be the similarity between the first original term ('상세불명폐렴'), determined as the recommended term, and the second original term ('상세불명의폐렴'). In other words, the recommended term determination unit 130 selects the original term with the highest recommendation score among the plurality of original terms in the first clustering result 11 as the recommended term, and the similarity is the similarity value between the recommended term and each original term.
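Selecting the recommended term within a cluster amounts to taking the member with the highest score. The sketch below uses the four scores stated in the example; the function name `pick_recommended` is hypothetical, and the behaviour on ties (the first term listed wins) is an assumption, since the text does not address ties.

```python
def pick_recommended(cluster_scores):
    """Return the term with the highest recommendation score in one cluster."""
    # max() keeps the first maximal key, so ties go to the earlier term.
    return max(cluster_scores, key=cluster_scores.get)


cluster_scores = {
    "상세불명폐렴": 377,
    "상세불명의폐렴": 323,
    "상세불명 폐렴": 323,
    "상세불명세균폐렴": 310,
}
recommended = pick_recommended(cluster_scores)
print(recommended)  # prints "상세불명폐렴"
```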
For example, referring to FIG. 8, the data clustered under the recommended term '상세불명폐렴' are '상세불명폐렴', '상세불명의폐렴', '상세불명 폐렴', '상세불명세균폐렴', '상세불명의 폐렴', '상세 불명의 폐렴', and '상세불명의페렴'. Apart from differences in spacing, particles, and the like introduced by the users who entered the original terms, the original terms clustered under the recommended term are all expressed similarly to '상세불명폐렴'.
In addition, the terms (data) clustered under the recommended term '목뼈허리뼈염좌' are '목뼈허리뼈염좌', '목뼈염좌 허리뼈염좌', '목뼈염좌|허리뼈염좌', '목뼈염좌| 허리뼈염좌', '목뼈및허리뼈의염좌', '목뼈/허리뼈의염좌', '목뼈.허리뼈의염좌', and '목뼝염좌|허리뼈염좌'. This shows that terms whose written form differed because of special symbols, spacing, particles, or conjunctions were clustered under the recommended term '목뼈허리뼈염좌'.
For example, referring to FIG. 8, the data result unit 150 may provide the clustering result of the data clustering unit 140. The clustering result may include the recommended term, the original terms, and similarity values. Here, an original term may be a term contained in the column. In other words, an original term may be initial-value data held in the database 110. For example, a disease name may be an original term. The recommended term may be the term chosen from among the plurality of original terms based on the recommendation scores and the clustering result. The similarity may be the similarity value between the recommended term and an original term, computed by the data clustering unit 140 on the phoneme-decomposed terms. The data result unit 150 may provide the user with the original terms clustered under each recommended term, together with the similarity between the recommended term and each original term as a percentage.
For example, the similarity between the recommended term of the first clustering result 11, '상세불명폐렴', and the first original term, '상세불명폐렴', may be 100. The similarity between the recommended term '상세불명폐렴' and the second original term, '상세불명의폐렴', may be 94. The similarity between the recommended term '상세불명폐렴' and the third original term, '상세불명 폐렴', may be 100. With a conventional clustering method, the recommended term '상세불명폐렴' and the third original term '상세불명 폐렴' may fail to be clustered together because of the space, even though they are the same term. By computing similarity after phonemic decomposition, the data clustering unit 140 can resolve the problem of terms not being clustered because of spacing.
According to an embodiment of the present application, the data result unit 150 may display, store, and modify the results of the data clustering unit 140. The clustering result available in the data result unit 150 is the result of clustering the data contained in the selected column of the chosen data set. It shows the original data in that column together with the recommended term and the similarity between the recommended term and the original data. The user can review the clustering result in the data result unit 150 and modify the recommended term.
According to an embodiment of the present application, the data result unit 150 may modify a recommended term based on the term clustering result. The data result unit 150 may modify the recommended term in response to a recommended-term modification request received from the user terminal. In other words, the user can change the recommended term to one that is more convenient or that the user wants to standardize on.
According to an embodiment of the present application, the data result unit 150 may store the term (data) clustering result. Data for which term clustering is complete may be stored in the form the user wants, such as in the database 110. Rather than overwriting the original terms with the recommended terms, the recommended terms are stored in a new column, so the user can re-check the clustering result.
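The idea of keeping the original column and writing recommended terms to a new one can be sketched with plain dictionaries standing in for table rows. The column names `disease_name` and `recommended_term`, the function name, and the sample rows are all hypothetical; the text does not specify a storage format.

```python
def add_recommended_column(rows, source_column, recommended_for,
                           new_column="recommended_term"):
    """Return copies of the rows with the recommended term in a separate column.

    Terms without a recommendation keep their original value in the new column.
    The input rows are not modified.
    """
    return [
        {**row, new_column: recommended_for.get(row[source_column], row[source_column])}
        for row in rows
    ]


rows = [
    {"disease_name": "상세불명 폐렴"},
    {"disease_name": "상세불명의폐렴"},
    {"disease_name": "급성간염"},
]
recommended_for = {"상세불명 폐렴": "상세불명폐렴", "상세불명의폐렴": "상세불명폐렴"}
stored = add_recommended_column(rows, "disease_name", recommended_for)
for row in stored:
    print(row)
```

Because the source column is left intact, the mapping from each original term to its recommended term stays visible and can be reviewed or corrected later.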
According to an embodiment of the present application, the user input receiving unit 160 may receive term clustering execution information from a user terminal (not shown). The term clustering execution information may include user input that determines a first data set, among the data sets in the database, on which to perform term clustering. It may also include user input for determining the parts of speech to use when separating the original terms in the preprocessed data into morphemes. It may further include user input for setting the threshold applied to the similarity between the phoneme-decomposed original terms.
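The three pieces of user input listed above could be carried in a small structure such as the one below. This is purely illustrative: the class name, field names, the default threshold of 90, and the 0 to 100 range check are assumptions and do not come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteringRequest:
    """Hypothetical container for term clustering execution information."""
    dataset: str                      # data set on which to run term clustering
    column: str                       # column holding the original terms
    pos_tags: list = field(default_factory=list)  # empty list = all parts of speech
    threshold: float = 90.0           # similarity threshold in percent (assumed default)

    def __post_init__(self):
        if not 0 <= self.threshold <= 100:
            raise ValueError("threshold must be between 0 and 100")


request = ClusteringRequest(dataset="diagnoses", column="disease_name",
                            pos_tags=["noun"], threshold=85)
print(request)
```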
According to an embodiment of the present application, the user input receiving unit 160 may provide a term clustering menu to the user terminal (not shown). For example, the user terminal (not shown) may download and install an application program provided by the term clustering device 100, and the term clustering menu may be provided through the installed application.
The user input receiving unit 160 may include any kind of server, terminal, or device that exchanges data, content, and various communication signals with the user terminal (not shown) over a network and has data storage and processing functions.
The user terminal (not shown) is a device that works with the user input receiving unit 160 over a network. It may be, for example, a smartphone, smart pad, tablet PC, or wearable device; any kind of wireless communication device such as a PCS (Personal Communication System), GSM (Global System for Mobile communication), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), or WiBro (Wireless Broadband Internet) terminal; or a fixed terminal such as a desktop computer or smart TV.
Examples of networks for sharing information between the user input receiving unit 160 and the user terminal (not shown) include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5G network, a WiMAX (Worldwide Interoperability for Microwave Access) network, the wired and wireless Internet, a LAN (Local Area Network), a wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, a Wi-Fi network, an NFC (Near Field Communication) network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.
Hereinafter, the operation flow of the present application is briefly described based on the details given above.
FIG. 9 is an operation flowchart of a method for term clustering of unstructured text data for big data analysis according to an embodiment of the present application.
The method for term clustering of unstructured text data for big data analysis shown in FIG. 9 may be performed by the term clustering device 10 described above. Therefore, even where omitted below, the description given for the term clustering device 10 applies equally to the description of the method.
In step S901, the term clustering device 10 may select data from a data set included in the database and perform preprocessing.
In step S902, the term clustering device 10 may separate the original terms included in the preprocessed data into morphemes and calculate a recommendation score for each original term.
In step S903, the term clustering device 10 may decompose the original terms into phonemes, compute the similarity between the decomposed original terms, and cluster the original terms whose similarity value is greater than or equal to a preset threshold.
In step S904, the term clustering device 10 may determine a recommended term from among a plurality of original terms based on the recommendation scores and the clustering result.
In the above description, steps S901 to S904 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. In addition, some steps may be omitted as needed, and the order of the steps may be changed.
The term clustering method for unstructured text data for big data analysis according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.
In addition, the above-described term clustering method for unstructured text data for big data analysis may also be implemented in the form of a computer program or application that is stored on a recording medium and executed by a computer.
The foregoing description of the present application is provided for illustration, and a person of ordinary skill in the technical field to which the present application belongs will understand that it can easily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise, components described as distributed may be implemented in a combined form.
The scope of the present application is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

Claims (10)

  1. An apparatus for clustering terms of unstructured text data for big data analysis, the apparatus comprising:
    a database including data sets;
    a data pre-processing unit that selects data from the data sets included in the database and performs pre-processing;
    a recommended term determination unit that separates original terms included in the pre-processed data into morphemes and calculates a recommendation score for each original term; and
    a data clustering unit that separates the original terms into phonemes, performs a similarity calculation between the phoneme-separated original terms, and clusters the original terms whose similarity value is greater than or equal to a preset threshold,
    wherein the recommended term determination unit determines a recommended term from among a plurality of original terms based on the recommendation score and the clustering result.
  2. The term clustering apparatus of claim 1,
    wherein the pre-processing unit
    determines, from among the data sets included in the database, a first data set on which term clustering is to be performed, selects, from among a plurality of column items of the first data set, a first column on which term clustering is to be performed, and performs data pre-processing on the selected column item.
  3. The term clustering apparatus of claim 1,
    wherein the pre-processing unit
    performs pre-processing that removes, from the selected column, duplicate terms and data that contains no term.
  4. The term clustering apparatus of claim 1,
    wherein the recommended term determination unit
    calculates the recommendation score using a value quantifying the extraction frequency of the separated morphemes and a weight derived from the separated morphemes, and
    the weight is a value expressing, as a ratio, the proportion of a morpheme within the original term.
  5. The term clustering apparatus of claim 4,
    wherein the recommendation score
    is calculated using the extraction frequency of the separated morphemes, obtained from the value quantifying that extraction frequency, and the sum of the results of dividing the length of each of the plurality of morphemes separated from a first original term by the total length of the first original term.
  6. The term clustering apparatus of claim 1,
    wherein the data clustering unit,
    when the original term is in Hangul, separates it into phonemes according to the Hangul jamo, namely the initial consonant (choseong), medial vowel (jungseong), and final consonant (jongseong), and calculates the similarity using an artificial-intelligence-based algorithm.
  7. The term clustering apparatus of claim 1,
    further comprising a data result unit that provides the clustering result of the data clustering unit,
    wherein the clustering result includes the recommended term, the original terms, and the similarity value.
  8. The term clustering apparatus of claim 1,
    further comprising a user input receiving unit that receives term clustering execution information from a user terminal.
  9. The term clustering apparatus of claim 8,
    wherein the recommended term determination unit,
    when separating morphemes, performs morpheme separation on the pre-processed data based on part-of-speech determination information included in the term clustering execution information.
  10. A method for clustering terms of unstructured text data for big data analysis, the method comprising:
    selecting data from a data set included in a database and performing pre-processing;
    separating original terms included in the pre-processed data into morphemes and calculating a recommendation score for each original term;
    separating the original terms into phonemes, performing a similarity calculation between the phoneme-separated original terms, and clustering the original terms whose similarity value is greater than or equal to a preset threshold; and
    determining a recommended term from among a plurality of original terms based on the recommendation score and the clustering result.
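By way of illustration only, the steps recited in claims 1, 4 to 6, and 10 may be sketched in Python as follows. This is a minimal sketch under stated assumptions and does not represent the applicant's implementation. The claims leave the similarity algorithm open ("artificial-intelligence-based"), so `difflib.SequenceMatcher` is used here purely as a stand-in. The score formula (morpheme frequency multiplied by the morpheme's share of the term length, summed over the term's morphemes) is one possible reading of claims 4 and 5. Morpheme separation is assumed to have been performed beforehand by an external analyzer. All function names, the sample terms, and the 0.8 threshold are hypothetical.

```python
from __future__ import annotations

import unicodedata
from collections import Counter
from difflib import SequenceMatcher


def to_jamo(term: str) -> str:
    """Split Hangul syllables into initial/medial/final jamo via Unicode NFD."""
    return unicodedata.normalize("NFD", term)


def similarity(a: str, b: str) -> float:
    """Stand-in similarity on jamo sequences (assumption: not the claimed algorithm)."""
    return SequenceMatcher(None, to_jamo(a), to_jamo(b)).ratio()


def recommendation_scores(morphemes_by_term: dict[str, list[str]]) -> dict[str, float]:
    """Assumed formula: sum of frequency * (morpheme length / term length)."""
    # Extraction frequency of each morpheme across all original terms.
    freq = Counter(m for ms in morphemes_by_term.values() for m in ms)
    return {
        term: sum(freq[m] * len(m) / len(term) for m in ms)
        for term, ms in morphemes_by_term.items()
    }


def cluster_terms(terms: list[str], threshold: float) -> list[list[str]]:
    """Greedy single pass: join the first cluster whose seed term is similar enough."""
    clusters: list[list[str]] = []
    for term in terms:
        for cluster in clusters:
            if similarity(cluster[0], term) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


def recommend(morphemes_by_term: dict[str, list[str]], threshold: float = 0.8):
    """Return (recommended term, cluster members) for each cluster."""
    scores = recommendation_scores(morphemes_by_term)
    clusters = cluster_terms(list(morphemes_by_term), threshold)
    return [(max(c, key=scores.get), c) for c in clusters]


# Hypothetical pre-processed column values with morphemes already separated.
example = {
    "고객번호": ["고객", "번호"],
    "고객번오": ["고객", "번오"],  # misspelling that differs by a single jamo
    "회원번호": ["회원", "번호"],
    "주문일자": ["주문", "일자"],
}

for recommended, members in recommend(example):
    print(recommended, members)
```

In this sketch the misspelled term differs from "고객번호" by one jamo out of ten, so the two fall into the same cluster at the assumed threshold, and the correctly spelled term is selected as the recommended term because its morphemes occur more frequently. A production system would replace the stand-in similarity and the greedy pass with whatever algorithm the embodiment actually uses.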
PCT/KR2019/002778 2018-11-26 2019-03-11 Device and method for term clustering of unstructured text data for big data analysis WO2020111395A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020180147335A KR101975419B1 (en) 2018-11-26 2018-11-26 Device and method for terminology clustering informal text data for big data analysis
KR10-2018-0147335 2018-11-26

Publications (1)

Publication Number Publication Date
WO2020111395A1 (en) 2020-06-04

Family

ID=66656387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2019/002778 WO2020111395A1 (en) 2018-11-26 2019-03-11 Device and method for term clustering of unstructured text data for big data analysis

Country Status (2)

Country Link
KR (1) KR101975419B1 (en)
WO (1) WO2020111395A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102068715B1 (en) * 2019-06-05 2020-01-21 (주)위세아이텍 Outlier detection device and method which weights are applied according to feature importance degree
KR102046640B1 (en) * 2019-07-22 2019-12-02 (주)위세아이텍 Automatic terminology recommendation device and method for big data standardization
KR102351745B1 (en) * 2020-02-05 2022-01-17 정동윤 User Review Based Rating Re-calculation Apparatus and Method
KR102153259B1 (en) * 2020-03-24 2020-09-08 주식회사 데이터스트림즈 Data domain recommendation method and method for constructing integrated data repository management system using recommended domain
KR102362582B1 (en) * 2020-12-31 2022-02-15 렉스소프트 주식회사 Method, server and computer program product for preprocessing statistical data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101579544B1 (en) * 2014-09-04 2015-12-23 에스케이 텔레콤주식회사 Apparatus and Method for Calculating Similarity of Natural Language
KR20170037593A (en) * 2017-03-23 2017-04-04 주식회사 플런티코리아 Recommendation Reply Apparatus and Method
KR20180042710A (en) * 2016-10-18 2018-04-26 삼성에스디에스 주식회사 Method and apparatus for managing a synonymous item based on analysis of similarity
KR20180089011A (en) * 2017-01-31 2018-08-08 강태준 A System for Searching a Language Based on Big Data with a Peculiar Value
KR20180110713A (en) * 2017-03-29 2018-10-11 중앙대학교 산학협력단 Device and method for analyzing similarity of documents

Also Published As

Publication number Publication date
KR101975419B1 (en) 2019-05-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19891546

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19891546

Country of ref document: EP

Kind code of ref document: A1