TWI490704B - Related vocabulary generation system and method

Info

Publication number: TWI490704B
Application number: TW102108073A
Authority: TW
Inventors: Gen Ming Guo; yu ting Zhou; Xiu Mei Wu; Min Ru Dai; Ssu Han Chou
Original assignee: Univ Southern Taiwan Sci & Tec
Priority date: 2013-03-07
Filing date: 2013-03-07
Publication date: 2015-07-01
Also published as: TW201435624A

Description

Related vocabulary generation system and method

The present invention relates to a vocabulary generation system, and in particular to a related vocabulary generation system.

When searching for data, the choice of keywords determines both the precision of the search and the completeness of the results, so several related terms are usually used for retrieval. Choosing effective keywords then depends on the user's sensitivity to wording and familiarity with the documents of each field. In practice, keywords tend to be confined to a particular field and a particular language, so broader yet pertinent material cannot be found and the available research material is limited. Obtaining related terms that are both highly relevant to the keyword and reliable therefore helps data searching.

Current domestic database search systems, such as those for master's and doctoral theses or for patents, offer automatic Chinese-English translation of keywords and multi-keyword queries, but they do not offer queries expanded to related terms. Users must therefore rely on personal experience to come up with the various related keywords.

The present invention is a related vocabulary generation system and method that provide related terms with good reliability.

Accordingly, the present invention provides a related vocabulary generation system that includes an operation interface for inputting a keyword, a document database storing a plurality of document files, a data parsing module that normalizes the data stored in the document database, a vocabulary analysis module, and a correlation module. The vocabulary analysis module performs term comparison and importance evaluation based on the data output by the data parsing module and on the keyword. The correlation module performs correlation and judgment operations, calculates the correlation between the data output by the vocabulary analysis module and the keyword, and drives the operation interface to display the terms whose correlation is greater than or equal to a predetermined value.

The vocabulary analysis module of the above related vocabulary generation system has a synonym comparison unit, which identifies terms synonymous with the keyword according to a synonym table.

The vocabulary analysis module of the above related vocabulary generation system has a term frequency-inverse document frequency (TF-IDF) unit and a term-frequency unit. The TF-IDF unit calculates term importance using the TF-IDF algorithm, and the term-frequency unit calculates term importance from how often a term occurs.

The correlation module of the above related vocabulary generation system has a temporary storage unit capable of storing data and a prior-probability computation unit, which tallies and calculates the co-occurrence correlation between terms.

The vocabulary analysis module of the above related vocabulary generation system also has a natural vocabulary recognition unit that recognizes modifiers commonly used in natural language.

The document database of the above related vocabulary generation system has a patent data unit storing patent documents, a journal data unit storing journal documents, and a thesis data unit storing master's and doctoral theses.

In addition, the present invention provides a related vocabulary generation method that includes the following steps: a data parsing step, in which document data is normalized; a keyword step, in which a keyword is input; a synonym comparison step, in which at least one term synonymous with the keyword is looked up in a synonym table; an importance evaluation step, in which the importance of each term is calculated from the data obtained in the data parsing step; a correlation step, in which the correlation between each term and the keyword is tallied and calculated from the data obtained in the data parsing step and the importance evaluation step; and a list display step, in which the terms that are synonymous with the keyword and the terms whose correlation is greater than a predetermined value are displayed.

The specific features and effects of the present invention are as follows. The synonym comparison unit obtains the synonyms of the keyword by table lookup. After the term-frequency unit and the TF-IDF unit have determined the importance and the discriminating power of each term, the correlation module further computes the prior probability between each term and the keyword to evaluate how likely each term is to appear together with the keyword. The resulting correlations are then ranked and used as the criterion for generating related terms, which improves the reliability of the related terms.

(10) Related vocabulary generation system
(2) Operation interface
(3) Document database
(31) Patent data unit
(32) Journal data unit
(33) Thesis data unit
(34) Magazine data unit
(35) Conference data unit
(4) Data parsing module
(5) Vocabulary analysis module
(51) Synonym comparison unit
(52) TF-IDF unit
(53) Term-frequency unit
(54) Natural vocabulary recognition unit
(6) Correlation module
(61) Temporary storage unit
(62) Prior-probability computation unit
(71) Keyword step
(72) Data parsing step
(73) Synonym comparison step
(74) Importance evaluation step
(75) Correlation step
(76) List display step
(77) Association graph step

Figure 1 is a block diagram showing the related vocabulary generation system of the present invention. Figure 2 is a screen view showing an operation interface in which a keyword has been input and the related terms are displayed as a list. Figure 3 is a screen view showing the operation interface with the related terms displayed graphically. Figure 4 is a flowchart showing the related vocabulary generation method of the present invention. Figure 5 is a screen view showing the operation interface in which the threshold for term occurrence frequency is set.

As shown in Figures 1 to 4, a related vocabulary generation system (10) of the present invention includes an operation interface (2) that displays screens and accepts a keyword as input, a document database (3) storing a plurality of document files, a data parsing module (4) that normalizes the data stored in the document database (3), a vocabulary analysis module (5), and a correlation module (6).

The document database (3) has a patent data unit (31), a journal data unit (32), a thesis data unit (33), a magazine data unit (34), and a conference data unit (35). These store the files of patent documents, journals, master's and doctoral theses, magazines, and conference papers respectively, all of which can be normalized by the data parsing module (4).

The vocabulary analysis module (5) performs term comparison and importance evaluation based on the data output by the data parsing module (4) and on the keyword. It has a natural vocabulary recognition unit (54), a synonym comparison unit (51), a TF-IDF unit (52), and a term-frequency unit (53). The natural vocabulary recognition unit (54) recognizes and filters out modifiers commonly used in natural language, such as articles and measure words. The synonym comparison unit (51) looks up the keyword in several synonym tables to obtain the terms synonymous with it. In this preferred embodiment, the synonym table is a Chinese-English table.
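The filtering and synonym lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the table contents, the modifier list, and the function names `filter_modifiers` and `find_synonyms` are assumptions, since the patent does not publish its synonym tables or its modifier list.

```python
# Hypothetical data: the patent does not disclose its actual tables.
SYNONYM_TABLE = [
    {"ERP", "Enterprise Resource Planning", "企業資源規劃"},
]
COMMON_MODIFIERS = {"a", "an", "the", "一個", "一種"}


def filter_modifiers(tokens):
    """Drop common modifiers (articles, measure words) from a token list."""
    return [t for t in tokens if t.lower() not in COMMON_MODIFIERS]


def find_synonyms(keyword, table=SYNONYM_TABLE):
    """Return the other members of the synonym group that contains keyword."""
    key = keyword.casefold()
    for group in table:
        if any(term.casefold() == key for term in group):
            return sorted(t for t in group if t.casefold() != key)
    return []  # keyword not present in any group
```

Storing each synonym group as a set keeps the lookup symmetric: entering the Chinese term returns the English terms and vice versa, which matches the Chinese-English table of the embodiment.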

The term-frequency unit (53) and the TF-IDF unit (52) are used to calculate and evaluate the importance of each term. The TF-IDF unit (52) evaluates term importance using the term frequency-inverse document frequency method (TF-IDF): the more a term's occurrences are concentrated in particular documents, the more important it is. The term-frequency unit (53) evaluates term importance from how often a term occurs, so importance is positively correlated with occurrence frequency. Terms whose importance falls below a threshold are then eliminated according to the evaluation result, and the user can set the threshold for term occurrence frequency (as shown in Figure 5).
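The two importance measures and the threshold filter can be sketched as below. The patent names TF-IDF but gives no formula, so the common textbook weighting (relative term frequency times the logarithm of the inverse document frequency) is assumed here, as is taking each term's highest per-document score. All function names are illustrative.

```python
import math
from collections import Counter


def term_frequency(docs):
    """Total number of occurrences of each term across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def tf_idf(docs):
    """Highest per-document TF-IDF score of each term (assumed aggregation)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = {}
    for doc in docs:
        for term, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / doc_freq[term])
            scores[term] = max(scores.get(term, 0.0), score)
    return scores


def drop_below(scores, threshold):
    """Eliminate terms whose importance is below the threshold."""
    return {t: s for t, s in scores.items() if s >= threshold}
```

Each document is a list of tokens. A term that appears in every document scores zero under TF-IDF, which reflects the statement that importance rises when occurrences are concentrated in particular documents.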

The correlation module (6) performs correlation and judgment operations; it is used to calculate and judge the correlation between each term and the keyword. The correlation module (6) has a temporary storage unit (61) capable of storing data and a prior-probability computation unit (62). The prior-probability computation unit (62) tallies and calculates the co-occurrence correlation between terms. Specifically, it uses the Apriori algorithm to evaluate, given that a specific term appears, the probability that another term appears with it, and thereby evaluates the correlation between terms. The correlation is then compared with a predetermined value, and the terms whose correlation with the keyword is greater than or equal to the predetermined value are displayed together with the synonymous terms. The Apriori algorithm is a common, well-known method for mining association rules (AR) and is not a characteristic technique of the present invention, so it is not described further.
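The co-occurrence probability described above corresponds to what association-rule mining calls the confidence of a rule "keyword implies term". The sketch below computes only that quantity over a set of tokenized documents; it is not a full Apriori implementation (no candidate itemset generation or support pruning), and the function names and the choice of one document as one transaction are assumptions.

```python
from collections import Counter


def cooccurrence_confidence(docs, keyword):
    """Estimate P(term appears | keyword appears), one document per transaction."""
    with_keyword = [set(doc) for doc in docs if keyword in doc]
    if not with_keyword:
        return {}  # the keyword never appears, so nothing can be estimated
    counts = Counter()
    for terms in with_keyword:
        counts.update(terms - {keyword})
    total = len(with_keyword)
    return {term: count / total for term, count in counts.items()}


def related_terms(docs, keyword, min_confidence):
    """Keep terms whose confidence is greater than or equal to the cutoff."""
    confidence = cooccurrence_confidence(docs, keyword)
    return {t: p for t, p in confidence.items() if p >= min_confidence}
```

For example, if "erp" appears in three documents and "crm" appears in two of them, the estimated probability for "crm" is 2/3.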

The related vocabulary generation method can be executed with the related vocabulary generation system (10) and includes the following steps. Keyword step (71): the user inputs a keyword in the operation interface (2). In this preferred embodiment, the input is "ERP".

Data parsing step (72): the data parsing module (4) normalizes the document files stored in the document database (3).
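The patent does not say what normalization consists of. Purely as an assumption, the sketch below treats it as text cleanup that makes documents from different sources comparable: Unicode compatibility normalization (so full-width letters match half-width ones), case folding, and whitespace collapsing. The function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text):
    """Assumed normalization: NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width to half-width
    text = text.casefold()
    return " ".join(text.split())
```

This kind of cleanup matters for mixed Chinese-English corpora, where the same abbreviation may be typed in full-width or half-width characters.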

Next, synonym comparison step (73): the natural vocabulary recognition unit (54) of the vocabulary analysis module (5) recognizes and filters out modifiers commonly used in natural language, and the synonym comparison unit (51) looks up the terms synonymous with the keyword in a synonym table. In this preferred embodiment, the lookup yields "Enterprise Resource Planning" and "企業資源規劃" (the Chinese term for enterprise resource planning), two terms that are synonymous with "ERP" and are therefore related terms.

Besides synonymous related terms, terms that are correlated with the keyword can also be obtained. To do so, importance evaluation step (74) is performed first: the vocabulary analysis module (5) calculates the importance of each term from the data obtained in the data parsing step (72), using either term frequency or the TF-IDF method. The term-frequency unit (53) counts how often each term occurs, and importance is positively correlated with occurrence frequency. Alternatively, the TF-IDF unit (52) evaluates term importance using the TF-IDF method. Terms whose importance falls below the threshold are eliminated; specifically, terms with a low occurrence frequency are discarded.

Correlation step (75) follows: the correlation module (6) tallies and calculates the correlation between each term and the keyword from the data obtained in the data parsing step (72) and the importance evaluation step (74). Specifically, the prior-probability computation unit (62) of the correlation module (6) uses the prior-probability calculation to quantify the correlation between each term and the keyword, and the terms are then graded and sorted according to the correlation values.

List display step (76): the operation interface (2) displays the terms that are synonymous with the keyword and the terms whose correlation is greater than the predetermined value; that is, the related terms of the keyword are displayed as a list.
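The sorting and list display can be sketched as follows. The row layout (term, kind, score), the ordering of synonyms before correlated terms, and the function name `build_result_list` are assumptions made for illustration; the patent only states that synonyms and terms above the predetermined value are listed.

```python
def build_result_list(synonyms, relevance, threshold):
    """List synonyms first, then terms with relevance above the threshold."""
    rows = [(term, "synonym", None) for term in synonyms]
    related = [
        (term, score)
        for term, score in relevance.items()
        if score > threshold and term not in synonyms
    ]
    # Highest relevance first; ties broken alphabetically for stable output.
    related.sort(key=lambda item: (-item[1], item[0]))
    rows.extend((term, "related", score) for term, score in related)
    return rows
```

A call such as `build_result_list(["Enterprise Resource Planning"], {"CRM": 0.8, "IoT": 0.2}, 0.5)` keeps the synonym and "CRM" and drops "IoT".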

The related terms can also be presented graphically, in association graph step (77): the operation interface (2) displays the terms that are synonymous with the keyword and the terms whose correlation is greater than the predetermined value, using different colors according to the strength of the correlation. Specifically, the keyword is placed at the center of the screen. Related terms that are synonymous with the keyword are displayed on a dark background and are each connected to the keyword by a line. Terms that are highly correlated with the keyword are displayed on a light background and connected to the keyword by a line. Terms with lower correlation are displayed as black text on a white background and are not connected to the keyword.
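The display rules of the association graph can be expressed as a small classification, sketched below. The cutoff `high_cutoff` separating highly correlated terms from less correlated ones is an assumption, because the patent does not state where that boundary lies; the dictionary keys and function names are likewise illustrative.

```python
def classify_node(term, synonyms, relevance, high_cutoff):
    """Pick background and link status for one term in the association graph."""
    if term in synonyms:
        return {"term": term, "background": "dark", "linked": True}
    if relevance.get(term, 0.0) >= high_cutoff:
        return {"term": term, "background": "light", "linked": True}
    # Lower correlation: black text on white, no line to the keyword.
    return {"term": term, "background": "white", "linked": False}


def build_graph(keyword, synonyms, relevance, high_cutoff):
    """Return the styled nodes and the edges that connect them to the keyword."""
    terms = list(synonyms) + [t for t in relevance if t not in synonyms]
    nodes = [classify_node(t, synonyms, relevance, high_cutoff) for t in terms]
    edges = [(keyword, node["term"]) for node in nodes if node["linked"]]
    return nodes, edges
```

Keeping the styling decision separate from layout means the same output can feed any drawing library; the keyword itself is the implicit center node.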

It should be added that in the importance evaluation step (74) the user can choose the data source to be evaluated, selecting one of patent documents, theses, journals, or magazines as the basis for the evaluation. The related terms generated can thus be matched more closely to the field in which the user actually works.

In the related vocabulary generation system (10) of this embodiment, the synonym comparison unit (51) obtains the synonyms of the keyword by table lookup. After the term-frequency unit (53) and the TF-IDF unit (52) have determined the importance and the discriminating power of each term, the correlation module (6) further computes the prior probability between each term and the keyword to evaluate how likely each term is to appear together with the keyword. The terms are then ranked by the resulting correlation; that is, once the strength of the correlation between each term and the keyword is known, it is used as the criterion for generating related terms, which improves the reliability of the related terms.

In addition to presenting related terms as a list, the operation interface (2) can present the associations between terms graphically and indicate the strength of the correlation by color depth, which makes it easy for the user to see the relationship between each term and the keyword.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit its features. Any re-creation that uses the related technical means and underlying principles of the invention still falls within the scope of the present invention. The drawings and description of the present invention are therefore not intended to limit the scope of its implementation; equivalent changes and modifications made by anyone skilled in the art without departing from the spirit and scope of the present invention should all be covered by the patent scope of the present invention.


Claims

A related vocabulary generation system, comprising: an operation interface for inputting a keyword; a document database storing a plurality of document files; a data parsing module that normalizes the data stored in the document database; a vocabulary analysis module that performs term comparison and importance evaluation based on the data output by the data parsing module and on the keyword, the vocabulary analysis module having a synonym comparison unit that identifies terms synonymous with the keyword according to a synonym table covering at least one different language, the synonym table including at least one Chinese-English table; and a correlation module that performs correlation and judgment operations, calculates the correlation between the data output by the vocabulary analysis module and the keyword, and drives the operation interface to display terms whose correlation is greater than or equal to a predetermined value; wherein the correlation module has a temporary storage unit capable of storing data and a prior-probability computation unit, the prior-probability computation unit tallying and calculating the co-occurrence correlation between terms and using the Apriori algorithm to evaluate, given that a specific term appears, the probability that another term appears with it, thereby evaluating the correlation between terms, which is further compared with a predetermined value so that the terms whose correlation with the keyword is greater than or equal to the predetermined value are displayed together with the synonymous terms; and wherein the operation interface displays the terms that are synonymous with the keyword and the terms whose correlation is greater than the predetermined value in different colors according to the strength of the correlation, the keyword being located at the center of the screen, the related terms synonymous with the keyword being displayed on a dark background and each connected to the keyword by a line, the terms highly correlated with the keyword being displayed on a light background and connected to the keyword by a line, and the terms with lower correlation being displayed as black text on a white background and not connected to the keyword.

The related vocabulary generation system of claim 1, wherein the vocabulary analysis module has a term frequency-inverse document frequency (TF-IDF) unit and a term-frequency unit, the TF-IDF unit calculating term importance using the TF-IDF algorithm, and the term-frequency unit calculating term importance from the occurrence frequency of the term.

The related vocabulary generation system of claim 1, wherein the document database has a patent data unit storing patent documents, a journal data unit storing journal documents, and a thesis data unit storing master's and doctoral theses.

The related vocabulary generation system of claim 1, wherein the vocabulary analysis module further has a natural vocabulary recognition unit that recognizes modifiers commonly used in natural language.

A related vocabulary generation method, comprising the following steps: a data parsing step of normalizing document data; a keyword step of inputting a keyword; a synonym comparison step of looking up at least one term synonymous with the keyword according to a synonym table covering at least one different language, the synonym table including at least one Chinese-English table; an importance evaluation step of calculating the importance of each term from the data obtained in the data parsing step; a correlation step of tallying and calculating the correlation between each term and the keyword from the data obtained in the data parsing step and the importance evaluation step, in which a prior-probability computation unit tallies and calculates the co-occurrence correlation between terms and uses the Apriori algorithm to evaluate, given that a specific term appears, the probability that another term appears with it, thereby evaluating the correlation between terms, which is further compared with a predetermined value so that the terms whose correlation with the keyword is greater than or equal to the predetermined value are displayed together with the synonymous terms; a list display step of displaying the terms that are synonymous with the keyword and the terms whose correlation is greater than the predetermined value; and an association graph step of presenting the related terms graphically, in which the terms that are synonymous with the keyword and the terms whose correlation is greater than the predetermined value are displayed in different colors according to the strength of the correlation, the keyword being located at the center of the screen, the related terms synonymous with the keyword being displayed on a dark background and each connected to the keyword by a line, the terms highly correlated with the keyword being displayed on a light background and connected to the keyword by a line, and the terms with lower correlation being displayed as black text on a white background and not connected to the keyword.

The related vocabulary generation method of claim 5, wherein in the importance evaluation step the importance is calculated from the number of occurrences of each term, the importance being proportional to the number of occurrences.

The related vocabulary generation method of claim 5, wherein in the importance evaluation step the term importance is calculated using the term frequency-inverse document frequency (TF-IDF) algorithm.