TWI712949B - Method for calculating a semantic similarity - Google Patents

Method for calculating a semantic similarity

Info

Publication number: TWI712949B
Application number: TW108118443A
Authority: TW (Taiwan)
Prior art keywords: word, sentence, words, feature, variable
Other languages: Chinese (zh)
Other versions: TW202044103A (en)
Inventors: 黃本聰, 陳建亨
Original Assignee: 雲義科技股份有限公司
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed): 2019-05-28
Filing date: 2019-05-28
Application filed by 雲義科技股份有限公司
Priority to TW108118443A priority Critical patent/TWI712949B/en
Publication of TW202044103A (en): 2020-12-01
Application granted
Publication of TWI712949B (en): 2020-12-11

Abstract

A method for calculating a semantic similarity includes the steps of: inputting a sentence to be parsed; removing pleonasms from the sentence according to pleonasms preset for a label; performing antonym identification on the sentence according to antonyms preset for the label; substituting synonyms for words of the sentence according to synonyms preset for the label; performing characteristic-word identification on the sentence according to characteristic words preset for the label to obtain a standard sentence; and calculating a similarity between the standard sentence and the label to output a response sentence semantically corresponding to the standard sentence.

Description

Semantic similarity calculation method

The present invention relates to a similarity calculation method, and more particularly to a semantic similarity calculation method.

With the rapid advance of technology, communication between humans and smart electronic devices is now carried out through voice, the most natural and convenient channel, and in recent years robots whose main appeal is interaction have been released one after another.

One of the better-known human-computer interaction techniques is to build into the robot a dialogue database prepared in advance for the utterances or questions a user is likely to express; when the robot receives a voice message, it compares the message against the built-in dialogue database to identify its meaning and carry on the interactive conversation. However, to achieve true two-way interactive communication the device needs an enormous amount of dialogue data. Building such a dialogue database purely by hand not only consumes a great deal of time and labor but also increases the memory space needed to store the database, and if the database is not continuously expanded and updated after it is built, the user will lose interest in the robot after only a few exchanges.

Another technique is deep learning with neural networks. In practice, neural networks are mostly implemented on supercomputers or single-chip systems. With a single-chip system, the same set of circuits plays the role of different computing layers of a multilayer artificial neural network at different points in time. The more layers the network has, the more complex the functions (that is, the more complex the decision rules) it can approximate; however, as the number of layers grows, the number of neurons required by the whole network increases substantially, which imposes a huge hardware cost, and the amount of input data, learnable parameters, and computation results at each layer is also very considerable, far beyond what an ordinary enterprise can afford.

The above shortcomings all reflect the problems that arise when conventional human-computer interaction techniques are used. Given the current state of artificial intelligence, autonomous human-computer interaction is still difficult to achieve; after all, language is a cultural product of long-term human learning and accumulated experience. How to use a limited dialogue database while quickly capturing and analyzing the user's meaning is therefore an important topic.

In view of this, the purpose of the present invention is to provide a semantic similarity calculation method that includes the following steps.

A sentence to be parsed is input, and pleonasm removal is performed on the sentence using the pleonasms preset for each label. Words are extracted from the sentence and checked against the antonyms preset for each label. Words of the sentence are replaced with the similar words preset for each label. The sentence is then checked against the feature words preset for each label to obtain a semantically parsed rule sentence, and a similarity is calculated between the rule sentence and the label to output a response sentence corresponding to the semantics of the rule sentence.
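
To make the data the method works with concrete, the following is a minimal sketch, in Python, of how a label (標示詞) and its preset word lists could be organized; the Label class, its field names, and the example values are assumptions chosen only for illustration and are not prescribed by the invention.

```python
from dataclasses import dataclass, field

@dataclass
class Label:
    """One rule entry: a label together with its preset word lists and responses.
    Field names are illustrative; the patent only says which kinds of preset
    words a label carries, not how they are stored."""
    pattern: str                                              # e.g. "你喜歡{xq0}{xq1}"
    pleonasms: list[str] = field(default_factory=list)        # removable filler words
    antonyms: list[str] = field(default_factory=list)         # words of opposite meaning
    synonyms: dict[str, str] = field(default_factory=dict)    # surface form -> canonical form
    constant_features: list[str] = field(default_factory=list)  # must appear literally
    variable_features: dict[str, list[str]] = field(default_factory=dict)  # slot -> associated words
    responses: list[str] = field(default_factory=list)        # templates with {xq...} slots

example_label = Label(
    pattern="你喜歡{xq0}{xq1}",
    pleonasms=["相對而言", "比較", "看"],
    antonyms=["明星"],
    synonyms={"喜愛": "喜歡", "偏愛": "喜歡"},
    constant_features=["你喜歡"],
    variable_features={"xq0": ["籃球", "足球", "舞蹈"], "xq1": ["籃球", "足球", "跳舞"]},
    responses=["我比較喜歡{xq1}些", "這兩種運動我都喜歡!", "{xq0}跟{xq1}都是很不錯的運動!"],
)
```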

Another technical feature of the present invention is that the sentence is first checked against the constant feature words preset for each label and then against the variable feature words preset for each label; the feature words of a label include at least one constant feature word, at least one variable feature word, or a combination thereof, and each variable feature word has a plurality of associated feature words related to that variable feature word.

A further technical feature of the present invention is that the sentence is first checked against the constant feature words preset for each label and then against the variable feature words preset for each label; the feature words of the label include at least one constant feature word, at least one variable feature word, or a combination thereof, each variable feature word has a plurality of associated feature words related to that variable feature word, and the variable feature words are in an intersection relationship with one another.

Yet another technical feature of the present invention is that the aforesaid variable feature words are arranged in a sequential order.

Another technical feature of the present invention is that the constant feature words are checked first, and the variable feature words are checked afterwards.

A further technical feature of the present invention is that the rule sentence is obtained only when the sentence matches both the constant feature words and the variable feature words of the label being checked.

Yet another technical feature of the present invention is that the constant feature words or the variable feature words of the label from which the response sentence is drawn are each provided with at least one constant response feature word, at least one variable response feature word, or a combination thereof, and the constant response feature words and the variable response feature words are set in an order corresponding to that of the constant feature words and the variable feature words.

Another technical feature of the present invention is that the similarity calculation checks the feature words according to the sequential order of the constant feature words and the variable feature words.

A further technical feature of the present invention is that, after the similarity between the rule sentence and the label has been calculated, the response sentence is checked against a preset matching-degree threshold, and only response sentences exceeding the threshold are retained and output.

Yet another technical feature of the present invention is that, when the sentence cannot be matched with any label in the rule base, the words of the sentence are matched against the labels of a broad rule base to obtain a broad response sentence derived from the words of the sentence.

The beneficial effect of the present invention is that, by providing each label with at least one constant feature word, at least one variable feature word, or a combination of the two, and by further providing each variable feature word with associated feature words related to it, the feature-word check can cope with the diverse ways users phrase their sentences, and the correspondingly set variable response feature words allow many different answers. Besides reducing the labor needed to set up labels and the computation time of the computer, this greatly increases the flexibility of human-computer interaction to meet the needs of different fields and occasions.

91~97‧‧‧Steps

Fig. 1 is a schematic flowchart illustrating a preferred embodiment of the semantic similarity calculation method of the present invention.

The patented features and technical content of the present invention will become clear in the following detailed description of the preferred embodiment with reference to the drawing.

Referring to Fig. 1, a preferred embodiment of the semantic similarity calculation method of the present invention is suitable for parsing the meaning of what a user says while communicating with a robot and for producing a corresponding response. The method includes the following steps.

First, in step 91, a sentence to be parsed is input, and pleonasm removal is performed on the sentence using the pleonasms preset for each label. Pleonasm removal means deleting superfluous words from the sentence; the preset pleonasms may number zero or more, for example colloquial, meaningless words such as 請問 (may I ask), 假如 (if), 比如 (for example), and 像是 (such as). The input sentence may come from the user talking directly to the robot or from captured speech; converting speech to text or text to speech is not the technical focus of the present invention and is not described further here. The label field of Table 1 below, 你喜歡{xq0}{xq1} (Do you like {xq0}{xq1}), is used to explain this embodiment, and its pleonasm field lists preset removable words such as 相對而言 (relatively speaking), 比較 (comparatively), and 看 (look).
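
As a rough sketch of the pleonasm removal in step 91 (the function name and the longest-match-first deletion strategy are assumptions of this example; the patent does not prescribe a matching strategy):

```python
def remove_pleonasms(sentence: str, pleonasms: list[str]) -> str:
    """Delete every preset filler word from the sentence, longest first,
    so that e.g. '相對而言' is removed before any shorter substring of it."""
    for word in sorted(pleonasms, key=len, reverse=True):
        sentence = sentence.replace(word, "")
    return sentence

# e.g. remove_pleonasms("相對而言你比較喜歡籃球還是足球", ["相對而言", "比較", "看"])
# -> "你喜歡籃球還是足球"
```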

Next, in step 92, words are extracted from the sentence and checked against the antonyms preset for each label. The antonyms are defined as words whose meaning is opposite to that of the label, and the preset antonyms may number zero or more; for example, antonyms of 願意 (willing) in a label may be 不願意 (unwilling), 不愛 (does not love), 不想 (does not want), 不需要 (does not need), 不要 (do not), and so on. If a preset antonym of the label appears in this step, the sentence is taken to mean something different from the label, and checking moves on to other labels. The preset labels may form a label vocabulary for a specific field or occasion, such as a hospital, school, amusement park, or department store, to serve customer-service inquiries for that occasion. The antonym preset for the label in Table 1 below is 明星 (celebrity).
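
A minimal sketch of the antonym check in step 92; treating it as a substring test against the label's preset antonym list is an assumption of this illustration.

```python
def conflicts_with_label(sentence: str, antonyms: list[str]) -> bool:
    """Return True if any preset antonym of the label occurs in the sentence,
    meaning the sentence cannot carry the label's meaning and the remaining
    labels should be checked instead."""
    return any(word in sentence for word in antonyms)

# e.g. conflicts_with_label("你不喜歡足球嗎", ["不喜歡"])  -> True
```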

Table 1. Example rule for the label 你喜歡{xq0}{xq1} (Do you like {xq0}{xq1}):
Pleonasms (removable words): 相對而言 (relatively speaking), 比較 (comparatively), 看 (look)
Antonym: 明星 (celebrity)
Similar words: 喜愛 (love), 偏愛 (prefer) for 喜歡 (like); 國父 (the founding father) for 孫中山 (Sun Yat-sen)
Constant feature word: 你喜歡 (you like)
Variable feature words {xq0}, {xq1}: 籃球 (basketball), 舞蹈 (dance), 跳舞 (dancing), ...
Response sentences: 我比較喜歡{xq1}些 (I prefer {xq1}); 這兩種運動我都喜歡! (I like both of these sports!); {xq0}跟{xq1}都是很不錯的運動! ({xq0} and {xq1} are both great sports!)

Then, in step 93, words of the sentence are replaced with the similar words preset for each label. The similar words are defined per label, and the preset similar words may number zero or more; for example, the similar-word field for 喜歡 (like) in the label of Table 1 lists 喜愛 (love), 偏愛 (prefer), and so on, or 國父 (the founding father) for 孫中山 (Sun Yat-sen). In practice, step 93 (similar-word replacement) may also be carried out before step 92 (antonym check); the order is not limited to the one described here.
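
A minimal sketch of the similar-word replacement in step 93, again assuming a simple longest-match-first substring rewrite toward the label's own wording:

```python
def substitute_synonyms(sentence: str, synonyms: dict[str, str]) -> str:
    """Rewrite each preset similar word to its canonical form so that the
    later feature-word check sees the label's own wording."""
    for surface, canonical in sorted(synonyms.items(), key=lambda kv: len(kv[0]), reverse=True):
        sentence = sentence.replace(surface, canonical)
    return sentence

# e.g. substitute_synonyms("你偏愛籃球還是足球", {"偏愛": "喜歡", "喜愛": "喜歡"})
# -> "你喜歡籃球還是足球"
```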

Next, in step 94, the sentence is checked against the constant feature words preset for each label. In Table 1 the constant feature word field is preset to 你喜歡 (you like); if 你喜歡 appears, the sentence is taken to share the label's meaning and the method proceeds to the next step, otherwise checking moves on to other labels.

Then, in step 95, the sentence is checked against the variable feature words preset for each label to obtain a semantically parsed rule sentence. The feature words of the label include at least one constant feature word, at least one variable feature word, or a combination thereof, and each variable feature word has a plurality of associated feature words related to it. A feature word may be a noun, a verb, a gerund, or an adjective; the variable feature word field of Table 1 is preset with 籃球 (basketball), 舞蹈 (dance), 跳舞 (dancing), and so on.

Further, the variable feature words are in an intersection relationship with one another and are arranged in a sequential order, as with {xq0}{xq1} in the label of Table 1. In the process the constant feature words are checked first and the variable feature words afterwards, and the rule sentence is obtained only when the sentence matches both the constant feature words and the variable feature words of the label being checked.
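
Steps 94 and 95 can be sketched together as below: the constant feature words are required literally, then each variable slot is filled, in its preset order, by one of its associated feature words. The slot-dictionary return convention and the left-to-right position constraint are assumptions of this sketch.

```python
def check_feature_words(sentence, constant_features, variable_features):
    """Step 94: every constant feature word must appear literally.
    Step 95: each variable slot, taken in order, must be filled by one of its
    associated feature words found in the sentence. Returns the slot bindings
    on success, or None if this label does not match."""
    if not all(word in sentence for word in constant_features):
        return None                       # constant check failed: try other labels
    bindings, cursor = {}, 0
    for slot, candidates in variable_features.items():    # slots kept in {xq0}, {xq1}, ... order
        for word in candidates:
            pos = sentence.find(word, cursor)
            if pos != -1 and word not in bindings.values():
                bindings[slot] = word
                cursor = pos + len(word)  # later slots must come after earlier ones
                break
        else:
            return None                   # this slot could not be filled: no match
    return bindings

# e.g. check_feature_words("你喜歡籃球還是足球",
#                          ["你喜歡"],
#                          {"xq0": ["籃球", "足球"], "xq1": ["籃球", "足球"]})
# -> {"xq0": "籃球", "xq1": "足球"}
```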

Next, in step 96, the similarity between the rule sentence and the label is calculated so as to output a response sentence corresponding to the semantics of the rule sentence. The response sentence field of Table 1 is preset with: 我比較喜歡{xq1}些 (I prefer {xq1}), 這兩種運動我都喜歡! (I like both of these sports!), and {xq0}跟{xq1}都是很不錯的運動! ({xq0} and {xq1} are both great sports!). When {xq0} and {xq1} are 籃球 (basketball) and 足球 (football) respectively, the response sentences become: 我比較喜歡足球些 (I prefer football), 這兩種運動我都喜歡! (I like both of these sports!), and 籃球跟足球都是很不錯的運動! (Basketball and football are both great sports!).

In step 96, the constant feature words and the variable feature words of the label from which the response sentence is drawn are each provided with at least one constant response feature word, at least one variable response feature word, or a combination thereof, and the constant response feature words and the variable response feature words are set in an order corresponding to that of the constant feature words and the variable feature words. Note in particular that the similarity calculation checks the feature words according to the sequential order of the constant feature words and the variable feature words.
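
A minimal sketch of how the slot bindings found in step 95 might be substituted into the preset response templates; using regular-expression replacement of the {xq...} placeholders is an assumption of this example.

```python
import re

def fill_responses(templates: list[str], bindings: dict[str, str]) -> list[str]:
    """Substitute each {xq...} variable response slot with the feature word
    bound to it, preserving the preset order of the templates."""
    def fill(template: str) -> str:
        return re.sub(r"\{(\w+)\}", lambda m: bindings.get(m.group(1), m.group(0)), template)
    return [fill(t) for t in templates]

# e.g. fill_responses(["我比較喜歡{xq1}些", "{xq0}跟{xq1}都是很不錯的運動!"],
#                     {"xq0": "籃球", "xq1": "足球"})
# -> ["我比較喜歡足球些", "籃球跟足球都是很不錯的運動!"]
```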

Finally, in step 97, after the similarity between the rule sentence and the label has been calculated, the response sentence is checked against a preset matching-degree threshold; response sentences whose score exceeds the threshold are retained and output, and those that do not are not output. The similarity threshold is generally set at roughly 70~80%. Although pleonasm removal is performed at the very beginning, useless residual words may not have been removed completely, so the preset matching-degree threshold ensures the correctness of the calculated result.
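
A minimal sketch of the threshold check in step 97; the character-level ratio from Python's difflib stands in for whichever similarity measure an implementation actually uses, and the 0.75 default merely reflects the 70~80% range mentioned above.

```python
from difflib import SequenceMatcher

def passes_threshold(rule_sentence: str, label_pattern: str, bindings: dict[str, str],
                     threshold: float = 0.75) -> bool:
    """Compare the parsed rule sentence with the label (slots filled in) and
    keep the response only if the similarity exceeds the preset threshold."""
    filled_label = label_pattern
    for slot, word in bindings.items():
        filled_label = filled_label.replace("{" + slot + "}", word)
    similarity = SequenceMatcher(None, rule_sentence, filled_label).ratio()
    return similarity > threshold

# e.g. passes_threshold("你喜歡籃球足球", "你喜歡{xq0}{xq1}", {"xq0": "籃球", "xq1": "足球"})
# -> True (the similarity is 1.0 here)
```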

In particular, when the sentence cannot be matched with any preset label, that is, when one of the antonym check, the feature-word check, or the similarity calculation fails to match, the characteristic words of the sentence are matched against the label vocabulary of a broad rule base to obtain a broad response sentence derived from the characteristic words of the sentence.
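
Tying the sketches above together, the fallback to the broad rule base could look roughly as follows; the answer function, the iteration order over labels, and the broad_rule_base.lookup call are all assumptions of this example rather than parts of the patented method.

```python
def answer(sentence, rule_base, broad_rule_base):
    """Try the domain rule base label by label; if no label matches, fall back
    to the broad rule base keyed on the sentence's own words."""
    for label in rule_base:
        cleaned = remove_pleonasms(sentence, label.pleonasms)
        if conflicts_with_label(cleaned, label.antonyms):
            continue                                   # antonym hit: wrong label
        cleaned = substitute_synonyms(cleaned, label.synonyms)
        bindings = check_feature_words(cleaned, label.constant_features, label.variable_features)
        if bindings is None:
            continue                                   # feature-word check failed
        if passes_threshold(cleaned, label.pattern, bindings):
            return fill_responses(label.responses, bindings)
    # no domain label matched: match the sentence's words against the broad rule base
    return broad_rule_base.lookup(sentence)
```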

Suppose the semantic intents are X1, X2, X3, ..., and that X1 has n ways of being expressed, X11, X12, ..., X1n, and likewise for X2, X3, and so on. Suppose the user's question or sentence to be parsed is Y1. To decide whether Y1 expresses the meaning of X1, the conventional practice is to compute the similarity between Y1 and each of X11, X12, X13, ..., X1n separately, which requires building n labels, namely X11, X12, X13, ..., X1n, at a cost of construction time and labor. The present invention avoids the cumbersome need to build n labels: a single label stands for the n expressions X11, X12, ..., X1n, so that using X11 as the label to represent X11-X1n, X21-X2n, X31-X3n, ..., up to Xmn covers m semantic intents and yields m x n expressions. Besides saving the time needed to build the rule base, this shortens computation time and improves the responsiveness of human-computer interaction.
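
As a purely illustrative count (the numbers are chosen only for this example): with m = 10 semantic intents, each expressible in n = 50 ways, the conventional approach builds 10 x 50 = 500 labels, whereas one variable-slot label per intent needs only 10 labels while still covering all 500 expressions, and each input sentence is compared against 10 labels instead of 500.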

First, take the antonym check as an example. The label is 你喜歡足球嗎 (Do you like football?). If the input sentence is 你喜歡足球嗎, its calculated similarity to the label is extremely high; conversely, if the sentence is 你不喜歡足球嗎 (Don't you like football?), its similarity is also very high, even though the two sentences differ only by the single word 不 (not) while their meanings are completely different and opposite. Through the antonym-check step, the input sentence 你不喜歡足球嗎 is therefore judged at this step not to match the label 你喜歡足球嗎 and is compared against other labels instead, without any of the subsequent replacement, checking, or analysis steps for the label 你喜歡足球嗎, which saves analysis time and improves analysis accuracy.

Next, take an example in which the label includes no variable feature words: the label X11 is 你{喜歡}{足球}嗎 (Do you {like} {football}?), where 喜歡 and 足球 in braces are constant feature words. Language can be expressed in many ways, for example X12 你愛踢足球嗎 (Do you love playing football?), X13 你愛看足球嗎 (Do you love watching football?), X14 你喜不喜歡足球 (Do you like football or not?), X15 老實說，你喜歡看足球比賽嗎 (Honestly, do you like watching football matches?), and X16 你不喜歡足球嗎 (Don't you like football?). When sentences X12 to X16 are input, 看 (watching) in X13 and 老實說 (honestly) in X15 are preset, meaningless pleonasms and are removed in step 91; 愛踢 (love playing) in X12, 愛看 (love watching) in X13, 喜不喜歡 (like or not) in X14, and 喜歡看 (like watching) in X15 are preset similar words of the label's 喜歡 and are replaced with 喜歡 in step 93, and 你不喜歡足球嗎 in X16 is a preset similar word of the label 你喜歡足球嗎 and is likewise replaced in step 93.

Then the feature-word check is performed. In this step the two feature words 喜歡 and 足球 must both be present, and the feature word 喜歡 must be found before the feature word 足球 is checked. Afterwards X12 changes from 你愛踢足球嗎 to 你喜歡足球嗎 and undergoes similarity calculation; X13 changes from 你愛看足球嗎 to 你喜歡足球嗎; X14 changes from 你喜不喜歡足球 to 你喜歡足球; X15 changes from 老實說，你喜歡看足球比賽嗎 to 你喜歡足球; and X16 changes from 你不喜歡足球嗎 to 你喜歡足球嗎, each then undergoing similarity calculation. With this processing flow, X12 to X16 all obtain very high similarity values when the similarity is calculated and are thus recognized as having the same meaning as the label (X11).

In addition, take an example in which the label includes one variable feature word. When the label is 你喜歡{足球/籃球/排球/瑜珈…}嗎 (Do you like {football/basketball/volleyball/yoga…}?), the braces hold one variable feature word, which is set with m associated feature words related to 足球, such as 籃球 (basketball), 排球 (volleyball), 瑜珈 (yoga), and so on. The variable response feature word of the response sentence is set to correspond to this variable feature word, with the braces holding one variable response feature word; the response sentence can be designed as 我最愛{踢足球/籃球/排球/瑜珈…}了 (I love {playing football/basketball/volleyball/yoga…} the most), {足球/籃球/排球/瑜珈…}也是我喜歡的項目之一 ({football/basketball/volleyball/yoga…} is also one of my favorites), or {足球/籃球/排球/瑜珈…}有點難度，我沒有很喜歡 ({football/basketball/volleyball/yoga…} is a bit difficult, I don't like it very much), among other designed answers.

Further, when the input sentence is 你喜歡籃球嗎 (Do you like basketball?), the feature-word check matches the associated feature word 籃球 of the variable feature word, the sentence becomes 你喜歡籃球嗎, and the subsequent similarity calculation yields a very high similarity. Finally, the answer given by the response sentence may be 我最愛籃球了 (I love basketball the most), 籃球也是我喜歡的項目之一 (Basketball is also one of my favorites), or 籃球有點難度，我沒有很喜歡 (Basketball is a bit difficult, I don't like it very much), among other varied answers. If the input sentence is 你喜歡排球嗎 (Do you like volleyball?), the corresponding answer of the response sentence is 排球有點難度，我沒有很喜歡 (Volleyball is a bit difficult, I don't like it very much).

In summary, the semantic similarity calculation method of the present invention provides each label with at least one constant feature word, at least one variable feature word, or a combination of the two, and further provides each variable feature word with associated feature words related to it, so that the feature-word check can cope with the many ways a sentence may be phrased, and the correspondingly set variable response feature words allow many different answers. Besides reducing the labor needed to set up labels and the computation time of the computer, this greatly increases the flexibility of human-computer interaction to meet the needs of different fields and occasions, and thus the purpose of the present invention is indeed achieved.

However, the above is merely a preferred embodiment of the present invention and cannot be used to limit the scope of its implementation; all simple equivalent changes and modifications made according to the scope of the patent claims and the description of the invention still fall within the scope covered by the patent of the present invention.

91~97‧‧‧Steps

Claims (7)

1. A method for calculating semantic similarity, comprising the following steps: inputting a sentence to be parsed and performing pleonasm removal on the sentence using the pleonasms preset for each label; extracting words from the sentence and checking them against the antonyms preset for each label; replacing words of the sentence with the similar words preset for each label; checking the sentence against the feature words preset for each label to obtain a semantically parsed rule sentence, wherein, during the feature-word check, the sentence is first checked against the constant feature words preset for each label and then against the variable feature words preset for each label, the feature words of a label comprise at least one constant feature word, at least one variable feature word, or a combination thereof, each variable feature word has a plurality of associated feature words related to that variable feature word, and the plurality of variable feature words are arranged in a sequential order; and calculating a similarity between the rule sentence and the label to output a response sentence corresponding to the semantics of the rule sentence.

2. The method for calculating semantic similarity according to claim 1, wherein, during the feature-word check of the sentence, the sentence is first checked against the constant feature words preset for each label and then against the variable feature words preset for each label, the feature words of the label comprise at least one constant feature word, at least one variable feature word, or a combination thereof, each variable feature word has a plurality of associated feature words related to that variable feature word, and the plurality of variable feature words are in an intersection relationship with one another.

3. The method for calculating semantic similarity according to claim 2, wherein, during the feature-word check of the sentence, the rule sentence is obtained only when the sentence matches both the constant feature words and the variable feature words of the label being checked.

4. The method for calculating semantic similarity according to claim 2, wherein, during the similarity calculation, the constant feature words or the variable feature words of the label from which the response sentence is drawn are each provided with at least one constant response feature word, at least one variable response feature word, or a combination thereof, and the constant response feature words and the variable response feature words are set in an order corresponding to that of the constant feature words and the variable feature words.

5. The method for calculating semantic similarity according to claim 2, wherein, during the similarity calculation, the feature-word check is performed according to the sequential order of the constant feature words and the variable feature words.

6. The method for calculating semantic similarity according to claim 1, wherein, during the similarity calculation, after the similarity between the rule sentence and the label has been calculated, the response sentence is checked against a preset matching-degree threshold, and only response sentences exceeding the threshold are retained and output.

7. The method for calculating semantic similarity according to claim 1, wherein, when the sentence cannot be matched with any label in the rule base, the words of the sentence are matched against the labels of a broad rule base to obtain a broad response sentence derived from the words of the sentence.
TW108118443A 2019-05-28 2019-05-28 Method for calculating a semantic similarity TWI712949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW108118443A TWI712949B (en) 2019-05-28 2019-05-28 Method for calculating a semantic similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW108118443A TWI712949B (en) 2019-05-28 2019-05-28 Method for calculating a semantic similarity

Publications (2)

Publication Number Publication Date
TW202044103A TW202044103A (en) 2020-12-01
TWI712949B true TWI712949B (en) 2020-12-11

Family

ID=74668175

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108118443A TWI712949B (en) 2019-05-28 2019-05-28 Method for calculating a semantic similarity

Country Status (1)

Country Link
TW (1) TWI712949B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7644047B2 (en) * 2003-09-30 2010-01-05 British Telecommunications Public Limited Company Semantic similarity based document retrieval
US9672206B2 (en) * 2015-06-01 2017-06-06 Information Extraction Systems, Inc. Apparatus, system and method for application-specific and customizable semantic similarity measurement
CN107944027A (en) * 2017-12-12 2018-04-20 苏州思必驰信息科技有限公司 Create the method and system of semantic key index
CN109062892A (en) * 2018-07-10 2018-12-21 东北大学 A kind of Chinese sentence similarity calculating method based on Word2Vec

Also Published As

Publication number Publication date
TW202044103A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
Li et al. Know more say less: Image captioning based on scene graphs
US11520991B2 (en) Method, apparatus, electronic device and storage medium for processing a semantic representation model
CN109783657B (en) Multi-step self-attention cross-media retrieval method and system based on limited text space
CN111259653B (en) Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation
Haug et al. Neural multi-step reasoning for question answering on semi-structured tables
CN110851599A (en) Automatic scoring method and teaching and assisting system for Chinese composition
CN111832278B (en) Document fluency detection method and device, electronic equipment and medium
Tiwari et al. Ensemble approach for twitter sentiment analysis
Yan et al. Implicit emotional tendency recognition based on disconnected recurrent neural networks
Wahde et al. DAISY: an implementation of five core principles for transparent and accountable conversational AI
Langlet et al. Modelling user’s attitudinal reactions to the agent utterances: focus on the verbal content
Rajput et al. Big data and social/medical sciences: state of the art and future trends
WO2023169301A1 (en) Text processing method and apparatus, and electronic device
Alshammari et al. TAQS: an Arabic question similarity system using transfer learning of BERT with BILSTM
Sun et al. Cross-language multimodal scene semantic guidance and leap sampling for video captioning
TWI712949B (en) Method for calculating a semantic similarity
CN114817510B (en) Question and answer method, question and answer data set generation method and device
Malviya et al. HDRS: Hindi dialogue restaurant search corpus for dialogue state tracking in task-oriented environment
CN113468311B (en) Knowledge graph-based complex question and answer method, device and storage medium
CN114328823A (en) Database natural language query method and device, electronic equipment and storage medium
Khandait et al. Automatic question generation through word vector synchronization using lamma
Alharahseheh et al. A survey on textual entailment: Benchmarks, approaches and applications
Zhang et al. Research on answer selection based on LSTM
CN113254590A (en) Chinese text emotion classification method based on multi-core double-layer convolutional neural network
CN112101037A (en) Semantic similarity calculation method