TWI662425B

TWI662425B - Method for automatically generating semantically similar sentence samples

Info

Publication number: TWI662425B
Application number: TW107105170A
Authority: TW
Inventors: 王昊; 陳見聳; 高鵬
Original assignee: 大陸商芋頭科技(杭州)有限公司
Priority date: 2017-02-27
Filing date: 2018-02-13
Publication date: 2019-06-11
Also published as: TW201841121A; CN108509409A; WO2018153215A1

Abstract

The invention provides a method for automatically generating semantically similar sentence samples, belonging to the field of language processing technology. The method includes: obtaining a sentence sample and performing word segmentation on it; using a word vector model to obtain, for each word, a set of similar words whose semantics are close to that word; selecting one similar word from each set and replacing the corresponding word, so as to form semantically similar sentence samples; using a language model to generate, for each semantically similar sentence sample, a likelihood value representing its semantic likelihood, and sorting all semantically similar sentence samples by likelihood value from high to low; and selecting and retaining the first N semantically similar sentence samples, so that subsequent processing steps are performed on the retained samples. The beneficial effect of the above technical solution is that a large batch of semantically similar sentence samples can be generated automatically without requiring a large number of subsequent sentence sets, which saves a great deal of manual work.

Description

Method for automatically generating semantically similar sentence samples

The invention relates to the technical field of natural language processing, and in particular to a method for automatically generating semantically similar sentence samples.

In the prior art, many natural language processing tasks require large sets of semantically similar sentences or sentence patterns. These sets usually have to be written by hand, which consumes a great deal of manpower and time.

With the development of automation technology, more and more of the work of writing semantically similar sentences can be done automatically. At present, sets of semantically similar sentences are obtained in bulk mainly in the following ways:

(1) Obtaining a large batch of semantically similar sentences by retrieval. The retrieval approach finds a set of semantically similar sentences among a large number of candidate sentences by means of a retrieval query. This approach first presupposes a large set of candidate sentences. In addition, searching for and generating semantically similar sentences by retrieval places very high demands on the semantic-similarity search module: the performance of that module determines how accurate the retrieved semantically similar sentences are.

(2) Obtaining a large batch of semantically similar sentences with a sequence-to-sequence approach. This approach is currently a very active topic in academic research, but many of the sentences it generates in practical applications are not reasonable and its performance is not very good, so it lacks practicality.

In view of the above problems in the prior art, a technical solution for a method of automatically generating semantically similar sentence samples is now provided. It aims to generate large batches of semantically similar sentence samples automatically and effectively, saving a great deal of manual work.

The above technical solution specifically includes a method for automatically generating semantically similar sentence samples, applicable to natural language processing, in which a word vector model used to obtain semantically similar words, and a language model used to judge the semantic likelihood of the generated semantically similar sentence samples, are trained and formed in advance. The method further includes:
Step S1, obtaining an externally input sentence sample;
Step S2, performing word segmentation on the sentence sample to decompose it into a combination of multiple words arranged in order;
Step S3, using the word vector model to obtain, for each word in the sentence sample, a set of similar words whose semantics are close to that word;
Step S4, selecting one similar word from the set corresponding to each word and replacing that word with it, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5, determining whether the sets still contain similar words that have not yet been selected: if so, returning to step S4;
Step S6, using the language model to generate, for each semantically similar sentence sample, a likelihood value representing its semantic likelihood, and sorting all semantically similar sentence samples by likelihood value from high to low;
Step S7, selecting and retaining the first N semantically similar sentence samples, so that subsequent processing steps are performed on the retained semantically similar sentence samples.

Preferably, in the method for automatically generating semantically similar sentence samples, the types of sentence sample include: a sentence type, in which the sentence sample includes multiple words arranged in order; and a sentence-pattern type, in which the sentence sample includes multiple words and multiple word-class tags arranged in order, or includes only multiple word-class tags arranged in order. Step S1 specifically includes:
Step S11, obtaining an externally input sentence sample;
Step S12, determining the type of the sentence sample: if the sentence sample is of the sentence-pattern type, proceeding to step S13; if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13, replacing each word-class tag in the sentence sample with a high-frequency word corresponding to that word-class tag, so as to form a complete sentence sample, and then proceeding to step S2.

Preferably, in the method for automatically generating semantically similar sentence samples, a preset word segmentation method is used when the word vector model is trained and formed in advance; in step S2, the same preset word segmentation method is used to segment the sentence sample.

Preferably, in the method for automatically generating semantically similar sentence samples, in step S4 the similar word selected for the replacement has the same part of speech as the word it replaces.

Preferably, in the method for automatically generating semantically similar sentence samples, in step S6 the likelihood value of each semantically similar sentence sample is a semantic score representing the likelihood that the sample holds as a complete sentence.

Preferably, in the method for automatically generating semantically similar sentence samples, the types of semantically similar sentence sample include: a sentence type, in which the semantically similar sentence sample includes multiple words arranged in order; and a sentence-pattern type, in which the semantically similar sentence sample includes multiple words and multiple word-class tags arranged in order, or includes only multiple word-class tags arranged in order. Step S7 specifically includes:
Step S71, selecting and retaining the first N semantically similar sentence samples;
Step S72, determining whether semantically similar sentence samples of the sentence-pattern type need to be output: if so, proceeding to step S73; if not, proceeding to step S74;
Step S73, replacing the words included in the semantically similar sentence samples with the corresponding word-class tags, so as to form complete semantically similar sentence samples, and then performing the subsequent processing steps;
Step S74, performing the subsequent processing steps on the retained semantically similar sentence samples.

The beneficial effect of the above technical solution is that it provides a method for automatically generating semantically similar sentence samples which can generate large batches of such samples automatically without requiring a large number of subsequent sentence sets, saving a great deal of manual work.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

It should be noted that, where no conflict arises, the embodiments of the present invention and the features of those embodiments may be combined with one another.

The present invention is further described below with reference to the drawings and specific embodiments, which do not limit the invention.

In view of the above problems in the prior art, a method for automatically generating semantically similar sentence samples is now provided. The method is applicable to natural language processing.

In the above method, a word vector model used to obtain semantically similar words, and a language model used to judge the semantic likelihood of the generated semantically similar sentence samples, are trained and formed in advance.

As shown in FIG. 1, the above method includes:
Step S1, obtaining an externally input sentence sample;
Step S2, performing word segmentation on the sentence sample to decompose it into a combination of multiple words arranged in order;
Step S3, using the word vector model to obtain, for each word in the sentence sample, a set of similar words whose semantics are close to that word;
Step S4, selecting one similar word from the set corresponding to each word and replacing that word with it, so as to form a semantically similar sentence sample associated with the sentence sample;
Step S5, determining whether the sets still contain similar words that have not yet been selected: if so, returning to step S4;
Step S6, using the language model to generate, for each semantically similar sentence sample, a likelihood value representing its semantic likelihood, and sorting all semantically similar sentence samples by likelihood value from high to low;
Step S7, selecting and retaining the first N semantically similar sentence samples, so that subsequent processing steps are performed on the retained semantically similar sentence samples.
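The flow of steps S2 to S7 can be sketched as a single function. This is only a minimal illustration, not the patented implementation: the function name `generate_similar_sentences`, its parameters, and the toy stand-ins for the segmenter, the word vector model, and the language model are all hypothetical, and leaving a word unchanged when it has no similar words is an assumption the text does not state.

```python
from itertools import product
from typing import Callable, List, Sequence


def generate_similar_sentences(
    sentence: str,
    segment: Callable[[str], List[str]],
    similar_words: Callable[[str], Sequence[str]],
    score: Callable[[List[str]], float],
    top_n: int,
) -> List[List[str]]:
    """Steps S2 to S7 for one input sentence (S1 is the caller passing it in)."""
    words = segment(sentence)  # S2: ordered list of words
    # S3: one set of similar words per word. Assumption: a word with no
    # similar words is kept unchanged rather than dropped.
    choices = [list(similar_words(w)) or [w] for w in words]
    # S4-S5: take one word from every set until all combinations are used.
    candidates = [list(combo) for combo in product(*choices)]
    # S6: rank by the language-model value, highest first.
    candidates.sort(key=score, reverse=True)
    # S7: keep the first N.
    return candidates[:top_n]


# Toy stand-ins (hypothetical, for illustration only).
toy_sets = {"周杰倫": ["王力宏", "陶喆"]}
result = generate_similar_sentences(
    "我+要+聽+周杰倫+的+青花瓷",
    segment=lambda s: s.split("+"),
    similar_words=lambda w: toy_sets.get(w, []),
    score=lambda ws: 1.0 if "陶喆" in ws else 0.5,
    top_n=1,
)
print(result)
```

Passing the three models in as callables keeps the sketch independent of any particular segmenter, embedding tool, or language model.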

In this embodiment, the word vector model may be formed with a tool that represents words as real-valued vectors, such as Word2vec. Drawing on ideas from deep learning, such a tool reduces the processing of text content, through training, to vector operations in a K-dimensional vector space, and similarity in that vector space can be used to represent semantic similarity between texts. Word vectors here refer to the vector-space representation of words obtained while modeling a language model with a neural network. By processing a word with its word vector, the similar words of that word can be obtained from the similarity between words.

In this embodiment, the training samples used to form the word vector model may be a large amount of text data, for example text collected from different forums, which must be segmented into words before being input.

The output of the word vector model should be low-dimensional real-valued vectors representing words; each word in the training corpus should correspond to one such vector.

Such a real-valued vector can usually be written in a form like [0.792, −0.177, −0.107, 0.109, −0.542, ...], with 50 and 100 dimensions being the most common. The distance between the vectors of two words can be measured with the traditional Euclidean distance or with the cosine of the angle between them. With vectors of this kind, the distance between "麥克" (mic) and "話筒" (microphone) is far smaller than the distance between "麥克" and "天氣" (weather). For example, similarity can be computed from the cosine of the angle in order to obtain the similar words of a specified word: when the similarity between other words and the specified word is computed, the words with higher similarity are its similar words.
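The cosine-based lookup described above can be sketched as follows. The three-dimensional vectors are made-up toy values chosen only so that "麥克" lies close to "話筒" and far from "天氣"; real word vectors would come from a trained model, and the helper names `cosine_similarity` and `nearest_words` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(
    word: str, vectors: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k words most similar to `word`, highest similarity first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (invented values, not output of a trained model).
toy_vectors = {
    "麥克": [0.9, 0.1, 0.0],
    "話筒": [0.8, 0.2, 0.1],
    "天氣": [0.0, 0.1, 0.9],
}
print(nearest_words("麥克", toy_vectors, k=1))
```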

In this embodiment, the language model may be a model that computes the probability that a word sequence forms a sentence, expressed for example as P(W1, W2, ..., Wk). With a language model it is possible to determine which word sequence is more likely to be a sentence or, given several words, to predict the most likely next word. Put simply, the language model judges whether a sequence of words matches the way people speak, that is, how likely the word sequence is to be a sentence. In a preferred embodiment of the present invention, the language model may be implemented as an n-gram model.

Specifically, when the language model is trained, the input is text sentences that have been segmented into words, and the output may be the probabilities of the word combinations occurring in those sentences.

In this embodiment, the externally input sentence sample obtained in step S1 may be entered manually or may be obtained from an external database of sentence samples. The obtained sentence sample may be an entirely arbitrary sentence; it only needs to follow the most basic rules of semantics, that is, satisfy the necessary conditions for forming a sentence and read fluently.

In this embodiment, in step S2 each sentence sample is segmented separately, so that a sentence sample is decomposed into a combination of multiple words arranged in order. For example, the sentence sample "我要聽周杰倫的青花瓷" (I want to listen to Jay Chou's "青花瓷") becomes "我+要+聽+周杰倫+的+青花瓷" after segmentation. The words that matter in the subsequent steps are those with concrete meaning, such as the noun "周杰倫" (Jay Chou) and the noun "青花瓷". Furthermore, each word in the sentence sample has a corresponding word-class tag: for example, the word-class tag of "周杰倫" is "歌手" (which may be represented as "singer" during computer processing), and the word-class tag of "青花瓷" is "歌曲" (which may be represented as "song"). In this embodiment, the word-class tag may also simply be called the tag of the word.

In this embodiment, after the sentence sample has been segmented, each word is processed with the word vector model to obtain its set of similar words. A similar word here means a semantically close word that belongs to the same word class as the original word. For example, if the tag of "周杰倫" is "歌手" (singer), the similar words obtained from the word vector model for that tag might include "王力宏" (Wang Leehom), "陶喆" (David Tao), "陳奕迅" (Eason Chan), and "那英" (Na Ying), and the word vector model outputs this set. Correspondingly, if the tag of "周杰倫" is "男歌手" (male singer, which may be represented as "male-Singer" during computer processing), the similar words for that tag might include "王力宏", "陶喆", and "陳奕迅". In other words, the tag attached to a word determines that word's set of similar words.

In this embodiment, in step S4 one similar word is selected from the set corresponding to each word and replaces that word, so as to form a semantically similar sentence sample associated with the sentence sample. For example, suppose a sentence sample consists of a words arranged in order, each word has its own set of similar words, and each set contains the b words semantically closest to that word. One sentence sample may then yield b^a semantically similar sentence samples. That is, each sentence sample has its own set of semantically similar sentence samples, and multiple sentence samples yield multiple such sets, so large batches of semantically similar sentence samples can be generated automatically.
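The count of candidates is the product of the sizes of the similar-word sets, which is b multiplied by itself a times when every one of the a sets holds b words. A short sketch makes the growth concrete and shows one way to enumerate the candidates lazily. The helper names and the placeholder words are hypothetical.

```python
from itertools import product
from math import prod
from typing import Iterator, List, Sequence


def count_candidates(similar_sets: Sequence[Sequence[str]]) -> int:
    """Number of sentences obtainable by picking one word from each set."""
    return prod(len(s) for s in similar_sets)


def iter_candidates(similar_sets: Sequence[Sequence[str]]) -> Iterator[List[str]]:
    """Yield every combination without building the whole list in memory."""
    for combo in product(*similar_sets):
        yield list(combo)


# a = 2 positions, b = 3 similar words each (placeholder words).
sets = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
print(count_candidates(sets))            # 3 * 3 combinations
print(next(iter_candidates(sets)))       # first combination
```

Because the count grows exponentially with sentence length, enumerating lazily (or capping b) is one way to keep memory bounded before the ranking step.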

In this embodiment, step S5 loops over the sets of similar words; together, steps S4 and S5 generate a large batch of semantically similar sentence samples from a batch of input sentence samples.

In this embodiment, some of the generated semantically similar sentence samples may read unnaturally because they are merely similar words strung together, and so cannot enter subsequent processing as normal sentence samples. Therefore, in step S6, after the semantically similar sentence samples have been generated, the pre-trained language model is used to analyze the semantic likelihood of each of them. The result is a likelihood value for each semantically similar sentence sample, which represents how semantically reasonable the sentence is. The semantically similar sentence samples are then sorted by this value from high to low. Specifically, consider a given sentence S = W1, W2, ..., Wk, where S denotes the sentence and Wi (i = 1, 2, ..., k) denotes the i-th word in the sentence.

The likelihood value of this sentence can then be expressed as:
P(S) = P(W1, W2, ..., Wk) ~ P(W1) P(W2|W1) ... P(Wk|W1, W2, ..., Wk-1)
The probabilities in this formula, such as P(W1) and P(W2|W1), are obtained by training the language model. The language model can therefore produce a likelihood value P(S) for each sentence S, and this value can also be regarded as the semantic score of the sentence.
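As one illustration of how an n-gram model can supply these conditional probabilities, the sketch below uses a bigram model, which approximates each factor P(Wi|W1, ..., Wi-1) by P(Wi|Wi-1). The class name `BigramModel`, the add-one smoothing, the `<s>` start marker, and the two-sentence training corpus are all assumptions made for the example; the text only says that an n-gram model may be used.

```python
import math
from collections import Counter
from typing import List


class BigramModel:
    """Bigram language model with add-one smoothing (an assumed choice)."""

    def __init__(self, corpus: List[List[str]]) -> None:
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence  # "<s>" marks the sentence start
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence: List[str]) -> float:
        """Sum of log P(W_i | W_{i-1}); higher means more sentence-like."""
        tokens = ["<s>"] + sentence
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


# Tiny segmented corpus (toy data for illustration).
corpus = [
    ["我", "要", "聽", "歌"],
    ["我", "要", "聽", "周杰倫", "的", "青花瓷"],
]
model = BigramModel(corpus)
candidates = [["歌", "聽", "要", "我"], ["我", "要", "聽", "歌"]]
ranked = sorted(candidates, key=model.log_prob, reverse=True)
print(ranked[0])
```

Working in log space avoids underflow when many small probabilities are multiplied, and sorting by `log_prob` gives the same order as sorting by P(S).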

Finally, in step S7, the first N semantically similar sentence samples are selected and retained, the subsequent processing steps are performed on the retained samples, and the samples that were not retained are discarded. N may be a natural number, and its value can be set freely by the user according to the actual situation.

Specifically, for step S7, in a preferred embodiment of the present invention the first N semantically similar sentence samples may be retained for each input sentence sample. In another embodiment of the present invention, only the first N of all generated semantically similar sentence samples may be retained. The scope over which the selection is made can be set by the user as needed.
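The two selection scopes differ only in whether candidates are grouped by their input sentence before the cut. A minimal sketch, assuming each candidate is stored as a (value, sentence) pair where the value is the one produced in step S6; the function names and toy scores are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Scored = Tuple[float, str]  # (value from step S6, candidate sentence)


def top_n_per_input(groups: Dict[str, List[Scored]], n: int) -> Dict[str, List[Scored]]:
    """Keep the n best candidates separately for every input sentence."""
    return {source: heapq.nlargest(n, cands) for source, cands in groups.items()}


def top_n_overall(groups: Dict[str, List[Scored]], n: int) -> List[Scored]:
    """Keep the n best candidates across all input sentences together."""
    merged = [cand for cands in groups.values() for cand in cands]
    return heapq.nlargest(n, merged)


# Toy scores for two input sentences "s1" and "s2".
groups = {
    "s1": [(0.9, "a"), (0.2, "b")],
    "s2": [(0.5, "c"), (0.4, "d")],
}
print(top_n_per_input(groups, 1))
print(top_n_overall(groups, 2))
```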

In a preferred embodiment of the present invention, the types of input sentence sample include: a sentence type, in which the sentence sample includes multiple words arranged in order; and a sentence-pattern type, in which the sentence sample includes multiple words and word-class tags arranged in order, or includes only multiple word-class tags arranged in order. As shown in FIG. 2, step S1 then specifically includes:
Step S11, obtaining an externally input sentence sample;
Step S12, determining the type of the sentence sample: if the sentence sample is of the sentence-pattern type, proceeding to step S13; if the sentence sample is of the sentence type, proceeding directly to step S2;
Step S13, replacing each word-class tag in the sentence sample with a high-frequency word corresponding to that word-class tag, so as to form a complete sentence sample, and then proceeding to step S2.

Specifically, in this embodiment the types of sentence sample may include the sentence type and the sentence-pattern type.

The sentence type refers to a sentence made up of multiple words arranged in order. For example, "我要聽周杰倫的青花瓷" is a sentence.

The sentence-pattern type refers to a sentence that includes multiple words and word-class tags arranged in order, or only multiple word-class tags arranged in order. For example, "我要聽'歌手'的'歌曲'" (I want to listen to the 'song' of the 'singer') is a sentence pattern, in which "歌手" (singer) and "歌曲" (song) are both word-class tags.

Furthermore, a sentence sample is of the sentence-pattern type as soon as a single word-class tag appears in it. For example, "我要聽周杰倫的'song'" is a sentence sample of the sentence-pattern type.

In this embodiment, a sentence sample of the sentence type can proceed to step S2 for subsequent operations without any processing.

For a sample of the sentence-pattern type, the word-class tags it contains must first be replaced with words corresponding to those tags to form a complete sentence, which is then passed to step S2 for subsequent processing.

Specifically, in step S13 the word-class tags in a sentence sample judged to be of the sentence-pattern type are replaced with high-frequency words under those tags, so as to form a complete sentence sample. A high-frequency word is a word that, according to statistical data, occurs many times and is used frequently under a given word-class tag. Replacing the word-class tags in a sentence-pattern sample with such high-frequency words yields a reasonable and complete sentence sample.
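Steps S12 and S13 can be sketched as a type check followed by a lookup. The angle-bracket notation for tags (`<singer>`, `<song>`), the helper names, and the choice of "周杰倫" and "青花瓷" as the high-frequency words are all assumptions made for the example; the text does not specify how tags are encoded or which words are most frequent.

```python
from typing import Dict, List


def is_tag(token: str) -> bool:
    """Assumed convention: a tag is written as <name>."""
    return token.startswith("<") and token.endswith(">")


def is_sentence_pattern(tokens: List[str]) -> bool:
    """S12: one tag is enough to make the sample a sentence pattern."""
    return any(is_tag(token) for token in tokens)


def fill_tags(tokens: List[str], high_freq: Dict[str, str]) -> List[str]:
    """S13: replace every tag with the high-frequency word recorded for it."""
    return [high_freq[token] if is_tag(token) else token for token in tokens]


# Hypothetical table mapping each tag to its high-frequency word.
HIGH_FREQ = {"<singer>": "周杰倫", "<song>": "青花瓷"}

pattern = ["我", "要", "聽", "<singer>", "的", "<song>"]
if is_sentence_pattern(pattern):
    print(fill_tags(pattern, HIGH_FREQ))
```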

In a preferred embodiment of the present invention, a preset word segmentation method is used when the word vector model is trained and formed in advance;

in step S2, the same preset word segmentation method is then used to segment the sentence sample.

Specifically, in this embodiment the sentence sample is segmented with the same word segmentation method that was used to train the word vector model. This reduces the number of words in later processing steps that fall outside the set covered by the word vector model, and so helps improve the final result.

In a preferred embodiment of the present invention, the preset word segmentation method may be forward maximum matching against a large dictionary: take the m leftmost characters of the sentence to be segmented as the matching field, where m is the length of the longest word in the large dictionary; look the field up in the large dictionary; if the match succeeds, split the matched field off as a word; if the match fails, drop the last character of the matching field, use the remaining string as the new matching field, and match again; repeat this process until all words have been split off.
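Forward maximum matching as described above can be sketched in a few lines. The function name and the small dictionary are hypothetical, and emitting a single character as its own word when nothing in the dictionary matches is an assumption, since the text does not say what happens in that case.

```python
from typing import List, Set


def forward_max_match(sentence: str, dictionary: Set[str]) -> List[str]:
    """Segment left to right, always taking the longest dictionary match."""
    max_len = max(len(word) for word in dictionary)  # m in the description
    words: List[str] = []
    start = 0
    while start < len(sentence):
        size = min(max_len, len(sentence) - start)
        # Drop the last character until the field matches. Assumption: a
        # single unmatched character is emitted as a word of its own.
        while size > 1 and sentence[start:start + size] not in dictionary:
            size -= 1
        words.append(sentence[start:start + size])
        start += size
    return words


# Toy dictionary built from the example sentence used in this description.
toy_dictionary = {"我", "要", "聽", "周杰倫", "的", "青花瓷"}
print(forward_max_match("我要聽周杰倫的青花瓷", toy_dictionary))
```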

In another preferred embodiment of the present invention, the preset word segmentation method may be reverse maximum matching against a large dictionary: take the m rightmost characters of the sentence to be segmented as the matching field, where m is the length of the longest word in the large dictionary; look the field up in the large dictionary; if the match succeeds, split the matched field off as a word; if the match fails, drop the first character of the matching field, use the remaining string as the new matching field, and match again; repeat this process until all words have been split off.
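Reverse maximum matching mirrors the forward variant but works from the end of the sentence. In the sketch below the function name is hypothetical, the single-character fallback is an assumption, and the test sentence "研究生命起源" with its dictionary is an illustrative example that does not come from this document; it is chosen because forward and reverse matching segment it differently.

```python
from typing import List, Set


def reverse_max_match(sentence: str, dictionary: Set[str]) -> List[str]:
    """Segment right to left, always taking the longest dictionary match."""
    max_len = max(len(word) for word in dictionary)  # m in the description
    words: List[str] = []
    end = len(sentence)
    while end > 0:
        size = min(max_len, end)
        # Drop the first character until the field matches. Assumption: a
        # single unmatched character is emitted as a word of its own.
        while size > 1 and sentence[end - size:end] not in dictionary:
            size -= 1
        words.append(sentence[end - size:end])
        end -= size
    words.reverse()  # words were collected from right to left
    return words


# Illustrative dictionary (not from the source document).
demo_dictionary = {"研究", "研究生", "生命", "命", "起源"}
print(reverse_max_match("研究生命起源", demo_dictionary))
```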

In another preferred embodiment of the present invention, the preset word segmentation method may also be bidirectional maximum matching against a large dictionary, which combines the forward and reverse maximum matching described above. Specifically: if forward and reverse maximum matching give the same result, either result is output; if they give different results, the result with fewer words is chosen first; if the word counts are equal, the result of reverse maximum matching is chosen.
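The selection rule for bidirectional matching can be written as a small function that takes the two segmentations as input. The function name is hypothetical, and the example segmentations are illustrative rather than taken from this document.

```python
from typing import List


def choose_segmentation(forward: List[str], backward: List[str]) -> List[str]:
    """Pick between forward and reverse maximum-matching results."""
    if forward == backward:
        return forward  # identical results: either one may be output
    if len(forward) != len(backward):
        # Different results: prefer the one with fewer words.
        return forward if len(forward) < len(backward) else backward
    return backward  # same word count: prefer the reverse result


# Illustrative segmentations of the same sentence (not from the source).
fwd = ["研究生", "命", "起源"]
bwd = ["研究", "生命", "起源"]
print(choose_segmentation(fwd, bwd))
```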

The "large dictionary" in the above embodiments refers to a dictionary database containing a large number of words, built by collecting and organizing them.

In other embodiments of the present invention, other word segmentation methods may also be used without affecting the scope of protection of the present invention.

In a preferred embodiment of the present invention, in step S4 the similar word selected for the replacement has the same part of speech as the word it replaces, for example both are nouns or both are verbs. This keeps the replacement accurate and avoids sentences whose logic no longer makes sense after replacement.
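The part-of-speech constraint amounts to filtering each set of similar words before substitution. A minimal sketch, assuming a lookup table from word to part of speech is available; the function name, the table `toy_pos`, its labels "v" and "n", and the words "播放" (play) and "音樂" (music) are invented for the example.

```python
from typing import Dict, List


def filter_by_pos(word: str, candidates: List[str], pos_of: Dict[str, str]) -> List[str]:
    """Keep only the candidates whose part of speech matches that of `word`."""
    target = pos_of.get(word)
    if target is None:
        return []  # assumption: no substitution when the POS is unknown
    return [cand for cand in candidates if pos_of.get(cand) == target]


# Hypothetical part-of-speech table ("v" = verb, "n" = noun).
toy_pos = {"聽": "v", "播放": "v", "音樂": "n"}
print(filter_by_pos("聽", ["播放", "音樂"], toy_pos))
```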

In a preferred embodiment of the present invention, the types of the semantically similar sentence samples include: a sentence type, in which the semantically similar sentence sample includes a plurality of words arranged in sequence; and a sentence-pattern type, in which the semantically similar sentence sample includes a plurality of words and part-of-speech tags arranged in sequence, or includes only a plurality of part-of-speech tags arranged in sequence. As shown in FIG. 3, step S7 then specifically includes: step S71, selecting and retaining the first N semantically similar sentence samples; step S72, determining whether semantically similar sentence samples of the sentence-pattern type need to be output: if so, go to step S73; if not, go to step S74; step S73, replacing the words included in the semantically similar sentence samples with the corresponding part-of-speech tags to form complete semantically similar sentence samples of the sentence-pattern type, and then performing the subsequent processing steps; step S74, performing the subsequent processing steps according to the retained semantically similar sentence samples.

Specifically, as described above, the semantically similar sentence samples likewise include the sentence type and the sentence-pattern type. In this embodiment, the user can set whether the semantically similar sentence samples finally output are of the sentence type or the sentence-pattern type: if the user selects the sentence type, the semantically similar sentence samples selected by the language model are output directly and the subsequent processing steps are performed.

If the user selects the sentence-pattern type for the final output, the words included in the semantically similar sentence samples need to be replaced with the corresponding part-of-speech tags to form complete semantically similar sentence samples of the sentence-pattern type, after which the subsequent processing steps are performed.
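Steps S71 to S74 can be sketched as follows. The sketch assumes that each candidate already carries the possible value produced by the language model in step S6; the function name `select_top_n`, the `as_pattern` flag, the word-to-tag table `tag_of`, and the choice to leave words without a tag unchanged are hypothetical details added for illustration.

```python
def select_top_n(scored_samples, n, as_pattern=False, tag_of=None):
    """Keep the N highest-scoring samples; optionally convert to patterns.

    scored_samples: list of (words, possible_value) pairs.
    """
    # Step S71: sort by possible value, high to low, and keep the first N.
    ranked = sorted(scored_samples, key=lambda item: item[1], reverse=True)
    kept = [words for words, _ in ranked[:n]]
    # Step S72: decide whether sentence-pattern output is requested.
    if not as_pattern:
        return kept  # Step S74: pass the retained sentences on as they are.
    # Step S73: replace words with their corresponding tags.
    # Assumption: words with no entry in `tag_of` are left unchanged.
    tag_of = tag_of or {}
    return [[tag_of.get(word, word) for word in words] for words in kept]


# Hypothetical scored candidates and tag table, for illustration only.
toy_samples = [
    (["play", "a", "song"], 0.6),
    (["sing", "a", "song"], 0.9),
    (["eat", "a", "song"], 0.1),
]
sentences = select_top_n(toy_samples, 2)
patterns = select_top_n(toy_samples, 2, as_pattern=True,
                        tag_of={"song": "<music>"})
```

With N = 2 the lowest-scoring candidate is dropped; in sentence-pattern mode the word "song" in each retained sample is replaced by the illustrative tag `<music>`.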

In a preferred embodiment of the present invention, the subsequent processing steps mentioned above may include developing a semantic open platform based on the large batch of automatically generated semantically similar sentence samples, or calculating semantic similarity, among others.

Specifically, in a preferred embodiment of the present invention, the function of the semantic open platform is to open semantic interfaces to other developers and help them complete the development of specific projects. When a user inputs a sentence or a sentence pattern, the method described above can automatically generate a large number of similar sentences or sentence patterns, thereby increasing semantic generalization ability, enhancing semantic understanding ability, reducing a large amount of manual work, saving time, and improving efficiency.

Correspondingly, in a preferred embodiment of the present invention, calculating semantic similarity requires a large number of semantically similar sentences or sentence patterns, and the method described above can generate, in large batches, the sentence samples used in the training process for semantic similarity calculation.

In a preferred embodiment of the present invention, in step S7 above, a set including the retained semantically similar sentence samples may finally be output for subsequent processing.

The above are only preferred embodiments of the present invention and do not limit its implementation or scope of protection. Those skilled in the art should appreciate that all solutions obtained through equivalent substitutions and obvious variations made using the description and drawings of the present invention fall within the scope of protection of the present invention.

S1‧‧‧Step S1

S2‧‧‧Step S2

S3‧‧‧Step S3

S4‧‧‧Step S4

S5‧‧‧Step S5

S6‧‧‧Step S6

S7‧‧‧Step S7

S11‧‧‧Step S11

S12‧‧‧Step S12

S13‧‧‧Step S13

S71‧‧‧Step S71

S72‧‧‧Step S72

S73‧‧‧Step S73

S74‧‧‧Step S74

FIG. 1 is a schematic diagram of the overall flow of a method for automatically generating semantically similar sentence samples in a preferred embodiment of the present invention; FIG. 2 is a schematic flow diagram, based on FIG. 1, of obtaining and processing an externally input sentence sample in a preferred embodiment of the present invention; FIG. 3 is a schematic flow diagram, based on FIG. 1, of selecting and retaining semantically similar sentence samples while processing the semantically similar sentence samples to be output, in a preferred embodiment of the present invention.

Claims

A method for automatically generating semantically similar sentence samples, applicable to natural language processing, wherein a word vector model for obtaining semantically similar words and a language model for judging the semantic possibility of the generated semantically similar sentence samples are pre-trained, the method further including: step S1, obtaining an externally input sentence sample; step S2, performing word segmentation on the sentence sample to decompose the sentence sample into a combination of a plurality of sequentially arranged words; step S3, using the word vector model to obtain, for each of the words included in the sentence sample, a set of similar words that are semantically similar to the word; step S4, selecting one of the similar words from the set corresponding to each of the words and replacing the word with it, to form a semantically similar sentence sample associated with the sentence sample; step S5, determining whether the set still contains similar words that have not yet been selected: if so, returning to step S4; step S6, using the language model to generate, for each semantically similar sentence sample, a possible value representing its semantic possibility, and sorting all the semantically similar sentence samples by possible value from high to low; step S7, selecting and retaining the first N semantically similar sentence samples, so as to perform subsequent processing steps according to the retained semantically similar sentence samples.

The method for automatically generating semantically similar sentence samples according to claim 1, wherein the types of the sentence sample include: a sentence type, in which the sentence sample includes a plurality of the words arranged in sequence; and a sentence-pattern type, in which the sentence sample includes a plurality of the words and part-of-speech tags of the words arranged in sequence, or includes a plurality of part-of-speech tags of the words arranged in sequence; step S1 then specifically includes: step S11, obtaining an externally input sentence sample; step S12, judging the type of the sentence sample: if the sentence sample is of the sentence-pattern type, going to step S13; if the sentence sample is of the sentence type, going directly to step S2; step S13, replacing each part-of-speech tag in the sentence sample with a high-frequency word corresponding to that part-of-speech tag to form a complete sentence sample, and then going to step S2.

The method for automatically generating semantically similar sentence samples as described in claim 1, wherein the word vector model is pre-trained using a preset word segmentation method; in step S2, the same preset word segmentation method is then used to perform word segmentation on the sentence sample.

The method for automatically generating a sample of semantically similar sentences as described in claim 1, wherein, in step S4, the similar word selected and used for replacement has the same part of speech as the replaced word.

The method for automatically generating semantically similar sentence samples as described in claim 1, wherein, in step S6, the possible value of each semantically similar sentence sample is a semantic score indicating the possibility that the semantically similar sentence sample holds as a complete sentence.

The method for automatically generating semantically similar sentence samples according to claim 1, wherein the types of the semantically similar sentence samples include: a sentence type, in which the semantically similar sentence sample includes a plurality of the words arranged in sequence; and a sentence-pattern type, in which the semantically similar sentence sample includes a plurality of the words and part-of-speech tags of the words arranged in sequence, or includes a plurality of part-of-speech tags of the words arranged in sequence; step S7 then specifically includes: step S71, selecting and retaining the first N semantically similar sentence samples; step S72, determining whether semantically similar sentence samples of the sentence-pattern type need to be output: if so, going to step S73; if not, going to step S74; step S73, replacing the words included in the semantically similar sentence samples with the corresponding part-of-speech tags to form complete semantically similar sentence samples, and then performing the subsequent processing steps; step S74, performing the subsequent processing steps according to the retained semantically similar sentence samples.

The method for automatically generating semantically similar sentence samples as described in claim 1, wherein, in step S7, after the first N semantically similar sentence samples are selected and retained, a sample set including the retained semantically similar sentence samples is output for the subsequent processing steps.