WO2021159613A1 - Text semantic similarity analysis method and apparatus, and computer device - Google Patents

Text semantic similarity analysis method and apparatus, and computer device Download PDF

Info

Publication number
WO2021159613A1
Authority
WO
WIPO (PCT)
Prior art keywords
similarity
semantic similarity
recognition model
text
data set
Prior art date
Application number
PCT/CN2020/087554
Other languages
French (fr)
Chinese (zh)
Inventor
李小娟
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021159613A1 publication Critical patent/WO2021159613A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • This application relates to the field of natural language processing technology, and in particular to a method, device and computer equipment for analyzing text semantic similarity.
  • Semantic similarity calculation can also be called text matching.
  • Text matching is a common problem in many natural language processing applications.
  • Short text similarity refers to similarity calculation over texts whose length falls within a certain range. Compared with long text, short text carries less information, which makes similarity calculation more challenging.
  • the current short text similarity calculation mainly adopts deep learning methods, which first require manually labeling a large amount of data and then using the labeled data to compute similarity.
  • this application provides a text semantic similarity analysis method, apparatus, and computer device, mainly to address the problems that, when performing similarity analysis on short texts in a target field, short text similarity data is difficult to obtain and label, and the effect of short text similarity algorithms is easily affected by data annotation quality, leading to unstable analysis results.
  • a method for analyzing text semantic similarity includes:
  • the semantic similarity recognition result is determined based on the semantic similarity.
  • a text semantic similarity analysis device which includes:
  • the acquisition module is used to acquire general data sets and target domain data sets
  • the input module is used to input the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
  • the determining module is used to determine the semantic similarity recognition result based on the semantic similarity.
  • a non-volatile readable storage medium on which a computer program is stored, and the program is executed by a processor to realize the above-mentioned text semantic similarity analysis method.
  • a computer device, including a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and runnable on the processor, wherein the processor, when executing the program, implements the above text semantic similarity analysis method.
  • by means of the above technical solutions, the application improves the analysis effect within the target field, and thereby also solves the problem of obtaining a large amount of training data in the target field.
  • FIG. 1 shows a schematic flowchart of a method for analyzing text semantic similarity provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of another method for analyzing text semantic similarity provided by an embodiment of the present application
  • FIG. 3 shows a schematic structural diagram of a text semantic similarity analysis device provided by an embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of another apparatus for analyzing text semantic similarity provided by an embodiment of the present application.
  • the embodiment of the present application provides a method for analyzing text semantic similarity. As shown in FIG. 1, the method includes:
  • the general data set can be, for example, roughly 400,000 short text similarity pairs obtained from sources such as the ATEC2018 Ant Financial short text semantic similarity competition, the CCKS2018 WeBank intelligent customer service question matching competition, and the LCQMC data set compiled by Harbin Institute of Technology.
  • the target field data set can be historical data records in the target field, data accumulated by search engines, and the like.
  • algorithms can be developed to make maximum use of knowledge from labeled domains to assist knowledge acquisition and learning in the target domain.
  • the core is to find the similarities between the source domain and the target domain and make rational use of them. Such similarity is very common.
  • for example, a model used to recognize cars can be used to improve the ability to recognize karts; transfer learning can store and reuse prior knowledge from different but related problems.
  • the similarity recognition model can be applied to the short text similarity detection in the target field, and the corresponding similarity is output according to the input short text pair.
  • the similarity recognition result corresponding to the semantic similarity can be determined by setting the similarity threshold.
  • the idea of transfer learning can be used to learn a general-field short text similarity analysis method from a large number of existing public data sets; then only a moderate amount of data in the target field needs to be labeled, and this labeled data is used for refined learning to realize short text similarity analysis in the target field.
  • this method can not only learn the semantic information of short text similarity from general data, but also apply this prior knowledge in a targeted manner to short text similarity analysis in the target field, improving the analysis effect within the field and also solving the problem of obtaining a large amount of training data in the target field.
  • the method includes:
  • a general data set can be used instead during pre-training, and the acquired target field data set can then be used for further corrective training. Therefore, in this application, a large number of general data sets need to be obtained in advance, and a predetermined number of target field data sets that meet the correction standard need to be collected as far as possible.
  • Two short texts are arbitrarily selected from the general data set to form a text pair to be tested.
  • short texts can be randomly selected from the general data set to form text pairs to be tested, which are used for repeated and comprehensive training of the semantic similarity recognition model.
  • the text pair to be tested is preprocessed and input to the Embedding layer in the semantic similarity recognition model to obtain the first sequence and the second sequence.
  • the first sequence corresponds to the mapping result of one of the short texts in the text pair to be tested.
  • the second sequence corresponds to the mapping result of another short text in the text pair to be tested.
  • BiLSTM can learn the word in a sentence and its context to obtain a new Embedding vector.
  • the first vector and the second vector can be calculated by formulas given in the original application (the formula images are not reproduced in this text).
  • based on step 204 of the embodiment, the first vector and the second vector can be obtained, and the difference between the first vector and the second vector is calculated, where an attention model can be applied.
  • the attention weights are calculated by a formula given in the original application (the formula image is not reproduced in this text).
  • the third sequence and the fourth sequence are respectively subtracted from and multiplied with the first sequence and the second sequence obtained above, and the results are spliced together; the obtained values are then sent into BiLSTM again, where this BiLSTM mainly captures the local inference information m_a and m_b together with context information.
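The attention computation referenced above appears only as a formula image in the original and does not survive in this text. Assuming the standard ESIM-style soft alignment over the BiLSTM outputs (an assumption consistent with the surrounding steps, not a formula taken from the source), it can be written as:

```latex
e_{ij} = \bar{a}_i^{\top}\,\bar{b}_j, \qquad
\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b}\exp(e_{ik})}\,\bar{b}_j, \qquad
\tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a}\exp(e_{kj})}\,\bar{a}_i
```

Here $\bar{a}$ and $\bar{b}$ are the BiLSTM outputs for the two short texts, and $\tilde{a}$ and $\tilde{b}$ would correspond to the weighted third and fourth sequences that are then combined with $\bar{a}$ and $\bar{b}$ by subtraction, multiplication, and splicing.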
  • v_a and v_b are input into the pooling layer in turn.
  • a softmax output layer can be used with 2 output categories; the output value is a number between 0 and 1, namely the similarity value.
  • the first similarity recognition result is further determined according to the similarity value, where the closer the similarity value is to 1, the more similar the two input sentences are; otherwise, the less similar they are.
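The stages above (soft-alignment attention, subtract/multiply/splice, pooling, two-category softmax output) can be sketched in NumPy. This is a minimal illustration under assumed toy dimensions, not the application's actual implementation: the two BiLSTM stages are omitted (their outputs are taken as given inputs), and `w_out` is a hypothetical output-layer weight matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_head(a_bar, b_bar, w_out):
    """Sketch of the stages after the first BiLSTM.

    a_bar, b_bar: BiLSTM outputs for the two short texts, shape (len, d).
    w_out: hypothetical softmax output-layer weights, shape (16 * d, 2).
    The second BiLSTM is skipped here for brevity.
    """
    # soft-alignment attention weights between the two sequences
    e = a_bar @ b_bar.T                          # shape (len_a, len_b)
    a_tilde = softmax(e, axis=1) @ b_bar         # weighted "third sequence"
    b_tilde = softmax(e, axis=0).T @ a_bar       # weighted "fourth sequence"
    # subtract and multiply, then splice with the original sequences
    m_a = np.concatenate([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], axis=1)
    m_b = np.concatenate([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], axis=1)
    # pooling layer: average- and max-pool over time, then splice
    v = np.concatenate([m_a.mean(axis=0), m_a.max(axis=0),
                        m_b.mean(axis=0), m_b.max(axis=0)])
    # softmax output layer with 2 categories; element 1 is the similarity value
    probs = softmax(v @ w_out)
    return float(probs[1])
```

With any finite inputs the returned value lies strictly between 0 and 1, matching the similarity value described in the text.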
  • the first target recognition result can be obtained in advance from the labels in the text pairs to be tested. After the first similarity recognition result is obtained, it can be matched against the first target recognition result, and the first accuracy loss is then determined according to the similarity between the two.
  • the loss function of the training process is softmaxwithloss
  • the learning rate can be initially 1e-3
  • the learning rate is set to decay dynamically during training. After training converges, the similarity recognition model is saved.
  • step 210 of the embodiment may specifically include: if it is determined that the data amount of the target domain data set is less than or equal to the first preset threshold and the text similarity is greater than the second preset threshold, modifying the output category of the softmax layer in the semantic similarity recognition model; if it is determined that the data amount of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model using the target domain data set; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and further training it on the target domain data set.
  • this application is applicable to situations where the amount of data is small but the data similarity is high, in which case the softmax output layer remains the same.
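The four-way decision in step 210 can be condensed into a small helper. This is an illustrative sketch: the threshold values are left as parameters because the application only calls them the first and second preset thresholds, and the fourth branch, whose sentence is cut off in the source text, is completed here with the conventional choice of keeping the pretrained architecture and weights and fine-tuning the whole model.

```python
def choose_adaptation_strategy(target_data_amount, text_similarity,
                               first_threshold, second_threshold):
    """Map target-domain data amount and text similarity to a
    transfer-learning strategy (threshold values are assumptions)."""
    small = target_data_amount <= first_threshold
    similar = text_similarity > second_threshold
    if small and similar:
        return "modify the output category of the softmax layer"
    if small and not similar:
        return "freeze the initial layers and retrain the remaining layers"
    if not small and not similar:
        return "retrain the model using the target domain data set"
    # large data set, high similarity: the source sentence is truncated;
    # the usual choice is to keep architecture/initial weights and fine-tune
    return "keep the architecture and initial weights and fine-tune"
```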
  • the positive training samples can be labeled by user clicks and other behaviors.
  • different query commands can be treated as similar questions.
  • step 212 of the embodiment may specifically include: randomly selecting two short text sentences from the target domain data set to construct sample sentence pairs, and performing similarity calculation on the sample sentence pairs based on the Jaccard similarity measure to obtain a similarity calculation result; if the similarity calculation result is greater than the third preset threshold, the corresponding sample sentence pair is determined as a negative training sample.
  • J(A, B) is the similarity calculation result
  • A is a short text sentence in the sample sentence pair
  • B is another short text sentence in the sample sentence pair.
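The Jaccard-based negative-sample screening in step 212 can be sketched as follows. Treating each sentence as its character set is an assumption for illustration, since the source does not state how A and B are turned into sets.

```python
def jaccard_similarity(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, here over the character sets of
    two short text sentences (set construction is an assumption)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def screen_negative_samples(sample_pairs, third_threshold):
    """Keep sample sentence pairs whose Jaccard similarity exceeds the
    third preset threshold as negative training samples."""
    return [pair for pair in sample_pairs
            if jaccard_similarity(*pair) > third_threshold]
```

For example, `jaccard_similarity("abc", "abd")` is 2/4 = 0.5, so with a third preset threshold of 0.4 that pair would be kept as a negative sample.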
  • the positive training samples and negative training samples can be input into the adjusted semantic similarity recognition model, and the semantic similarity recognition model can be further trained and revised to obtain the corresponding second similarity recognition result.
  • the second target recognition result can be obtained in advance from the labels in the positive and negative training samples. After the second similarity recognition result is obtained, it can be matched against the second target recognition result, and the second accuracy loss is then determined according to the similarity between the two.
  • the loss function of the training process is softmaxwithloss
  • the learning rate can be initially 1e-4
  • the learning rate is set to decay dynamically during training; after training converges and the recognition accuracy is greater than or equal to the recognition accuracy set in the preset standard, the semantic similarity recognition model is saved.
  • the two target short texts to be recognized for semantic similarity can be input into the semantic similarity recognition model to obtain the similarity between the two target short texts.
  • step 217 of the embodiment may specifically include: comparing the similarity value with the fourth preset threshold and the fifth preset threshold; if it is determined that the similarity value is less than the fourth preset threshold, determining that the semantic similarity recognition result is dissimilar; if it is determined that the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determining that the semantic similarity recognition result is moderately similar; if it is determined that the similarity value is greater than or equal to the fifth preset threshold, determining that the semantic similarity recognition result is highly similar; and outputting the similarity recognition result.
  • the way of determining the semantic similarity recognition result from the similarity value is not limited to the above case and can be implemented in multiple ways. For example, only one preset threshold may be set: when the similarity value is greater than the preset threshold, the semantic similarity recognition result is judged as similar; otherwise it is judged as dissimilar.
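The two-threshold decision of step 217, and the single-threshold variant just mentioned, can both be sketched as below; the numeric defaults are illustrative assumptions, since the application only speaks of preset thresholds.

```python
def recognition_result(similarity, fourth_threshold=0.5, fifth_threshold=0.8):
    """Map a similarity value in [0, 1] to a recognition result using
    the fourth and fifth preset thresholds (defaults are assumptions)."""
    if similarity < fourth_threshold:
        return "dissimilar"
    if similarity < fifth_threshold:
        return "moderately similar"
    return "highly similar"

def recognition_result_single(similarity, threshold=0.5):
    """Single-threshold variant: similar vs. dissimilar."""
    return "similar" if similarity > threshold else "dissimilar"
```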
  • data from labeled fields can be used to the maximum extent to train the semantic similarity recognition model, and the model is then applied to the target field based on the idea of transfer learning, requiring only a moderate amount of labeling in the target field.
  • this method can not only learn the semantic information of short text similarity from general data, but can also apply this prior knowledge in a targeted manner to short text similarity calculation in the target field, improving the calculation effect within the field, solving the problem of obtaining a large amount of training data in the target field, and improving the accuracy and efficiency of semantic similarity calculation.
  • an embodiment of the present application provides a text semantic similarity analysis device.
  • the device includes: an acquisition module 31, a training module 32, an adjustment module 33, an input module 34, and a determination module 35.
  • the obtaining module 31 can be used to obtain a general data set and a target field data set
  • the training module 32 can be used to train a semantic similarity recognition model using a general data set as a training sample
  • the adjustment module 33 can be used to adjust the semantic similarity recognition model by using the target domain data set as the transfer learning sample;
  • the input module 34 can be used to input the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
  • the determining module 35 can be used to determine the semantic similarity recognition result based on the semantic similarity.
  • the training module 32 can be specifically used to arbitrarily select two short texts from the general data set to form a text pair to be tested;
  • the text pair to be tested is preprocessed and input to the Embedding layer in the semantic similarity recognition model to obtain the first sequence and the second sequence, where the first sequence corresponds to the mapping result of one of the short texts in the text pair to be tested;
  • the second sequence corresponds to the mapping result of the other short text in the text pair to be tested;
  • the first sequence and the second sequence are input into the bidirectional long short-term memory network BiLSTM to obtain the corresponding first vector and second vector; the difference between the first vector and the second vector is calculated to obtain the weighted third sequence corresponding to the first vector and the weighted fourth sequence corresponding to the second vector;
  • the feature vector is calculated according to the first sequence, the second sequence, the third sequence, and the fourth sequence, and the first similarity recognition result is output;
  • the adjustment module 33 can be specifically used to adjust the semantic similarity recognition model according to the data amount of the target domain data set and the degree of text similarity; construct positive training samples based on historical data records in the target field data set; screen negative training samples based on the Jaccard similarity measure; input the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain the second similarity recognition result; determine the second accuracy loss of the second similarity recognition result relative to the second target recognition result; and determine the second loss function based on the second accuracy loss, using the second loss function to optimize the adjusted semantic similarity recognition model so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  • the adjustment module 33 can be specifically used to: if it is determined that the data amount of the target field data set is less than or equal to the first preset threshold and the text similarity is greater than the second preset threshold, modify the output category of the softmax layer in the semantic similarity recognition model; if it is determined that the data amount of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freeze the initial layers in the semantic similarity recognition model and retrain the remaining layers; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retrain the semantic similarity recognition model using the target domain data set; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retain the architecture and initial weights of the semantic similarity recognition model and further train it on the target domain data set.
  • the adjustment module 33 can be specifically used to randomly select two short text sentences from the target field data set to construct sample sentence pairs, perform similarity calculation on the sample sentence pairs based on the Jaccard similarity measure to obtain a similarity calculation result, and, if the similarity calculation result is greater than the third preset threshold, determine the corresponding sample sentence pair as a negative training sample.
  • J(A, B) is the similarity calculation result
  • A is a short text sentence in the sample sentence pair
  • B is another short text sentence in the sample sentence pair.
  • the determining module 35 may be specifically configured to compare the similarity value with the fourth preset threshold and the fifth preset threshold; if it is determined that the similarity value is less than the fourth preset threshold, determine that the semantic similarity recognition result is dissimilar; if it is determined that the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determine that the semantic similarity recognition result is moderately similar; if it is determined that the similarity value is greater than or equal to the fifth preset threshold, determine that the semantic similarity recognition result is highly similar;
  • in order to display the semantic similarity recognition result on a display page, as shown in FIG. 4, the device further includes: an output module 36.
  • the output module 36 is used to output the similarity recognition result.
  • an embodiment of the present application also provides a storage medium on which a computer program is stored.
  • the storage medium may be non-volatile or volatile.
  • the technical solution of the present application can be embodied in the form of a software product.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the methods in each implementation scenario of the present application.
  • the embodiments of the present application also provide a computer device, which may be a personal computer, a server, a network device, etc.
  • the physical device includes a storage medium and a processor; the storage medium is used to store a computer program; the processor is used to execute the computer program to realize the text semantic similarity analysis method shown in FIG. 1 and FIG. 2.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of computers. Disclosed in the present application are a text semantic similarity analysis method and apparatus, and a computer device, which can solve the problems that when similarity analysis is carried out on short text in a target domain, short text similarity data is difficult to obtain and label, a short text similarity algorithm effect is easily affected by the data labeling quality, so that a calculation result is unstable. The method comprises: obtaining a universal data set and a target domain data set; training a semantic similarity recognition model by taking the universal data set as a training sample; adjusting the semantic similarity recognition model by using the target domain data set as a transfer learning sample; inputting target short text to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain semantic similarity; and determining a semantic similarity recognition result on the basis of the semantic similarity. The present application is suitable for analyzing text semantic similarity in a target domain.

Description

Analysis method, apparatus and computer equipment for text semantic similarity
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 14, 2020, with application number 202010092595.3 and titled "Analysis Method, Apparatus and Computer Equipment for Text Semantic Similarity", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of natural language processing technology, and in particular to a method, apparatus and computer equipment for analyzing text semantic similarity.
Background
Semantic similarity calculation can also be called text matching. Text matching is a common problem in many natural language processing applications. Short text similarity refers to similarity calculation over texts whose length falls within a certain range. Compared with long text, short text carries less information, which makes similarity calculation more challenging. Current short text similarity calculation mainly adopts deep learning methods, which first require manually labeling a large amount of data and then using the labeled data to compute similarity.
However, the inventor found that with existing domain-specific short text similarity calculation, if public data in the field is scarce, short text similarity data is difficult to obtain and label, and the effect of the short text similarity algorithm is easily affected by data annotation quality, leading to unstable calculation results.
Summary
In view of this, this application provides a text semantic similarity analysis method, apparatus and computer equipment, mainly to address the problems that, when performing similarity analysis on short texts in a target field, short text similarity data is difficult to obtain and label, and the effect of short text similarity algorithms is easily affected by data annotation quality, leading to unstable analysis results.
According to one aspect of the present application, a method for analyzing text semantic similarity is provided. The method includes:
obtaining a general data set and a target field data set;
training a semantic similarity recognition model using the general data set as training samples;
adjusting the semantic similarity recognition model using the target field data set as transfer learning samples;
inputting the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
determining a semantic similarity recognition result based on the semantic similarity.
According to another aspect of the present application, a text semantic similarity analysis apparatus is provided. The apparatus includes:
an acquisition module, used to obtain a general data set and a target field data set;
a training module, used to train a semantic similarity recognition model using the general data set as training samples;
an adjustment module, used to adjust the semantic similarity recognition model using the target field data set as transfer learning samples;
an input module, used to input the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
a determining module, used to determine a semantic similarity recognition result based on the semantic similarity.
According to another aspect of the present application, a non-volatile readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the above text semantic similarity analysis method is realized.
According to yet another aspect of the present application, a computer device is provided, including a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and runnable on the processor, wherein the processor, when executing the program, implements the above text semantic similarity analysis method.
By means of the above technical solutions, this application improves the analysis effect within the target field and thereby also solves the problem of obtaining a large amount of training data in the target field.
Description of the Drawings
The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The exemplary embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of the application. In the drawings:
FIG. 1 shows a schematic flowchart of a method for analyzing text semantic similarity provided by an embodiment of the present application;
FIG. 2 shows a schematic flowchart of another method for analyzing text semantic similarity provided by an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of a text semantic similarity analysis apparatus provided by an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of another apparatus for analyzing text semantic similarity provided by an embodiment of the present application.
具体实施方式Detailed ways
本申请实施例提供了一种文本语义相似度的分析方法,如图1所示,该 方法包括:The embodiment of the present application provides a method for analyzing text semantic similarity. As shown in FIG. 1, the method includes:
101、获取通用数据集以及目标领域数据集。101. Obtain general data sets and target domain data sets.
其中,通用数据集可为:由ATEC2018蚂蚁金服短文本语义相似度竞赛,CCKS2018微众银行智能客服问句匹配大赛,哈工大整理的数据集LCQMC等方式获取到的40万短文本相似度数据集;目标领域数据集可为目标领域内的历史数据记录、搜索引擎等积累数据等。Among them, the general data set can be: 400,000 short text similarity data sets obtained by ATEC2018 Ant Financial Short Text Semantic Similarity Competition, CCKS2018 WeBank Intelligent Customer Service Question Matching Competition, Harbin Institute of Technology's data set LCQMC and other methods. ; The target field data set can be historical data records in the target field, search engines and other accumulated data.
102. Train a semantic similarity recognition model using the general data set as training samples.
In a specific application scenario, computing similarity requires annotating whether two sentences are similar, and the amount of data cannot be too small and must have a certain degree of generality, which makes annotation an arduous task. For this reason, short-text similarity computation has long been a topic worth studying. In the present application, a general data set with a large amount of data may be selected as training samples to preliminarily train the semantic similarity recognition model.
103. Adjust the semantic similarity recognition model using the target-field data set as transfer learning samples.
In a specific application scenario, algorithms can be developed to make maximum use of knowledge from an annotated domain to assist knowledge acquisition and learning in the target domain. The core is to find the similarities between the source domain and the target domain and exploit them appropriately. Such similarity is very common: for example, a model trained to recognize cars can be used to improve the ability to recognize karts. Transfer learning can thus store and reuse prior knowledge from different but related problems.
104. Input the target short texts to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain the semantic similarity.
In a specific application scenario, after the adjustment of the similarity recognition model is completed, the model can be applied to short-text similarity detection in the target field: given an input short-text pair, it outputs the corresponding similarity.
105. Determine a semantic similarity recognition result based on the semantic similarity.
Correspondingly, the similarity recognition result corresponding to the semantic similarity can be determined by setting a similarity threshold.
Through the text semantic similarity analysis method of this embodiment, the idea of transfer learning can be used to learn a general-field short-text similarity model from a large number of existing public data sets. It is then only necessary to annotate a moderate amount of data in the target field and use this annotated data for fine-grained learning, realizing short-text similarity analysis in the target field. Compared with directly using general data, financial data, or a mixture of the two, this approach both learns the semantic information of short-text similarity from the general data and applies this prior knowledge to short-text similarity analysis in the target field in a targeted manner, improving the analysis performance within the field and also solving the problem of obtaining a large amount of training data in the target field.
Further, as a refinement and extension of the specific implementation of the foregoing embodiment, in order to fully illustrate the specific implementation process of this embodiment, another text semantic similarity analysis method is provided. As shown in FIG. 2, the method includes:
201. Obtain a general data set and a target-field data set.
For this embodiment, in a specific application scenario, deep-learning-based short-text similarity requires a large amount of manually annotated data, but little data is available in the target field, so the analysis performance of short-text similarity within the target field is unsatisfactory. Therefore, a general data set can be used as a substitute in the early training stage, after which the acquired target-field data set is used to further refine the training. Accordingly, in the present application a large general data set needs to be obtained in advance, and a predetermined number of target-field data sets that meet the refinement criteria should be collected as far as possible.
202. Arbitrarily select two short texts from the general data set to form a text pair to be tested.
For this embodiment, in a specific application scenario, in order to ensure the accuracy of training, short texts can be randomly drawn from the general data set to form text pairs to be tested, which are used to train the semantic similarity recognition model repeatedly and comprehensively.
203. Preprocess the text pair to be tested and input it into the Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, where the first sequence corresponds to the mapping result of one short text in the pair and the second sequence corresponds to the mapping result of the other short text.
For example, given two input sentences A and B, after preprocessing and mapping through the Embedding layer, the first sequence a = (a_1, ..., a_{l_a}) and the second sequence b = (b_1, ..., b_{l_b}) are obtained, where a_i, b_j ∈ R^l are l-dimensional vectors output by the Embedding layer.
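As a minimal illustration (the vocabulary, tokenization, and embedding matrix below are hypothetical placeholders, not part of the application; in practice they come from the model's trained Embedding layer), the mapping of a tokenized short text to its sequence of l-dimensional vectors can be sketched as:

```python
import numpy as np

# Hypothetical vocabulary and embedding matrix; in the application these
# would come from the model's trained Embedding layer.
rng = np.random.default_rng(0)
vocab = {tok: idx for idx, tok in enumerate(["你", "好", "吗", "在", "么"])}
l_dim = 8                                      # embedding dimension l
emb = rng.standard_normal((len(vocab), l_dim))

def embed(tokens):
    """Map a tokenized short text to its sequence of l-dimensional vectors."""
    return emb[[vocab[t] for t in tokens]]

a = embed(["你", "好", "吗"])   # first sequence, shape (l_a, l)
b = embed(["你", "在", "么"])   # second sequence, shape (l_b, l)
```

Each row of the result is one token's embedding vector, so the two sentences of a text pair become the two sequences a and b described above.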
204. Input the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector.
For example, the first sequence and the second sequence obtained in step 203 of the embodiment are input into the bidirectional long short-term memory network BiLSTM. The BiLSTM can learn each word of a sentence together with its context, yielding new embedding vectors, namely:

ā_i = BiLSTM(a, i), i ∈ [1, l_a]

b̄_j = BiLSTM(b, j), j ∈ [1, l_b]

where ā_i denotes the output of a at the i-th time step of the BiLSTM network, and b̄_j denotes the output of b at the j-th time step of the BiLSTM network. From these formulas, the first vector ā = (ā_1, ..., ā_{l_a}) and the second vector b̄ = (b̄_1, ..., b̄_{l_b}) can be computed.
205. Compute the difference between the first vector and the second vector, and obtain a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector.
For example, based on step 204 of the embodiment, the first vector ā and the second vector b̄ can be obtained, and the difference between them is computed; here an attention model can be applied. The attention weights are computed as:

e_ij = ā_i^T · b̄_j

Based on the above attention weights, the weighted values of a and b are then computed respectively, namely:

ã_i = Σ_{j=1..l_b} [ exp(e_ij) / Σ_{k=1..l_b} exp(e_ik) ] · b̄_j, i ∈ [1, l_a]

b̃_j = Σ_{i=1..l_a} [ exp(e_ij) / Σ_{k=1..l_a} exp(e_kj) ] · ā_i, j ∈ [1, l_b]

where ã = (ã_1, ..., ã_{l_a}) is the third sequence and b̃ = (b̃_1, ..., b̃_{l_b}) is the fourth sequence.
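The soft-alignment step above can be sketched in NumPy as follows (a minimal sketch: the BiLSTM outputs ā and b̄ are assumed to be given as matrices, and the sequence lengths and dimension are illustrative):

```python
import numpy as np

def soft_align(a_bar, b_bar):
    """Compute the attention weights e = a_bar @ b_bar.T, then re-express
    each a_i as a softmax-weighted sum of the b_j (and vice versa)."""
    e = a_bar @ b_bar.T                      # (l_a, l_b) attention weights

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        p = np.exp(x)
        return p / p.sum(axis=axis, keepdims=True)

    a_tilde = softmax(e, axis=1) @ b_bar     # third sequence, shape (l_a, l)
    b_tilde = softmax(e, axis=0).T @ a_bar   # fourth sequence, shape (l_b, l)
    return a_tilde, b_tilde

rng = np.random.default_rng(1)
a_bar, b_bar = rng.standard_normal((4, 8)), rng.standard_normal((5, 8))
a_tilde, b_tilde = soft_align(a_bar, b_bar)
```

Normalizing e over its columns gives the weights for ã, and normalizing over its rows gives the weights for b̃, matching the two summation formulas above.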
206. Compute a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence.
In a specific application scenario, in order to fully capture the difference information and the interaction information between the two sentences, elementwise subtraction and elementwise multiplication are applied between the first vector and the third sequence and between the second vector and the fourth sequence, and the results are concatenated with those sequences, giving:

m_a = [ā; ã; ā − ã; ā ⊙ ã]

m_b = [b̄; b̃; b̄ − b̃; b̄ ⊙ b̃]

The resulting values are then fed into a BiLSTM again; here the BiLSTM mainly captures the local inference information contained in m_a and m_b together with its context, yielding v_a and v_b. Next, v_a and v_b are passed in turn through the pooling layers, which include a max pooling layer and an average pooling layer, after which the pooled results are concatenated once more to obtain the feature vector V = [v_{a,ave}; v_{a,max}; v_{b,ave}; v_{b,max}].
207. Output a first similarity recognition result based on the feature vector.
Correspondingly, after the feature vector is obtained, it can be passed through a softmax output layer with 2 output classes; the output value is a number between 0 and 1, i.e. the similarity value. The first similarity recognition result is then determined from the similarity value: the closer the similarity value is to 1, the more similar the two input sentences are; otherwise, the less similar they are.
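As a small illustration of this output layer (the logit values below are hypothetical; in the application they would come from a fully connected layer applied to the feature vector V):

```python
import numpy as np

def similarity_from_logits(logits):
    """Softmax over the two output classes; the probability of the
    'similar' class (index 1) serves as the similarity value in [0, 1]."""
    z = np.exp(logits - np.max(logits))
    return (z / z.sum())[1]

sim = similarity_from_logits(np.array([0.3, 2.1]))  # hypothetical logits
```

Because the two class probabilities sum to 1, reporting the probability of the "similar" class yields a single similarity value between 0 and 1.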
208. Determine a first accuracy loss of the first similarity recognition result relative to a first target recognition result.
In a specific application scenario, the first target recognition result can be obtained in advance from the labels of the text pair to be tested. After the first similarity recognition result is obtained, it can be matched against the first target recognition result, and the first accuracy loss is further determined from the similarity between the two.
209. Determine a first loss function based on the first accuracy loss, and optimize the semantic similarity recognition model using the first loss function.
For this embodiment, the loss function of the training process is softmax-with-loss; the learning rate can be initialized to 1e-3 and set to decay dynamically as training proceeds. After training converges, the similarity recognition model is saved.
210. Adjust the semantic similarity recognition model according to the data volume of the target-field data set and the magnitude of the text similarity.
For this embodiment, in a specific application scenario, step 210 may specifically include: if it is determined that the data volume of the target-field data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output classes of the softmax layer in the semantic similarity recognition model; if the data volume is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers; if the data volume is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target-field data set; and if the data volume is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and using the initial weights to retrain it.
In a specific application scenario, the present application is applicable to situations where the amount of data is small but the data similarity is high, and the softmax output layer is the same. In the fine-tuning stage, the pre-trained model weights can be used directly, and the network can be further trained with a smaller learning rate (e.g. 1e-4) to obtain the final similarity detection model.
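The four branches of step 210 can be sketched as a simple decision function (the threshold parameters and strategy labels are illustrative names, not part of the application):

```python
def choose_adaptation_strategy(data_volume, text_similarity,
                               first_threshold, second_threshold):
    """Pick a transfer-learning strategy from the target-field data volume
    and the text similarity, mirroring the four cases of step 210."""
    small = data_volume <= first_threshold
    similar = text_similarity > second_threshold
    if small and similar:
        return "modify softmax output classes"
    if small and not similar:
        return "freeze initial layers, retrain remaining layers"
    if not small and not similar:
        return "retrain model on target-field data set"
    return "keep architecture and initial weights, retrain from them"

strategy = choose_adaptation_strategy(5_000, 0.9, 10_000, 0.5)
```

With a small but highly similar target-field data set, as in the call above, the chosen strategy is the softmax-modification branch, which matches the fine-tuning scenario described in the preceding paragraph.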
211. Construct positive training samples using the historical data records in the target-field data set.
For this embodiment, in a specific application scenario, the positive training samples can be annotated with guidance from user behavior such as clicks; for example, different queries that lead to the same search click behavior can be treated as similar questions.
212. Screen negative training samples based on the Jaccard similarity measure.
For this embodiment, in a specific application scenario, step 212 may specifically include: randomly selecting two short text sentences from the target-field data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
The Jaccard similarity measure is computed as:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

where J(A, B) is the similarity calculation result, A is one short text sentence of the sample sentence pair, and B is the other short text sentence of the sample sentence pair.
In a specific application scenario, when constructing negative training samples, in order to screen out a large number of completely unrelated sentence pairs as negative training samples, the similarity of the randomly selected pairwise sentence combinations needs to be computed in advance, and data that do not meet the similarity threshold are filtered out. At the same time, some sentence pairs with low similarity are also retained to ensure the diversity of the data. The similarity here only needs to reflect whether the literal wording is close.
For example, take sentence 1: "你是哪个公司的,找我干嘛?" and sentence 2: "你是哪个公司的,我不是你说的那个人。". After removing the punctuation, the two sentences can be converted into the character sets A = {你, 是, 哪, 个, 公, 司, 的, 找, 我, 干, 嘛} and B = {你, 是, 哪, 个, 公, 司, 的, 我, 不, 说, 那, 人}. Their union is A ∪ B = {你, 是, 哪, 个, 公, 司, 的, 找, 我, 干, 嘛, 不, 说, 那, 人} and their intersection is A ∩ B = {你, 是, 哪, 个, 公, 司, 的, 我}, so the Jaccard coefficient is: number of elements in the intersection / number of elements in the union = 8/15 ≈ 0.53, i.e. the Jaccard similarity of sentence 1 and sentence 2 is about 0.53. Afterwards, the similarity of a sentence pair whose Jaccard similarity is greater than or equal to a preset threshold can be set to 1 and otherwise to 0, and the sentence pairs with similarity 1 are further retained as negative training samples.
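The character-level Jaccard computation in the example above can be sketched directly (the punctuation-stripping rule here, dropping every non-word character, is one reasonable choice of preprocessing, not mandated by the application):

```python
import re

def jaccard(s1, s2):
    """Character-level Jaccard similarity of two short texts:
    strip punctuation, build character sets, then |A ∩ B| / |A ∪ B|."""
    a = set(re.sub(r"[^\w]", "", s1))  # \w matches CJK characters in Python 3
    b = set(re.sub(r"[^\w]", "", s2))
    return len(a & b) / len(a | b)

sim = jaccard("你是哪个公司的,找我干嘛?", "你是哪个公司的,我不是你说的那个人。")
```

For the two example sentences this yields 8/15 ≈ 0.53, matching the set computation above.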
213. Input the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result.
In a specific application scenario, the positive and negative training samples can be input into the adjusted semantic similarity recognition model to further train and refine it, obtaining the corresponding second similarity recognition result.
214. Determine a second accuracy loss of the second similarity recognition result relative to a second target recognition result.
In a specific application scenario, the second target recognition result can be obtained in advance from the labels of the positive and negative training samples. After the second similarity recognition result is obtained, it can be matched against the second target recognition result, and the second accuracy loss is further determined from the similarity between the two.
215. Determine a second loss function based on the second accuracy loss, and optimize the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
For this embodiment, the loss function of the training process is softmax-with-loss; the learning rate can be initialized to 1e-4 and set to decay dynamically as training proceeds. After training converges and the recognition accuracy is greater than or equal to the accuracy specified in the preset standard, the semantic similarity recognition model is saved.
216. Input the target short texts to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain the semantic similarity.
In a specific application scenario, after the adjustment of the semantic similarity recognition model is completed, the two target short texts to be subjected to semantic similarity recognition can be input into the model to obtain the similarity between them.
217. Determine a semantic similarity recognition result based on the semantic similarity.
For this embodiment, in a specific application scenario, step 217 may specifically include: comparing the similarity value with a fourth preset threshold and a fifth preset threshold; if the similarity value is less than the fourth preset threshold, determining that the semantic similarity recognition result is dissimilar; if the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determining that the semantic similarity recognition result is moderately similar; if the similarity value is greater than or equal to the fifth preset threshold, determining that the semantic similarity recognition result is highly similar; and outputting the similarity recognition result.
For this embodiment, it should be noted that the way of determining the semantic similarity recognition result from the similarity value is not limited to the above case and may include multiple implementations; for example, only a single preset threshold may be set, and when the similarity value is greater than that threshold, the semantic similarity recognition result is judged to be similar, and otherwise dissimilar.
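The two-threshold mapping of step 217 can be sketched as follows (the concrete threshold values 0.5 and 0.8 are illustrative defaults; the application only requires that the fourth threshold be below the fifth):

```python
def recognize(similarity, fourth_threshold=0.5, fifth_threshold=0.8):
    """Map a similarity value in [0, 1] to a recognition result using the
    fourth and fifth preset thresholds of step 217."""
    if similarity < fourth_threshold:
        return "dissimilar"
    if similarity < fifth_threshold:
        return "moderately similar"
    return "highly similar"

result = recognize(0.9)
```

The single-threshold variant mentioned above is the special case where the two thresholds coincide, collapsing the middle band to nothing.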
Through the above text semantic similarity analysis method, the data of the annotated domain can be used to the greatest extent to train the semantic similarity recognition model, and the model is then applied to the target field based on the idea of transfer learning. Only a moderate amount of target-field data needs to be annotated; the target-field data are used to adjust the semantic similarity recognition model, and training yields a similarity detection model suitable for the target field, thereby realizing the recognition and judgment of short-text similarity in the target field. Compared with directly using general data, target-field data, or a mixture of the two, this approach both learns the semantic information of short-text similarity from the general data and applies this prior knowledge to short-text similarity computation in the target field in a targeted manner, improving the computation performance within the field, solving the problem of obtaining a large amount of training data in the target field, and improving the accuracy and efficiency of semantic similarity computation.
Further, as a specific embodiment of the methods shown in FIG. 1 and FIG. 2, an embodiment of the present application provides a text semantic similarity analysis apparatus. As shown in FIG. 3, the apparatus includes: an acquisition module 31, a training module 32, an adjustment module 33, an input module 34, and a determination module 35.
The acquisition module 31 can be used to obtain a general data set and a target-field data set;
the training module 32 can be used to train a semantic similarity recognition model using the general data set as training samples;
the adjustment module 33 can be used to adjust the semantic similarity recognition model using the target-field data set as transfer learning samples;
the input module 34 can be used to input the target short texts to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain the semantic similarity;
the determination module 35 can be used to determine a semantic similarity recognition result based on the semantic similarity.
In a specific application scenario, in order to train the semantic similarity recognition model with the general data set, the training module 32 can specifically be used to: arbitrarily select two short texts from the general data set to form a text pair to be tested; preprocess the text pair to be tested and input it into the Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, where the first sequence corresponds to the mapping result of one short text of the pair and the second sequence corresponds to the mapping result of the other; input the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector; compute the difference between the first vector and the second vector, and obtain a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector; compute a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence; output a first similarity recognition result based on the feature vector; determine a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and determine a first loss function based on the first accuracy loss and optimize the semantic similarity recognition model using the first loss function.
Correspondingly, in order to adjust the model into a semantic similarity recognition model suitable for the target field, the adjustment module 33 can specifically be used to: adjust the semantic similarity recognition model according to the data volume of the target-field data set and the magnitude of the text similarity; construct positive training samples using the historical data records in the target-field data set; screen negative training samples based on the Jaccard similarity measure; input the positive and negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result; determine a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and determine a second loss function based on the second accuracy loss and optimize the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
In a specific application scenario, in order to adjust the similarity recognition model according to the data volume of the target-field data set and the magnitude of the text similarity, the adjustment module 33 can specifically be used to: if it is determined that the data volume of the target-field data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modify the output classes of the softmax layer in the semantic similarity recognition model; if the data volume is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freeze the initial layers of the semantic similarity recognition model and retrain the remaining layers; if the data volume is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retrain the semantic similarity recognition model with the target-field data set; and if the data volume is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retain the architecture and initial weights of the semantic similarity recognition model and use the initial weights to retrain it.
Correspondingly, in order to screen out negative training samples based on the Jaccard similarity measure, the adjustment module 33 can specifically be used to: randomly select two short text sentences from the target-field data set to construct a sample sentence pair, and compute the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and if the similarity calculation result is greater than a third preset threshold, determine the corresponding sample sentence pair as a negative training sample.
The Jaccard similarity measure is computed as:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

where J(A, B) is the similarity calculation result, A is one short text sentence of the sample sentence pair, and B is the other short text sentence of the sample sentence pair.
In a specific application scenario, in order to determine the semantic similarity recognition result based on the semantic similarity, the determination module 35 can specifically be used to: compare the similarity value with a fourth preset threshold and a fifth preset threshold; if the similarity value is less than the fourth preset threshold, determine that the semantic similarity recognition result is dissimilar; if the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determine that the semantic similarity recognition result is moderately similar; and if the similarity value is greater than or equal to the fifth preset threshold, determine that the semantic similarity recognition result is highly similar.
In a specific application scenario, in order to display the semantic similarity recognition result on a display page, as shown in FIG. 4, the apparatus further includes an output module 36.
The output module 36 is configured to output the similarity recognition result.
It should be noted that, for other corresponding descriptions of the functional units involved in the text semantic similarity analysis apparatus provided in this embodiment, reference may be made to the corresponding descriptions of FIG. 1 and FIG. 2, which are not repeated here.
Based on the methods shown in FIG. 1 and FIG. 2, an embodiment of the present application correspondingly further provides a storage medium on which a computer program is stored; when the program is executed by a processor, the text semantic similarity analysis method shown in FIG. 1 and FIG. 2 is implemented. The storage medium may be non-volatile or volatile. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the implementation scenarios of the present application.
Based on the methods shown in FIG. 1 and FIG. 2 and the virtual apparatus embodiments shown in FIG. 3 and FIG. 4, to achieve the above objectives, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, or the like. The physical device includes a storage medium and a processor; the storage medium stores a computer program, and the processor executes the computer program to implement the text semantic similarity analysis method shown in FIG. 1 and FIG. 2.

Claims (20)

  1. A method for analyzing text semantic similarity, comprising:
    obtaining a general data set and a target domain data set;
    training a semantic similarity recognition model using the general data set as training samples;
    adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples;
    inputting a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    determining a semantic similarity recognition result based on the semantic similarity.
  2. The method according to claim 1, wherein training the semantic similarity recognition model using the general data set as training samples specifically comprises:
    arbitrarily selecting two short texts from the general data set to form a text pair to be tested;
    preprocessing the text pair to be tested and inputting it into an Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, the first sequence corresponding to the mapping result of one short text in the text pair to be tested, and the second sequence corresponding to the mapping result of the other short text in the text pair to be tested;
    inputting the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector;
    calculating the difference between the first vector and the second vector, and obtaining a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector;
    calculating a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence;
    outputting a first similarity recognition result based on the feature vector;
    determining a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and
    determining a first loss function based on the first accuracy loss, and optimizing the semantic similarity recognition model using the first loss function.
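The pipeline recited in claim 2 (Embedding, BiLSTM, soft-alignment weighting, feature vector, similarity output) can be sketched as follows. This is a highly simplified illustration assuming PyTorch; the vocabulary size, dimensions, mean pooling, and the exact attention weighting are assumptions for illustration and are not specified by the claim:

```python
import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    """Sketch: Embedding -> BiLSTM -> cross-attention "weighted" sequences
    -> pooled feature vector -> softmax similarity output."""
    def __init__(self, vocab_size=1000, emb_dim=32, hidden=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(8 * hidden, num_classes)  # 4 pooled 2*hidden vectors

    def forward(self, tokens_a, tokens_b):
        # First and second sequences: embedded token ids of the two short texts.
        seq_a, seq_b = self.embed(tokens_a), self.embed(tokens_b)
        vec_a, _ = self.bilstm(seq_a)   # first vector  (B, La, 2*hidden)
        vec_b, _ = self.bilstm(seq_b)   # second vector (B, Lb, 2*hidden)
        # Cross-attention: weight each sequence by its similarity to the other,
        # giving the "weighted" third and fourth sequences.
        attn_a = torch.softmax(vec_a @ vec_b.transpose(1, 2), dim=-1)
        attn_b = torch.softmax(vec_b @ vec_a.transpose(1, 2), dim=-1)
        weighted_a = attn_a @ vec_b
        weighted_b = attn_b @ vec_a
        # Pool and combine into a single feature vector, then classify.
        feats = torch.cat([vec_a.mean(1), vec_b.mean(1),
                           weighted_a.mean(1), weighted_b.mean(1)], dim=-1)
        return torch.softmax(self.classifier(feats), dim=-1)
```

In training, the softmax output would be compared against the target label to compute the accuracy loss that drives optimization.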
  3. The method according to claim 2, wherein adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples specifically comprises:
    adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity;
    constructing positive training samples from historical data records in the target domain data set;
    screening negative training samples based on the Jaccard similarity measure;
    inputting the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result;
    determining a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and
    determining a second loss function based on the second accuracy loss, and optimizing the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  4. The method according to claim 3, wherein adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity specifically comprises:
    if it is determined that the data volume of the target domain data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output categories of the softmax layer in the semantic similarity recognition model;
    if it is determined that the data volume of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers;
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target domain data set; and
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and retraining the semantic similarity recognition model from the initial weights.
  5. The method according to claim 3, wherein screening negative training samples based on the Jaccard similarity measure specifically comprises:
    randomly selecting two short text sentences from the target domain data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and
    if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
  6. The method according to claim 5, wherein the Jaccard similarity measure is computed as:
    J(A, B) = |A ∩ B| / |A ∪ B|
    where J(A, B) is the similarity calculation result, A is one short text sentence in the sample sentence pair, and B is the other short text sentence in the sample sentence pair.
  7. The method according to claim 6, wherein determining the semantic similarity recognition result based on the semantic similarity specifically comprises:
    comparing the similarity value with a fourth preset threshold and a fifth preset threshold;
    if it is determined that the similarity value is less than the fourth preset threshold, determining that the semantic similarity recognition result is dissimilar;
    if it is determined that the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determining that the semantic similarity recognition result is moderately similar; and
    if it is determined that the similarity value is greater than or equal to the fifth preset threshold, determining that the semantic similarity recognition result is highly similar;
    after determining the semantic similarity recognition result based on the semantic similarity, the method specifically further comprises:
    outputting the similarity recognition result.
  8. An apparatus for analyzing text semantic similarity, comprising:
    an acquisition module, configured to obtain a general data set and a target domain data set;
    a training module, configured to train a semantic similarity recognition model using the general data set as training samples;
    an adjustment module, configured to adjust the semantic similarity recognition model using the target domain data set as transfer learning samples;
    an input module, configured to input a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    a determining module, configured to determine a semantic similarity recognition result based on the semantic similarity.
  9. A non-volatile readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, the following steps are implemented:
    obtaining a general data set and a target domain data set;
    training a semantic similarity recognition model using the general data set as training samples;
    adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples;
    inputting a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    determining a semantic similarity recognition result based on the semantic similarity.
  10. The non-volatile readable storage medium according to claim 9, wherein training the semantic similarity recognition model using the general data set as training samples specifically comprises:
    arbitrarily selecting two short texts from the general data set to form a text pair to be tested;
    preprocessing the text pair to be tested and inputting it into an Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, the first sequence corresponding to the mapping result of one short text in the text pair to be tested, and the second sequence corresponding to the mapping result of the other short text in the text pair to be tested;
    inputting the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector;
    calculating the difference between the first vector and the second vector, and obtaining a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector;
    calculating a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence;
    outputting a first similarity recognition result based on the feature vector;
    determining a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and
    determining a first loss function based on the first accuracy loss, and optimizing the semantic similarity recognition model using the first loss function.
  11. The non-volatile readable storage medium according to claim 10, wherein adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples specifically comprises:
    adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity;
    constructing positive training samples from historical data records in the target domain data set;
    screening negative training samples based on the Jaccard similarity measure;
    inputting the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result;
    determining a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and
    determining a second loss function based on the second accuracy loss, and optimizing the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  12. The non-volatile readable storage medium according to claim 11, wherein adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity specifically comprises:
    if it is determined that the data volume of the target domain data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output categories of the softmax layer in the semantic similarity recognition model;
    if it is determined that the data volume of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers;
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target domain data set; and
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and retraining the semantic similarity recognition model from the initial weights.
  13. The non-volatile readable storage medium according to claim 11, wherein screening negative training samples based on the Jaccard similarity measure specifically comprises:
    randomly selecting two short text sentences from the target domain data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and
    if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
  14. The non-volatile readable storage medium according to claim 13, wherein the Jaccard similarity measure is computed as:
    J(A, B) = |A ∩ B| / |A ∪ B|
    where J(A, B) is the similarity calculation result, A is one short text sentence in the sample sentence pair, and B is the other short text sentence in the sample sentence pair.
  15. A computer device, comprising a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and executable on the processor, wherein when the processor executes the program, the following steps are implemented:
    obtaining a general data set and a target domain data set;
    training a semantic similarity recognition model using the general data set as training samples;
    adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples;
    inputting a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    determining a semantic similarity recognition result based on the semantic similarity.
  16. The computer device according to claim 15, wherein training the semantic similarity recognition model using the general data set as training samples specifically comprises:
    arbitrarily selecting two short texts from the general data set to form a text pair to be tested;
    preprocessing the text pair to be tested and inputting it into an Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, the first sequence corresponding to the mapping result of one short text in the text pair to be tested, and the second sequence corresponding to the mapping result of the other short text in the text pair to be tested;
    inputting the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector;
    calculating the difference between the first vector and the second vector, and obtaining a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector;
    calculating a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence;
    outputting a first similarity recognition result based on the feature vector;
    determining a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and
    determining a first loss function based on the first accuracy loss, and optimizing the semantic similarity recognition model using the first loss function.
  17. The computer device according to claim 16, wherein adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples specifically comprises:
    adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity;
    constructing positive training samples from historical data records in the target domain data set;
    screening negative training samples based on the Jaccard similarity measure;
    inputting the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result;
    determining a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and
    determining a second loss function based on the second accuracy loss, and optimizing the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  18. The computer device according to claim 17, wherein adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity specifically comprises:
    if it is determined that the data volume of the target domain data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output categories of the softmax layer in the semantic similarity recognition model;
    if it is determined that the data volume of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers;
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target domain data set; and
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and retraining the semantic similarity recognition model from the initial weights.
  19. The computer device according to claim 17, wherein screening negative training samples based on the Jaccard similarity measure specifically comprises:
    randomly selecting two short text sentences from the target domain data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and
    if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
  20. The computer device according to claim 19, wherein the Jaccard similarity measure is computed as:
    J(A, B) = |A ∩ B| / |A ∪ B|
    where J(A, B) is the similarity calculation result, A is one short text sentence in the sample sentence pair, and B is the other short text sentence in the sample sentence pair.
PCT/CN2020/087554 2020-02-14 2020-04-28 Text semantic similarity analysis method and apparatus, and computer device WO2021159613A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010092595.3 2020-02-14
CN202010092595.3A CN111368024A (en) 2020-02-14 2020-02-14 Text semantic similarity analysis method and device and computer equipment

Publications (1)

Publication Number Publication Date
WO2021159613A1 true WO2021159613A1 (en) 2021-08-19

Family

ID=71206129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087554 WO2021159613A1 (en) 2020-02-14 2020-04-28 Text semantic similarity analysis method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN111368024A (en)
WO (1) WO2021159613A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779994A (en) * 2021-08-25 2021-12-10 上海浦东发展银行股份有限公司 Element extraction method and device, computer equipment and storage medium
CN114202013A (en) * 2021-11-22 2022-03-18 西北工业大学 Semantic similarity calculation method based on self-adaptive semi-supervision
CN114358210A (en) * 2022-01-14 2022-04-15 平安科技(深圳)有限公司 Text similarity calculation method and device, computer equipment and storage medium
CN114445818A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Article identification method, article identification device, electronic equipment and computer-readable storage medium
CN114595306A (en) * 2022-01-26 2022-06-07 西北大学 Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
CN114648648A (en) * 2022-02-21 2022-06-21 清华大学 Deep introspection amount learning method and device and storage medium
CN114186548B (en) * 2021-12-15 2023-08-15 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and medium based on artificial intelligence
CN116798417A (en) * 2023-07-31 2023-09-22 成都赛力斯科技有限公司 Voice intention recognition method, device, electronic equipment and storage medium
CN116932702A (en) * 2023-09-19 2023-10-24 湖南正宇软件技术开发有限公司 Method, system, device and storage medium for proposal and proposal
CN117112735A (en) * 2023-10-19 2023-11-24 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN112069833B (en) * 2020-09-01 2024-04-30 北京声智科技有限公司 Log analysis method, log analysis device and electronic equipment
CN112241626B (en) * 2020-10-14 2023-07-07 网易(杭州)网络有限公司 Semantic matching and semantic similarity model training method and device
CN112579919B (en) * 2020-12-09 2023-04-21 小红书科技有限公司 Data processing method and device and electronic equipment
CN112863490B (en) * 2021-01-07 2024-04-30 广州欢城文化传媒有限公司 Corpus acquisition method and device
CN113807074A (en) * 2021-03-12 2021-12-17 京东科技控股股份有限公司 Similar statement generation method and device based on pre-training language model
CN113051933B (en) * 2021-05-17 2022-09-06 北京有竹居网络技术有限公司 Model training method, text semantic similarity determination method, device and equipment
CN113705244B (en) * 2021-08-31 2023-08-22 平安科技(深圳)有限公司 Method, device and storage medium for generating countermeasure text sample
CN117113977B (en) * 2023-10-09 2024-04-16 北京信诺软通信息技术有限公司 Method, medium and system for identifying text generated by AI contained in test paper

Citations (4)

Publication number Priority date Publication date Assignee Title
CN107329949A (en) * 2017-05-24 2017-11-07 北京捷通华声科技股份有限公司 A kind of semantic matching method and system
CN108363716A (en) * 2017-12-28 2018-08-03 广州索答信息科技有限公司 Realm information method of generating classification model, sorting technique, equipment and storage medium
CN109766540A (en) * 2018-12-10 2019-05-17 平安科技(深圳)有限公司 Generic text information extracting method, device, computer equipment and storage medium
CN110688452A (en) * 2019-08-23 2020-01-14 重庆兆光科技股份有限公司 Text semantic similarity evaluation method, system, medium and device

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN106844346B (en) * 2017-02-09 2020-08-25 北京红马传媒文化发展有限公司 Short text semantic similarity discrimination method and system based on deep learning model Word2Vec
GB2573998A (en) * 2018-05-17 2019-11-27 Babylon Partners Ltd Device and method for natural language processing
CN109657232A (en) * 2018-11-16 2019-04-19 北京九狐时代智能科技有限公司 A kind of intension recognizing method

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779994A (en) * 2021-08-25 2021-12-10 上海浦东发展银行股份有限公司 Element extraction method and device, computer equipment and storage medium
CN113779994B (en) * 2021-08-25 2024-01-23 上海浦东发展银行股份有限公司 Element extraction method, element extraction device, computer equipment and storage medium
CN114202013A (en) * 2021-11-22 2022-03-18 西北工业大学 Semantic similarity calculation method based on adaptive semi-supervision
CN114202013B (en) * 2021-11-22 2024-04-12 西北工业大学 Semantic similarity calculation method based on adaptive semi-supervision
CN114186548B (en) * 2021-12-15 2023-08-15 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and medium based on artificial intelligence
CN114358210A (en) * 2022-01-14 2022-04-15 平安科技(深圳)有限公司 Text similarity calculation method and device, computer equipment and storage medium
CN114595306A (en) * 2022-01-26 2022-06-07 西北大学 Text similarity calculation system and method based on a distance-aware self-attention mechanism and multi-angle modeling
CN114595306B (en) * 2022-01-26 2024-04-12 西北大学 Text similarity calculation system and method based on a distance-aware self-attention mechanism and multi-angle modeling
CN114445818B (en) * 2022-01-29 2023-08-01 北京百度网讯科技有限公司 Article identification method, apparatus, electronic device, and computer-readable storage medium
CN114445818A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Article identification method, article identification device, electronic equipment and computer-readable storage medium
CN114648648A (en) * 2022-02-21 2022-06-21 清华大学 Deep metric learning method and device, and storage medium
CN116798417A (en) * 2023-07-31 2023-09-22 成都赛力斯科技有限公司 Voice intention recognition method, device, electronic equipment and storage medium
CN116798417B (en) * 2023-07-31 2023-11-10 成都赛力斯科技有限公司 Voice intention recognition method, device, electronic equipment and storage medium
CN116932702A (en) * 2023-09-19 2023-10-24 湖南正宇软件技术开发有限公司 Proposal merging method, system, device and storage medium
CN117112735A (en) * 2023-10-19 2023-11-24 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment
CN117112735B (en) * 2023-10-19 2024-02-13 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment

Also Published As

Publication number Publication date
CN111368024A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
WO2021159613A1 (en) Text semantic similarity analysis method and apparatus, and computer device
US10586155B2 (en) Clarification of submitted questions in a question and answer system
US11093560B2 (en) Stacked cross-modal matching
CN107526799B (en) Knowledge graph construction method based on deep learning
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
WO2021253904A1 (en) Test case set generation method, apparatus and device, and computer readable storage medium
CN107944559B (en) Method and system for automatically identifying entity relationship
WO2018086470A1 (en) Keyword extraction method and device, and server
JP7153004B2 (en) Community Q&A data verification method, apparatus, computer device, and storage medium
WO2020087774A1 (en) Concept-tree-based intention recognition method and apparatus, and computer device
US20150339290A1 (en) Context Based Synonym Filtering for Natural Language Processing Systems
US9535980B2 (en) NLP duration and duration range comparison methodology using similarity weighting
US20150178321A1 (en) Image-based 3d model search and retrieval
US11544470B2 (en) Efficient determination of user intent for natural language expressions based on machine learning
WO2022188773A1 (en) Text classification method and apparatus, device, computer-readable storage medium, and computer program product
CN107844533A (en) Intelligent question answering system and analysis method
WO2019232893A1 (en) Method and device for text emotion analysis, computer apparatus and storage medium
CN112509690B (en) Method, apparatus, device and storage medium for quality control
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN113094578A (en) Deep learning-based content recommendation method, device, equipment and storage medium
WO2021169423A1 (en) Quality test method, apparatus and device for customer service recording, and storage medium
WO2023207096A1 (en) Entity linking method and apparatus, device, and nonvolatile readable storage medium
CN116992007B (en) Restricted question answering system based on question intent understanding

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20918257

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN EP: Public notification in the EP Bulletin, as the address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 110123)

122 EP: PCT application non-entry into the European phase

Ref document number: 20918257

Country of ref document: EP

Kind code of ref document: A1