CN111259126A - Method, Apparatus, Equipment and Storage Medium for Similarity Calculation Based on Word Features - Google Patents


Info

Publication number
CN111259126A
Authority
CN
China
Prior art keywords
question text
text
similarity
word
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010042471.4A
Other languages
Chinese (zh)
Other versions
CN111259126B (en
Inventor
金培根
刘志慧
陆林炳
何斐斐
林加新
李炫�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010042471.4A
Publication of CN111259126A
Application granted
Publication of CN111259126B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0281 Customer communication at a business location, e.g. providing product or service information, consulting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Technology Law (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a word-feature-based similarity calculation method, apparatus, device and storage medium for improving the accuracy of text similarity values in specific business scenarios. The method includes: obtaining an original question text; determining a target application scenario from the original question text and preset application scenarios, and obtaining the target word segmentation standard corresponding to the target application scenario together with a plurality of semantically similar question texts; selecting any one of the similar question texts as a candidate question text, and extracting the word features of the original question text and of the candidate question text according to the target word segmentation standard; obtaining a forward text similarity and a reverse text similarity; generating a similarity matching score; and determining the candidate similarity value with the largest value as the target similarity value and selecting the candidate question text corresponding to the target similarity value as the standard question text.

Figure 202010042471

Description

Method, Apparatus, Equipment and Storage Medium for Similarity Calculation Based on Word Features

Technical Field

The present invention relates to the technical field of similarity matching, and in particular to a word-feature-based similarity calculation method, apparatus, device and storage medium.

Background

Traditional customer service and training systems often require a large investment of manpower and resources to respond to business requests, and they place high demands on the professionalism and proficiency of staff. Operating costs remain high and staff retention is low, so the need for intelligent question answering systems during intelligent transformation is urgent.

At present, the mainstream way of building an intelligent question answering system in the industry is retrieval-based: the questions most similar to the user's question are recalled from the system's knowledge base, and text similarity calculation is the core module of this retrieval-based recall. Existing text similarity calculation methods mainly comprise deep learning models and single-level literal matching (for example keyword matching, edit distance, and Jaccard similarity). Semantic representation methods based on deep learning models need large data samples, iterate slowly when new question-bank corpora are added, and are not easy to extend. Single-level literal matching methods in effect give every token (word or character) the same weight, so they cannot reflect how much words at different levels contribute in a specific business scenario, which degrades the measurement of text similarity in that scenario.
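To make the equal-weight limitation concrete, the following is a minimal sketch of single-level Jaccard matching over tokens. The function name `token_jaccard` and the example questions are illustrative assumptions and do not come from the patent.

```python
from typing import Iterable


def token_jaccard(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Single-level Jaccard coefficient |A ∩ B| / |A ∪ B| over token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs (assumption)
    return len(set_a & set_b) / len(union)


# The two questions differ only in the product word ("car" vs "life"),
# yet that word counts exactly as much as "how" or "to".
q1 = ["how", "to", "claim", "car", "insurance"]
q2 = ["how", "to", "claim", "life", "insurance"]
similarity = token_jaccard(q1, q2)
```

Here four of six distinct tokens are shared, so the two questions score as highly similar even though they concern different products. That is the behaviour the passage above criticizes.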

Summary of the Invention

The present invention provides a word-feature-based similarity calculation method, apparatus, device and storage medium, which reflect the types of words that need to be matched preferentially in a specific business scenario, capture the semantic inclusion relationship between texts, and improve the accuracy of text similarity values in that scenario.

A first aspect of the embodiments of the present invention provides a word-feature-based similarity calculation method, including: obtaining an original question text, the original question text being used to instruct a search for the answer corresponding to the original question text; determining a target application scenario according to the original question text and preset application scenarios, and obtaining the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, the preset application scenarios comprising a plurality of preset candidate scenarios; selecting any one of the plurality of semantically similar question texts as a candidate question text, and extracting the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard; calculating separately according to the word features of the original question text and the word features of the candidate question text to obtain a forward text similarity and a reverse text similarity; performing feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text; and comparing the candidate similarity values corresponding to the plurality of candidate question texts, determining the candidate similarity value with the largest value as the target similarity value, and selecting the candidate question text corresponding to the target similarity value as the standard question text.
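The final comparison step of the first aspect can be sketched as follows. This is only an illustration: `select_standard_question` and the toy `shared_word_count` scorer are hypothetical names, and the scorer stands in for the fused similarity matching score.

```python
from typing import Callable, Sequence, Tuple


def select_standard_question(
    original: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Score every candidate against the original question and keep the maximum."""
    if not candidates:
        raise ValueError("at least one candidate question text is required")
    scored = [(candidate, score(original, candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


def shared_word_count(a: str, b: str) -> float:
    """Toy stand-in scorer (assumption): number of shared lowercase words."""
    return float(len(set(a.lower().split()) & set(b.lower().split())))


best_text, best_score = select_standard_question(
    "how to claim insurance",
    ["how to buy insurance", "how to claim insurance payout", "what is a premium"],
    shared_word_count,
)
```

Passing the scorer as a parameter keeps the selection step independent of how the forward and reverse similarities are computed and fused.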

Optionally, in a first implementation of the first aspect of the embodiments of the present invention, determining the target application scenario according to the original question text and the preset application scenarios, and obtaining the target word segmentation standard corresponding to the target application scenario and the plurality of semantically similar question texts, includes: selecting, according to the original question text, any one of the preset application scenarios as the target application scenario, the preset application scenarios comprising a plurality of preset application scenarios; obtaining the target word segmentation standard corresponding to the target application scenario; and searching, within the target application scenario, for similar question texts that are semantically similar to the original question text.
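One way to picture the link between a scenario, its segmentation standard and its question bank is the sketch below. The patent does not say how the scenario is chosen from the question, so the keyword-hit heuristic in `pick_scenario`, the `Scenario` class and all example data are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Scenario:
    name: str
    custom_words: Set[str]      # stand-in for the scenario's segmentation standard
    question_bank: List[str]    # questions searched for similar texts


def pick_scenario(question: str, scenarios: List[Scenario]) -> Scenario:
    """Pick the preset scenario whose domain words occur most often in the question.

    The hit-count heuristic is an assumption, not the patent's method.
    """
    if not scenarios:
        raise ValueError("at least one preset scenario is required")
    text = question.lower()
    return max(scenarios, key=lambda s: sum(word in text for word in s.custom_words))


insurance = Scenario(
    "insurance", {"claim", "premium"},
    ["how to file a claim", "when is the premium due"],
)
training = Scenario("training", {"exam", "course"}, ["how to register for a course"])
target = pick_scenario("How do I file a claim?", [insurance, training])
```

Once the scenario is fixed, both the segmentation standard (`target.custom_words` here) and the pool of similar questions (`target.question_bank`) come from the same record.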

Optionally, in a second implementation of the first aspect of the embodiments of the present invention, selecting any one of the plurality of semantically similar question texts as the candidate question text, and extracting the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard, includes: selecting any one of the plurality of semantically similar question texts as the candidate question text; and performing word segmentation and named entity recognition on the original question text and the candidate question text respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text.

Optionally, in a third implementation of the first aspect of the embodiments of the present invention, performing word segmentation and named entity recognition on the original question text and the candidate question text respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text, includes: segmenting the original question text based on the target word segmentation standard to obtain the word segmentation result of the original question text; segmenting the candidate question text based on the target word segmentation standard to obtain the word segmentation result of the candidate question text; and performing named entity recognition on the word segmentation result of the original question text and on the word segmentation result of the candidate question text respectively, to obtain the word features of the original question text and the word features of the candidate question text, where the word features of the original question text include the labeled original words and their corresponding parts of speech, and the word features of the candidate question text include the labeled candidate words and their corresponding parts of speech.

Optionally, in a fourth implementation of the first aspect of the embodiments of the present invention, performing word segmentation and named entity recognition on the original question text and the candidate question text respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text, includes: segmenting the original question text based on the target word segmentation standard to obtain the word segmentation result of the original question text; obtaining a preset word segmentation result of the candidate question text, the preset word segmentation result being the result of segmenting the candidate question text offline in advance according to the target word segmentation standard; and performing named entity recognition on the word segmentation result of the original question text and on the preset word segmentation result of the candidate question text respectively, to obtain the word features of the original question text and the word features of the candidate question text, where the word features of the original question text include the labeled original words and their corresponding parts of speech, and the word features of the candidate question text include the labeled candidate words and their corresponding parts of speech.
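The offline pre-segmentation of the fourth implementation can be sketched as a lookup table built ahead of time, so that only the original question is segmented at query time. The helper names and the whitespace segmenter are placeholders (assumptions); a real system would plug in a segmenter configured with the target word segmentation standard.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Segmenter = Callable[[str], List[str]]


def build_offline_index(candidates: Iterable[str], segment: Segmenter) -> Dict[str, List[str]]:
    """Segment every candidate question once, ahead of time."""
    return {text: segment(text) for text in candidates}


def tokens_for_pair(
    original: str,
    candidate: str,
    index: Dict[str, List[str]],
    segment: Segmenter,
) -> Tuple[List[str], List[str]]:
    """Segment only the original question online; reuse the stored candidate tokens."""
    return segment(original), index[candidate]


calls: List[str] = []  # records which texts were segmented, for illustration only


def whitespace_segment(text: str) -> List[str]:
    """Placeholder segmenter (assumption): lowercase and split on whitespace."""
    calls.append(text)
    return text.lower().split()


bank = ["how to file a claim", "when is the premium due"]
index = build_offline_index(bank, whitespace_segment)  # offline phase
calls.clear()
orig_tokens, cand_tokens = tokens_for_pair("File a claim", bank[0], index, whitespace_segment)
```

After the offline phase, the online call touches the segmenter once per user question regardless of how many candidates are compared.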

Optionally, in a fifth implementation of the first aspect of the embodiments of the present invention, calculating separately according to the word features of the original question text and the word features of the candidate question text to obtain the forward text similarity and the reverse text similarity includes:

determining the original question text as the reference question text and the candidate question text as the matching question text, and calculating the forward text similarity based on a preset matching formula, the preset matching formula being

Figure BDA0002368232150000031

where A denotes the reference question text, B denotes the matching question text, L_A denotes the number of word tokens in reference question text A, w_{A,i} denotes the normalized weight of the tokens of all levels in reference question text A, token_{A,i} denotes the token at the corresponding index in the reference question text, token_{B,j} denotes the token at the corresponding index in the matching question text, and jaccard denotes the similarity coefficient of two tokens,

Figure BDA0002368232150000032

and determining the candidate question text as the reference question text and the original question text as the matching question text, and calculating the reverse text similarity based on the preset matching formula.
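The two formula images did not survive extraction, so their exact form is unknown. The sketch below assumes one form that is consistent with the symbol definitions: each reference token is matched to its best-scoring token in the matching text, and the best scores are summed with the normalized weights, i.e. score(A, B) = Σ_i w_{A,i} · max_j jaccard(token_{A,i}, token_{B,j}), with a character-set Jaccard coefficient between tokens. The level names, the values in `LEVEL_WEIGHTS` and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, level); levels are hypothetical labels

# Assumed raw weights per level; they are normalized per text below.
LEVEL_WEIGHTS: Dict[str, float] = {"entity": 3.0, "other": 1.0}


def jaccard(a: str, b: str) -> float:
    """Character-set Jaccard coefficient between two tokens (assumed form)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def directional_similarity(reference: List[Token], matching: List[Token]) -> float:
    """Weighted sum over reference tokens of their best match in the other text."""
    if not reference or not matching:
        return 0.0
    raw = [LEVEL_WEIGHTS.get(level, 1.0) for _, level in reference]
    total = sum(raw)
    score = 0.0
    for (word, _), weight in zip(reference, raw):
        best = max(jaccard(word, other) for other, _ in matching)
        score += (weight / total) * best
    return score


original = [("claim", "other"), ("accident", "entity")]
candidate = [("claim", "other"), ("accident", "entity"), ("deadline", "other")]
forward = directional_similarity(original, candidate)  # original as reference
reverse = directional_similarity(candidate, original)  # candidate as reference
```

Under these assumptions the score is asymmetric. Every token of the shorter original question is found in the candidate, so the forward score is maximal, while the candidate's extra token lowers the reverse score. This is how the pair of directional scores can express an inclusion relationship between the two texts.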

Optionally, in a sixth implementation of the first aspect of the embodiments of the present invention, performing feature fusion on the forward text similarity and the reverse text similarity to generate the similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text, includes: fusing the forward text similarity and the reverse text similarity through a preset formula, the preset formula being score = w1 * score(forward) + w2 * score(reverse) + b, where b is a constant and w1 and w2 are weight constants; and calculating the similarity matching score, denoted score, which indicates the degree of similarity between the original question text and the candidate question text.
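The fusion formula is a plain linear combination and can be written directly. The default values for `w1`, `w2` and `b` below are placeholders chosen for illustration; the source does not give concrete constants.

```python
def fuse_similarity(
    forward: float,
    reverse: float,
    w1: float = 0.5,  # placeholder weight for the forward score (assumption)
    w2: float = 0.5,  # placeholder weight for the reverse score (assumption)
    b: float = 0.0,   # placeholder constant term (assumption)
) -> float:
    """score = w1 * score(forward) + w2 * score(reverse) + b"""
    return w1 * forward + w2 * reverse + b


balanced = fuse_similarity(1.0, 0.5)
forward_heavy = fuse_similarity(1.0, 0.5, w1=0.6, w2=0.4, b=0.1)
```

Raising `w1` relative to `w2` favours candidates that cover the user's wording, while raising `w2` favours candidates whose own wording is covered by the user's question.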

A second aspect of the embodiments of the present invention provides a word-feature-based similarity calculation apparatus, including: an obtaining unit, configured to obtain an original question text, the original question text being used to instruct a search for the answer corresponding to the original question text; a determining unit, configured to determine a target application scenario according to the original question text and preset application scenarios, and to obtain the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, the preset application scenarios comprising a plurality of preset candidate scenarios; a selection and extraction unit, configured to select any one of the plurality of semantically similar question texts as a candidate question text, and to extract the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard; a calculating unit, configured to calculate separately according to the word features of the original question text and the word features of the candidate question text to obtain a forward text similarity and a reverse text similarity; a generating unit, configured to perform feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text; and a comparison and selection unit, configured to compare the candidate similarity values corresponding to the plurality of candidate question texts, determine the candidate similarity value with the largest value as the target similarity value, and select the candidate question text corresponding to the target similarity value as the standard question text.

Optionally, in a first implementation of the second aspect of the embodiments of the present invention, the determining unit is specifically configured to: select, according to the original question text, any one of the preset application scenarios as the target application scenario, the preset application scenarios comprising a plurality of preset application scenarios; obtain the target word segmentation standard corresponding to the target application scenario; and search, within the target application scenario, for similar question texts that are semantically similar to the original question text.

Optionally, in a second implementation of the second aspect of the embodiments of the present invention, the selection and extraction unit includes: a selection module, configured to select any one of the plurality of semantically similar question texts as the candidate question text; and a word segmentation and recognition module, configured to perform word segmentation and named entity recognition on the original question text and the candidate question text respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text.

Optionally, in a third implementation of the second aspect of the embodiments of the present invention, the word segmentation and recognition module is specifically configured to: segment the original question text based on the target word segmentation standard to obtain the word segmentation result of the original question text; segment the candidate question text based on the target word segmentation standard to obtain the word segmentation result of the candidate question text; and perform named entity recognition on the word segmentation result of the original question text and on the word segmentation result of the candidate question text respectively, to obtain the word features of the original question text and the word features of the candidate question text, where the word features of the original question text include the labeled original words and their corresponding parts of speech, and the word features of the candidate question text include the labeled candidate words and their corresponding parts of speech.

Optionally, in a fourth implementation of the second aspect of the embodiments of the present invention, the word segmentation and recognition module is further specifically configured to: segment the original question text based on the target word segmentation standard to obtain the word segmentation result of the original question text; obtain a preset word segmentation result of the candidate question text, the preset word segmentation result being the result of segmenting the candidate question text offline in advance according to the target word segmentation standard; and perform named entity recognition on the word segmentation result of the original question text and on the preset word segmentation result of the candidate question text respectively, to obtain the word features of the original question text and the word features of the candidate question text, where the word features of the original question text include the labeled original words and their corresponding parts of speech, and the word features of the candidate question text include the labeled candidate words and their corresponding parts of speech.

Optionally, in a fifth implementation of the second aspect of the embodiments of the present invention, the calculating unit is specifically configured to: determine the original question text as the reference question text and the candidate question text as the matching question text, and calculate the forward text similarity based on a preset matching formula, the preset matching formula being

Figure BDA0002368232150000051

where A denotes the reference question text, B denotes the matching question text, L_A denotes the number of word tokens in reference question text A, w_{A,i} denotes the normalized weight of the tokens of all levels in reference question text A, token_{A,i} denotes the token at the corresponding index in the reference question text, token_{B,j} denotes the token at the corresponding index in the matching question text, and jaccard denotes the similarity coefficient of two tokens,

Figure BDA0002368232150000052

and determine the candidate question text as the reference question text and the original question text as the matching question text, and calculate the reverse text similarity based on the preset matching formula.

Optionally, in a sixth implementation of the second aspect of the embodiments of the present invention, the generating unit is specifically configured to: fuse the forward text similarity and the reverse text similarity through a preset formula, the preset formula being score = w1 * score(forward) + w2 * score(reverse) + b, where b is a constant and w1 and w2 are weight constants; and calculate the similarity matching score, denoted score, which indicates the degree of similarity between the original question text and the candidate question text.

A third aspect of the embodiments of the present invention provides a word-feature-based similarity calculation device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the word-feature-based similarity calculation method described in any of the above implementations.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the word-feature-based similarity calculation method described in any of the above implementations.

In the technical solution provided by the embodiments of the present invention, an original question text is obtained, the original question text being used to instruct a search for the answer corresponding to the original question text; a target application scenario is determined according to the original question text and preset application scenarios, and the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts are obtained, the preset application scenarios comprising a plurality of preset candidate scenarios; any one of the plurality of semantically similar question texts is selected as a candidate question text, and the word features of the original question text and the word features of the candidate question text are extracted according to the target word segmentation standard; calculations are performed separately according to the word features of the original question text and the word features of the candidate question text to obtain a forward text similarity and a reverse text similarity; feature fusion is performed on the forward text similarity and the reverse text similarity to generate a similarity matching score, which indicates the degree of similarity between the original question text and the candidate question text; and the candidate similarity values corresponding to the plurality of candidate question texts are compared, the candidate similarity value with the largest value is determined as the target similarity value, and the candidate question text corresponding to the target similarity value is selected as the standard question text. The embodiments of the present invention use forward matching based on the user's question and reverse matching based on the standard question to calculate the forward text similarity and the reverse text similarity respectively, and fuse the two to obtain the final similarity value. This reflects the types of words that need to be matched preferentially in a specific business scenario, captures the semantic inclusion relationship between texts, and improves the accuracy of text similarity values in that scenario.

Description of the Drawings

Fig. 1 is a schematic diagram of one embodiment of the word-feature-based similarity calculation method in the embodiments of the present invention;

Fig. 2 is a schematic diagram of another embodiment of the word-feature-based similarity calculation method in the embodiments of the present invention;

Fig. 3 is a schematic diagram of one embodiment of the word-feature-based similarity calculation apparatus in the embodiments of the present invention;

Fig. 4 is a schematic diagram of another embodiment of the word-feature-based similarity calculation apparatus in the embodiments of the present invention;

Fig. 5 is a schematic diagram of one embodiment of the word-feature-based similarity calculation equipment in the embodiments of the present invention.

Detailed Description

The present invention provides a word-feature-based similarity calculation method, apparatus, equipment and storage medium, which reflect the word types that need to be matched preferentially in a specific business scenario, capture the semantic inclusion relationship between texts, and improve the accuracy of text similarity calculation in that scenario.

In order for those skilled in the art to better understand the solutions of the present invention, the embodiments of the present invention are described below with reference to the accompanying drawings.

The terms "first", "second", "third", "fourth" and the like (if any) in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion: for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

Referring to Fig. 1, a flowchart of the word-feature-based similarity calculation method provided by an embodiment of the present invention specifically includes:

101. Obtain an original question text, the original question text being used to indicate a search for the answer corresponding to the original question text.

The server obtains the original question text, which is used to indicate a search for the answer corresponding to the original question text.

It should be noted that one original question text may have several corresponding answers. For example, if the original question text is "What items does the physical examination include?", the corresponding answer may include one or more of the following: "otolaryngology (ENT)", "routine blood test", "five-item hepatitis B test", "routine urine test", "electrocardiogram" or "B-mode ultrasound", which is not specifically limited here.

It can be understood that the execution subject of the present invention may be a word-feature-based similarity calculation apparatus, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present invention are described with the server as the execution subject.

102. Determine a target application scenario according to the original question text and the preset application scenarios, and obtain the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, the preset application scenarios comprising a plurality of preset candidate scenarios.

Specifically: (1) according to the original question text, one of the preset application scenarios is selected as the target application scenario, the preset application scenarios comprising a plurality of application scenarios set in advance;

(2) the target word segmentation standard corresponding to the target application scenario is obtained;

Different business scenarios use different word segmentation standards. The words in a business scenario can be divided into several levels, and a word weight is set for each level. For example, in the practical application of the Ping An Life intelligent question answering system, the word segmentation standard has six categories: insurance product entities, disease name entities, location name entities, occupation name entities, service operation keywords (for example "投保" (apply for insurance) and "理赔" (claim settlement)), and other words. Words in different categories carry different weights.

(3) similar question texts that are semantically similar to the original question text are searched for in the target application scenario.

The target application scenario contains multiple similar question texts. For example, a life insurance intelligent question answering system may include questions such as "平安福投保" ("Ping An Fu insurance application"), "平安福怎么投保" ("how to apply for Ping An Fu") and "平安福怎么缴费" ("how to pay for Ping An Fu"). If the original question text is "我要投保平安福" ("I want to apply for Ping An Fu"), then "平安福投保" and "平安福怎么投保" can both be determined as similar standard question texts.

It should be noted that the same question may be expressed in different ways. For example, if the original question text is "岗前培训有哪些重点" ("What are the key points of pre-job training"), a semantically similar question text may be "岗前培训的重点内容" ("key content of pre-job training") or "岗前培训的重点包括哪些?" ("What do the key points of pre-job training include?"). As another example, if the original question text is "肠胃炎投保" ("insurance application with gastroenteritis"), semantically similar question texts are "肠胃炎投保能投保吗" ("can one apply for insurance with gastroenteritis"), "急性肠胃炎投保" ("insurance application with acute gastroenteritis") or "急性肠胃炎可以买保险吗?" ("Can one buy insurance with acute gastroenteritis?").
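Step 102 can be pictured with a small sketch. The description does not say how the scenario is chosen or how semantically similar questions are retrieved, so everything below is an assumption made for illustration: the `Scenario` record, the `SCENARIOS` table, routing by trigger keywords, the second scenario, the weights 3/2/1, and the character-overlap shortlist that stands in for real semantic recall.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """A preset application scenario (hypothetical structure)."""
    name: str
    trigger_keywords: list[str]
    category_weights: dict[str, int]  # stands in for the word segmentation standard
    question_bank: list[str] = field(default_factory=list)


SCENARIOS = [
    Scenario(
        name="life_insurance_qa",
        trigger_keywords=["投保", "理赔", "平安福"],
        # Six categories from the description; the weight values are assumed.
        category_weights={
            "insurance_product": 3, "disease": 3, "location": 3,
            "occupation": 3, "service_keyword": 2, "other": 1,
        },
        question_bank=["平安福投保", "平安福怎么投保", "平安福怎么缴费"],
    ),
    Scenario(
        name="staff_training_qa",
        trigger_keywords=["培训"],
        category_weights={"service_keyword": 2, "other": 1},
        question_bank=["岗前培训的重点内容", "岗前培训的重点包括哪些?"],
    ),
]


def select_scenario(question, scenarios=SCENARIOS):
    """Pick the scenario with the most trigger keywords present (assumed rule)."""
    return max(scenarios, key=lambda s: sum(k in question for k in s.trigger_keywords))


def similar_questions(question, scenario, min_shared_chars=4):
    """Shortlist bank questions that share enough characters with the question."""
    chars = set(question)
    return [q for q in scenario.question_bank if len(chars & set(q)) >= min_shared_chars]


scenario = select_scenario("我要投保平安福")
candidates = similar_questions("我要投保平安福", scenario)
```

With these assumed settings the question "我要投保平安福" is routed to the life insurance scenario, and the shortlist keeps "平安福投保" and "平安福怎么投保" while dropping "平安福怎么缴费", which mirrors the example in the text.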

103. Select any one of the semantically similar question texts as the candidate question text, and extract the word features of the original question text and of the candidate question text according to the target word segmentation standard.

Specifically, the server selects any one of the semantically similar question texts as the candidate question text; the server then performs word segmentation and named entity recognition on the original question text and on the candidate question text based on the target word segmentation standard, obtaining the word features of the original question text and of the candidate question text.

This may include the following: the server segments the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text; the server segments the candidate question text based on the target word segmentation standard to obtain the segmentation result of the candidate question text; the server performs named entity recognition on both segmentation results to obtain the word features of the original question text and of the candidate question text. The word features of the original question text comprise the labeled original words and their parts of speech, and the word features of the candidate question text comprise the labeled candidate words and their parts of speech.

For example, for the original question text "肠胃炎投保" ("insurance application with gastroenteritis"), "肠胃炎投保能投保吗" ("can one apply for insurance with gastroenteritis") is selected as the candidate question text. It should be noted that entity recognition, keyword recognition and the like may use an NER model, a dictionary or other means, which is not specifically limited here.

Alternatively, it may include the following: the server segments the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text; the server obtains a preset segmentation result of the candidate question text, the preset segmentation result being produced by segmenting the candidate question text offline in advance according to the target word segmentation standard; the server performs named entity recognition on the segmentation result of the original question text and on the preset segmentation result of the candidate question text to obtain the word features of the original question text and of the candidate question text. The word features of the original question text comprise the labeled original words and their parts of speech, and the word features of the candidate question text comprise the labeled candidate words and their parts of speech.

For example, the original question text is "我要投保平安福" ("I want to apply for Ping An Fu") and the candidate question text is "平安福投保" ("Ping An Fu insurance application"). Both texts are preprocessed; in practice, some words may be treated as stop words and removed. After preprocessing, a preset entity recognition model or keyword recognition model is used to obtain the word features of the original question text and of the candidate question text at each level. For example, the word features of the original question text are: (我, other word) (要, other word) (投保, keyword) (平安福, insurance entity word). The word features of the candidate question text are: (平安福, insurance entity word) (投保, keyword).
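The text allows either an NER model or a dictionary for tagging. The sketch below takes the dictionary route with forward maximum matching, purely as an illustration: the `LEXICON` contents, the category labels and the function name `extract_word_features` are assumptions, and stop-word removal is left out.

```python
# Hypothetical scenario dictionary: word -> category of the segmentation standard.
LEXICON = {
    "平安福": "insurance_product",
    "胃癌": "disease",
    "肠胃炎": "disease",
    "投保": "service_keyword",
    "理赔": "service_keyword",
}


def extract_word_features(text, lexicon=LEXICON):
    """Segment text by forward maximum matching and tag each token's category.

    Characters that start no dictionary word become single-character
    tokens in the "other" category.
    """
    max_len = max(len(word) for word in lexicon)
    features = []
    i = 0
    while i < len(text):
        # Try the longest possible dictionary word first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                features.append((word, lexicon[word]))
                i += size
                break
        else:
            # No dictionary word starts here.
            features.append((text[i], "other"))
            i += 1
    return features


original_features = extract_word_features("我要投保平安福")
candidate_features = extract_word_features("平安福投保")
```

Under these assumptions the two calls reproduce the word features listed in the example: four tagged tokens for the original question and two for the candidate.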

104. Calculate a forward text similarity and a reverse text similarity from the word features of the original question text and the word features of the candidate question text.

The server takes the original question text as the base question text and the candidate question text as the matching question text, and calculates the forward text similarity using a preset matching formula:

$$\mathrm{Score}(A \rightarrow B) = \sum_{i=1}^{L_A} w_{A,i} \cdot \max_{j} \, \mathrm{jaccard}(token_{A,i}, token_{B,j})$$

where $A$ denotes the base question text, $B$ denotes the matching question text, $L_A$ is the number of word tokens in the base question text $A$, $w_{A,i}$ is the weight of the $i$-th token after normalization over the tokens of all levels in $A$, $token_{A,i}$ and $token_{B,j}$ are the tokens at the corresponding indices of the base question text and the matching question text, and jaccard is the similarity coefficient of two tokens:

$$\mathrm{jaccard}(token_{A,i}, token_{B,j}) = \frac{|token_{A,i} \cap token_{B,j}|}{|token_{A,i} \cup token_{B,j}|}$$

The server then takes the candidate question text as the base question text and the original question text as the matching question text, and calculates the reverse text similarity using the same preset matching formula.

For example, the original question text is "我要投保平安福" ("I want to apply for Ping An Fu") and the candidate question text is "平安福投保" ("Ping An Fu insurance application"). The word features of the original question text are: (我, other word) (要, other word) (投保, keyword) (平安福, insurance entity word); the word features of the candidate question text are: (平安福, insurance entity word) (投保, keyword). Assume that entity words have a weight of 3, keywords a weight of 2 and other words a weight of 1. For the original question text A "我要投保平安福" and the candidate question text B "平安福投保", the forward matching result is:

Score(forward) = 1/(1+1+2+3)*max(jaccard(我, 平安福), jaccard(我, 投保)) + 1/(1+1+2+3)*max(jaccard(要, 平安福), jaccard(要, 投保)) + 2/(1+1+2+3)*max(jaccard(投保, 平安福), jaccard(投保, 投保)) + 3/(1+1+2+3)*max(jaccard(平安福, 平安福), jaccard(平安福, 投保)) = 1/7*0 + 1/7*0 + 2/7*1 + 3/7*1 = 5/7.

The reverse matching result is:

Score(reverse) = 3/(3+2)*max(jaccard(平安福, 我), jaccard(平安福, 要), jaccard(平安福, 投保), jaccard(平安福, 平安福)) + 2/(3+2)*max(jaccard(投保, 我), jaccard(投保, 要), jaccard(投保, 投保), jaccard(投保, 平安福)) = 3/5 + 2/5 = 1.
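The worked example can be checked with a short sketch of the directional score. Two points are assumptions rather than statements from the text: each token is compared as a set of characters when computing the Jaccard coefficient, and the category names `entity`, `keyword` and `other` are illustrative. With weights 3, 2 and 1 the normalizing sum for the original question is 1+1+2+3 = 7, so the sketch yields 5/7 for the forward score and 1 for the reverse score.

```python
from fractions import Fraction

# Category weights taken from the example: entity 3, keyword 2, other 1.
CATEGORY_WEIGHTS = {"entity": 3, "keyword": 2, "other": 1}


def jaccard(token_a, token_b):
    """Jaccard coefficient of two tokens, treated as character sets (assumption)."""
    set_a, set_b = set(token_a), set(token_b)
    union = set_a | set_b
    return Fraction(len(set_a & set_b), len(union)) if union else Fraction(0)


def directional_score(base, match, weights=CATEGORY_WEIGHTS):
    """Weighted sum, over base tokens, of the best Jaccard match in `match`.

    `base` and `match` are lists of (token, category) pairs; the weights of
    the base tokens are normalized to sum to 1.
    """
    if not base or not match:
        return Fraction(0)
    total = sum(weights[category] for _, category in base)
    return sum(
        Fraction(weights[category], total) * max(jaccard(token, other) for other, _ in match)
        for token, category in base
    )


original = [("我", "other"), ("要", "other"), ("投保", "keyword"), ("平安福", "entity")]
candidate = [("平安福", "entity"), ("投保", "keyword")]

forward = directional_score(original, candidate)  # original question as base
reverse = directional_score(candidate, original)  # candidate question as base
print(forward, reverse)  # prints: 5/7 1
```

Using `Fraction` keeps the result exact, which makes it easy to compare against a hand calculation; a production system would more likely use floats.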

105. Perform feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text.

The server performs feature fusion on the forward text similarity and the reverse text similarity to generate the similarity matching score. Specifically, the server fuses the two through a preset formula: score = w1*score(forward) + w2*score(reverse) + b, where b is a constant and w1 and w2 are weight constants. The server calculates the similarity matching score, which indicates the degree of similarity between the original question text and the candidate question text.

For example, the original question text (the user question) is "胃癌能否投保平安福" ("can one apply for Ping An Fu with stomach cancer"). In the question bank of the corresponding business scenario, candidate question text 1 is "胃癌可以投保平安福吗" ("is it possible to apply for Ping An Fu with stomach cancer") and candidate question text 2 is "胃癌可以投保平安福和爱满分吗" ("is it possible to apply for Ping An Fu and Ai Man Fen with stomach cancer"). Assume the feature fusion weights are w1 = 0.6, w2 = 0.4 and b = 0. The similarity matching score of candidate question text 1 is then score1 = 0.6*(8/9) + 0.4*(8/10) = 0.853, and that of candidate question text 2 is score2 = 0.6*(8/9) + 0.4*(8/14) = 0.762. Clearly, the original question text and candidate question text 1 are semantically equivalent, whereas the original question text and candidate question text 2 are not; for the life insurance intelligent question answering system, the similarity matching score of candidate question text 1 should therefore be greater than that of candidate question text 2. With forward matching alone, both candidates would receive the same score of 8/9, which is unreasonable, because candidate question text 2 semantically contains the original question text but is not equivalent to it. Considering forward and reverse matching together improves the text similarity matching algorithm and, to a certain extent, mitigates the problem of semantic inclusion between texts.
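The fusion step is a plain linear combination, so the two scores in the example can be reproduced directly. The function name `fuse_scores` is an assumption; the weights w1 = 0.6, w2 = 0.4, b = 0 and the directional scores 8/9, 8/10 and 8/14 are the values quoted in the example.

```python
def fuse_scores(forward, reverse, w1=0.6, w2=0.4, b=0.0):
    """Linear fusion: score = w1 * score(forward) + w2 * score(reverse) + b."""
    return w1 * forward + w2 * reverse + b


# Both candidates share the forward score 8/9; only the reverse score differs.
score1 = fuse_scores(8 / 9, 8 / 10)  # candidate question text 1
score2 = fuse_scores(8 / 9, 8 / 14)  # candidate question text 2
print(round(score1, 3), round(score2, 3))  # prints: 0.853 0.762
```

Because the forward scores are identical, the ranking of the two candidates is decided entirely by the reverse term, which penalizes the extra words in candidate question text 2.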

106. Compare the candidate similarity values corresponding to the plurality of candidate question texts, determine the largest candidate similarity value as the target similarity value, and select the candidate question text corresponding to the target similarity value as the standard question text.

The server compares the candidate similarity values corresponding to the plurality of candidate question texts, determines the largest candidate similarity value as the target similarity value, and selects the candidate question text corresponding to the target similarity value as the standard question text.

For each original question text there are N candidate question texts. The server calculates the similarity between each candidate question text and the original question text, sorts the candidates by similarity value, finds the candidate question text with the highest similarity, and determines it as the standard question text.
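Selecting the standard question text amounts to sorting the candidates by their fused score and taking the first one. The sketch below assumes the scores have already been computed and are held in a dictionary keyed by candidate text; the function name `select_standard_question` and the rounded scores are illustrative.

```python
def select_standard_question(candidate_scores):
    """Return (candidate_text, score) for the highest similarity matching score."""
    if not candidate_scores:
        raise ValueError("no candidate question texts to compare")
    # Sort by score, highest first, as described in step 106.
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0]


scores = {
    "胃癌可以投保平安福吗": 0.853,
    "胃癌可以投保平安福和爱满分吗": 0.762,
}
standard_question, target_similarity = select_standard_question(scores)
```

If only the best candidate is needed, `max` with the same key would avoid the full sort; the sorted list is kept here because the text describes ranking all N candidates.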

In the embodiments of the present invention, forward matching based on the user question and reverse matching based on the standard question are used to calculate the forward text similarity and the reverse text similarity respectively, and the two are fused to obtain the final similarity value. This reflects the word types that need to be matched preferentially in a specific business scenario, captures the semantic inclusion relationship between texts, and improves the accuracy of text similarity calculation in that scenario.

Referring to Fig. 2, another flowchart of the word-feature-based similarity calculation method provided by an embodiment of the present invention specifically includes:

201. Obtain an original question text, the original question text being used to indicate a search for the answer corresponding to the original question text.

The server obtains the original question text, which is used to indicate a search for the answer corresponding to the original question text.

It should be noted that one original question text may have several corresponding answers. For example, if the original question text is "What items does the physical examination include?", the corresponding answer may include one or more of the following: "otolaryngology (ENT)", "routine blood test", "five-item hepatitis B test", "routine urine test", "electrocardiogram" or "B-mode ultrasound", which is not specifically limited here.

It can be understood that the execution subject of the present invention may be a word-feature-based similarity calculation apparatus, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present invention are described with the server as the execution subject.

202. Determine a target application scenario according to the original question text and the preset application scenarios, and obtain the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, the preset application scenarios comprising a plurality of preset candidate scenarios.

Specifically: (1) according to the original question text, one of the preset application scenarios is selected as the target application scenario, the preset application scenarios comprising a plurality of application scenarios set in advance;

(2) the target word segmentation standard corresponding to the target application scenario is obtained;

Different business scenarios use different word segmentation standards. The words in a business scenario can be divided into several levels, and a word weight is set for each level. For example, in the practical application of the Ping An Life intelligent question answering system, the word segmentation standard has six categories: insurance product entities, disease name entities, location name entities, occupation name entities, service operation keywords (for example "投保" (apply for insurance) and "理赔" (claim settlement)), and other words. Words in different categories carry different weights.

(3) similar question texts that are semantically similar to the original question text are searched for in the target application scenario.

The target application scenario contains multiple similar question texts. For example, a life insurance intelligent question answering system may include questions such as "平安福投保" ("Ping An Fu insurance application"), "平安福怎么投保" ("how to apply for Ping An Fu") and "平安福怎么缴费" ("how to pay for Ping An Fu"). If the original question text is "我要投保平安福" ("I want to apply for Ping An Fu"), then "平安福投保" and "平安福怎么投保" can both be determined as similar standard question texts.

It should be noted that the same question may be expressed in different ways. For example, if the original question text is "岗前培训有哪些重点" ("What are the key points of pre-job training"), a semantically similar question text may be "岗前培训的重点内容" ("key content of pre-job training") or "岗前培训的重点包括哪些?" ("What do the key points of pre-job training include?"). As another example, if the original question text is "肠胃炎投保" ("insurance application with gastroenteritis"), semantically similar question texts are "肠胃炎投保能投保吗" ("can one apply for insurance with gastroenteritis"), "急性肠胃炎投保" ("insurance application with acute gastroenteritis") or "急性肠胃炎可以买保险吗?" ("Can one buy insurance with acute gastroenteritis?").

203. Select any one of the semantically similar question texts as the candidate question text.

The server selects any one of the semantically similar question texts as the candidate question text.

For example, for the original question text "肠胃炎投保" ("insurance application with gastroenteritis"), the semantically similar question texts are "肠胃炎投保能投保吗" ("can one apply for insurance with gastroenteritis"), "急性肠胃炎投保" ("insurance application with acute gastroenteritis") or "急性肠胃炎可以买保险吗?" ("Can one buy insurance with acute gastroenteritis?"), and "肠胃炎投保能投保吗" is selected as the candidate question text.

204. Perform word segmentation and named entity recognition on the original question text and on the candidate question text based on the target word segmentation standard, obtaining the word features of the original question text and the word features of the candidate question text.

Specifically, the server segments the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text; the server segments the candidate question text based on the target word segmentation standard to obtain the segmentation result of the candidate question text; the server performs named entity recognition on both segmentation results to obtain the word features of the original question text and of the candidate question text. The word features of the original question text comprise the labeled original words and their parts of speech, and the word features of the candidate question text comprise the labeled candidate words and their parts of speech.

Alternatively, the server segments the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text; the server obtains a preset segmentation result of the candidate question text, the preset segmentation result being produced by segmenting the candidate question text offline in advance according to the target word segmentation standard; the server performs named entity recognition on the segmentation result of the original question text and on the preset segmentation result of the candidate question text to obtain the word features of the original question text and of the candidate question text. The word features of the original question text comprise the labeled original words and their parts of speech, and the word features of the candidate question text comprise the labeled candidate words and their parts of speech.
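The offline alternative can be sketched as a lookup table built once over the question bank, so that only the user's question is segmented at query time. The helper names, the toy segmenter and the fallback to online segmentation when a candidate is missing from the table are all assumptions made for illustration.

```python
segment_calls = []  # records which texts were segmented, for demonstration only


def toy_segment(text, words=("平安福", "投保")):
    """Toy segmenter: known words are kept whole, other characters stand alone."""
    segment_calls.append(text)
    tokens, i = [], 0
    while i < len(text):
        word = next((w for w in words if text.startswith(w, i)), text[i])
        tokens.append(word)
        i += len(word)
    return tokens


def build_offline_segmentation(candidates, segment):
    """Segment every candidate question text once, ahead of time."""
    return {text: segment(text) for text in candidates}


def segment_pair(original, candidate, segment, offline):
    """Segment the original online; reuse the preset result for the candidate."""
    original_tokens = segment(original)
    candidate_tokens = offline.get(candidate)
    if candidate_tokens is None:  # assumed fallback, not described in the text
        candidate_tokens = segment(candidate)
    return original_tokens, candidate_tokens


offline_table = build_offline_segmentation(["平安福投保", "平安福怎么投保"], toy_segment)
segment_calls.clear()
original_tokens, candidate_tokens = segment_pair(
    "我要投保平安福", "平安福投保", toy_segment, offline_table
)
```

After the call, `segment_calls` contains only the original question, which shows that the candidate was served from the preset result rather than segmented again.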

For example, the original question text is "我要投保平安福" ("I want to apply for Ping An Fu") and the candidate question text is "平安福投保" ("Ping An Fu insurance application"). Both texts are preprocessed; in practice, some words may be treated as stop words and removed. After preprocessing, a preset entity recognition model or keyword recognition model is used to obtain the word features of the original question text and of the candidate question text at each level. For example, the word features of the original question text are: (我, other word) (要, other word) (投保, keyword) (平安福, insurance entity word). The word features of the candidate question text are: (平安福, insurance entity word) (投保, keyword).

205. Calculate a forward text similarity and a reverse text similarity from the word features of the original question text and the word features of the candidate question text.

The server takes the original question text as the base question text and the candidate question text as the matching question text, and calculates the forward text similarity using a preset matching formula:

$$\mathrm{Score}(A \rightarrow B) = \sum_{i=1}^{L_A} w_{A,i} \cdot \max_{j} \, \mathrm{jaccard}(token_{A,i}, token_{B,j})$$

where $A$ denotes the base question text, $B$ denotes the matching question text, $L_A$ is the number of word tokens in the base question text $A$, $w_{A,i}$ is the weight of the $i$-th token after normalization over the tokens of all levels in $A$, $token_{A,i}$ and $token_{B,j}$ are the tokens at the corresponding indices of the base question text and the matching question text, and jaccard is the similarity coefficient of two tokens:

$$\mathrm{jaccard}(token_{A,i}, token_{B,j}) = \frac{|token_{A,i} \cap token_{B,j}|}{|token_{A,i} \cup token_{B,j}|}$$

The server then takes the candidate question text as the base question text and the original question text as the matching question text, and calculates the reverse text similarity using the same preset matching formula.

For example, the original question text is "我要投保平安福" and the candidate question text is "平安福投保". The word features of the original question text are: (我, other word) (要, other word) (投保, keyword) (平安福, insurance entity word); the word features of the candidate question text are: (平安福, insurance entity word) (投保, keyword). Assume that entity words have a weight of 3, keywords a weight of 2, and other words a weight of 1. Then, for the original question text A "我要投保平安福" and the candidate question text B "平安福投保", the forward matching result is: Score(forward) = 1/(1+1+2+3)*max(jaccard(我, 平安福), jaccard(我, 投保)) + 1/(1+1+2+3)*max(jaccard(要, 平安福), jaccard(要, 投保)) + 2/(1+1+2+3)*max(jaccard(投保, 平安福), jaccard(投保, 投保)) + 3/(1+1+2+3)*max(jaccard(平安福, 平安福), jaccard(平安福, 投保)) = 1/7*0 + 1/7*0 + 2/7*1 + 3/7*1 = 5/7. The reverse matching result is: Score(reverse) = 3/(3+2)*max(jaccard(平安福, 我), jaccard(平安福, 要), jaccard(平安福, 投保), jaccard(平安福, 平安福)) + 2/(3+2)*max(jaccard(投保, 我), jaccard(投保, 要), jaccard(投保, 投保), jaccard(投保, 平安福)) = 3/5 + 2/5 = 1.
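The worked example can be checked with a short sketch of the directional score. Two points are assumptions rather than statements from the patent: tokens are compared as sets of characters when computing the Jaccard coefficient, and the function names `jaccard` and `directional_score` are hypothetical. Exact fractions are used so that the result is not blurred by floating-point rounding. With the weights 1, 1, 2 and 3, the normalizing sum is 7.

```python
from fractions import Fraction
from typing import List, Tuple

Feature = Tuple[str, str]  # (token, category)

# Category weights taken from the example: entity 3, keyword 2, other 1.
WEIGHTS = {"entity": 3, "keyword": 2, "other": 1}


def jaccard(a: str, b: str) -> Fraction:
    """Jaccard coefficient of two non-empty tokens, compared as character sets (an assumption)."""
    set_a, set_b = set(a), set(b)
    return Fraction(len(set_a & set_b), len(set_a | set_b))


def directional_score(base: List[Feature], match: List[Feature]) -> Fraction:
    """Weighted sum, over the reference tokens, of the best Jaccard match in the other text."""
    total = sum(WEIGHTS[category] for _, category in base)
    score = Fraction(0)
    for token, category in base:
        # default=0 covers an empty matching text.
        best = max((jaccard(token, other) for other, _ in match), default=Fraction(0))
        score += Fraction(WEIGHTS[category], total) * best
    return score


original = [("我", "other"), ("要", "other"), ("投保", "keyword"), ("平安福", "entity")]
candidate = [("平安福", "entity"), ("投保", "keyword")]

forward = directional_score(original, candidate)  # original text is the reference
reverse = directional_score(candidate, original)  # candidate text is the reference
print(forward, reverse)  # 5/7 1
```

Swapping the two arguments is all that distinguishes forward from reverse matching, which is why a single function covers both directions.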

206. Perform feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, where the similarity matching score indicates the degree of similarity between the original question text and the candidate question text.

The server performs feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, which indicates the degree of similarity between the original question text and the candidate question text. Specifically, the server fuses the forward text similarity and the reverse text similarity using a preset formula: score = w1*score(forward) + w2*score(reverse) + b, where b is a constant and w1 and w2 are weight constants. The server calculates the similarity matching score, score, which indicates the degree of similarity between the original question text and the candidate question text.

For example, the original question text (the user question) is "胃癌能否投保平安福" ("Can a person with stomach cancer take out Ping An Fu?"). In the question library for the corresponding business scenario, candidate question text 1 is "胃癌可以投保平安福吗" ("Can a person with stomach cancer be insured under Ping An Fu?") and candidate question text 2 is "胃癌可以投保平安福和爱满分吗" ("Can a person with stomach cancer be insured under Ping An Fu and Ai Man Fen?"). Assume the feature fusion weights are w1 = 0.6, w2 = 0.4 and b = 0. The similarity matching score of candidate question text 1 is then score1 = 0.6*(8/9) + 0.4*(8/10) = 0.853, and the similarity matching score of candidate question text 2 is score2 = 0.6*(8/9) + 0.4*(8/14) = 0.762. Clearly, "胃癌能否投保平安福" and "胃癌可以投保平安福吗" are semantically equivalent, whereas "胃癌能否投保平安福" and "胃癌可以投保平安福和爱满分吗" are not. For a life insurance intelligent question answering system, the similarity matching score of candidate question text 1 should therefore be greater than that of candidate question text 2. With forward matching alone, both candidate question texts would score 8/9, which is unreasonable, because candidate question text 2 semantically contains the original question text rather than being equivalent to it. Considering both forward matching and reverse matching improves the text similarity matching algorithm and mitigates, to some extent, the problem of semantic containment between texts.
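The fusion arithmetic in this example can be reproduced with a few lines. This is a minimal sketch: the function name `fuse` is hypothetical, and the default weights are simply the values assumed in the example (w1 = 0.6, w2 = 0.4, b = 0), not values prescribed by the method.

```python
def fuse(forward: float, reverse: float,
         w1: float = 0.6, w2: float = 0.4, b: float = 0.0) -> float:
    """Linear fusion: score = w1 * score(forward) + w2 * score(reverse) + b."""
    return w1 * forward + w2 * reverse + b


# Forward and reverse similarities from the example above.
score1 = fuse(8 / 9, 8 / 10)  # candidate question text 1
score2 = fuse(8 / 9, 8 / 14)  # candidate question text 2
print(round(score1, 3), round(score2, 3))  # 0.853 0.762
```

Both candidates share the forward score 8/9; only the reverse term, which penalizes the extra tokens in candidate 2, separates them.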

207. Compare the candidate similarity values corresponding to the multiple candidate question texts, determine the largest candidate similarity value as the target similarity value, and select the candidate question text corresponding to the target similarity value as the standard question text.

The server compares the candidate similarity values corresponding to the multiple candidate question texts, determines the largest candidate similarity value as the target similarity value, and selects the candidate question text corresponding to the target similarity value as the standard question text.

For each original question text there are N candidate question texts. The server calculates the similarity between each candidate question text and the original question text, sorts the candidates by similarity value, finds the candidate question text with the highest similarity, and determines it to be the standard question text.
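The selection step can be sketched as below, assuming the similarity matching scores have already been computed and are held in a dictionary keyed by candidate question text. The function name `select_standard_question` is hypothetical, and the scores are the rounded values from the earlier stomach cancer example.

```python
from typing import Dict, List, Tuple


def select_standard_question(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the candidate question text with the largest similarity value."""
    if not scores:
        raise ValueError("no candidate question texts to compare")
    best = max(scores, key=lambda text: scores[text])
    return best, scores[best]


candidate_scores = {
    "胃癌可以投保平安福吗": 0.853,
    "胃癌可以投保平安福和爱满分吗": 0.762,
}

# Full ranking, highest similarity first.
ranking: List[Tuple[str, float]] = sorted(
    candidate_scores.items(), key=lambda item: item[1], reverse=True
)
standard_question, target_similarity = select_standard_question(candidate_scores)
```

When only the best candidate is needed, a single pass with `max` is enough; the sort is useful only if the whole ranking is reported.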

In this embodiment of the present invention, forward matching based on the user question and reverse matching based on the standard question are used to calculate the forward text similarity and the reverse text similarity, respectively, and the two are fused to obtain the final similarity value. This reflects the word types that should be matched preferentially in a specific business scenario, captures the semantic containment relationship between texts, and improves the accuracy of text similarity calculation in that scenario.

The word feature-based similarity calculation method in the embodiments of the present invention has been described above. The word feature-based similarity calculation apparatus in the embodiments of the present invention is described below. Referring to FIG. 3, one embodiment of the word feature-based similarity calculation apparatus includes:

an obtaining unit 301, configured to obtain an original question text, where the original question text is used to indicate a search for the answer corresponding to the original question text;

a determining unit 302, configured to determine a target application scenario according to the original question text and preset application scenarios, and to obtain the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, where the preset application scenarios include a plurality of preset candidate scenarios;

a selection and extraction unit 303, configured to select any one of the plurality of semantically similar question texts as a candidate question text, and to extract the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard;

a calculation unit 304, configured to perform calculations according to the word features of the original question text and the word features of the candidate question text, respectively, to obtain the forward text similarity and the reverse text similarity;

a generating unit 305, configured to perform feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, where the similarity matching score indicates the degree of similarity between the original question text and the candidate question text;

a comparison and selection unit 306, configured to compare the candidate similarity values corresponding to the plurality of candidate question texts, determine the largest candidate similarity value as the target similarity value, and select the candidate question text corresponding to the target similarity value as the standard question text.

In this embodiment of the present invention, forward matching based on the user question and reverse matching based on the standard question are used to calculate the forward text similarity and the reverse text similarity, respectively, and the two are fused to obtain the final similarity value. This reflects the word types that should be matched preferentially in a specific business scenario, captures the semantic containment relationship between texts, and improves the accuracy of text similarity calculation in that scenario.

Referring to FIG. 4, another embodiment of the word feature-based similarity calculation apparatus in the embodiments of the present invention includes:

an obtaining unit 301, configured to obtain an original question text, where the original question text is used to indicate a search for the answer corresponding to the original question text;

a determining unit 302, configured to determine a target application scenario according to the original question text and preset application scenarios, and to obtain the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, where the preset application scenarios include a plurality of preset candidate scenarios;

a selection and extraction unit 303, configured to select any one of the plurality of semantically similar question texts as a candidate question text, and to extract the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard;

a calculation unit 304, configured to perform calculations according to the word features of the original question text and the word features of the candidate question text, respectively, to obtain the forward text similarity and the reverse text similarity;

a generating unit 305, configured to perform feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, where the similarity matching score indicates the degree of similarity between the original question text and the candidate question text;

a comparison and selection unit 306, configured to compare the candidate similarity values corresponding to the plurality of candidate question texts, determine the largest candidate similarity value as the target similarity value, and select the candidate question text corresponding to the target similarity value as the standard question text.

Optionally, the determining unit 302 is specifically configured to:

select, according to the original question text, one of the preset application scenarios as the target application scenario, where the preset application scenarios include a plurality of application scenarios set in advance; obtain the target word segmentation standard corresponding to the target application scenario; and search, within the target application scenario, for similar question texts that are semantically similar to the original question text.

Optionally, the selection and extraction unit 303 includes:

a selection module 3031, configured to select any one of the plurality of semantically similar question texts as the candidate question text;

a word segmentation and recognition module 3032, configured to perform word segmentation and named entity recognition on the original question text and the candidate question text, respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text.

Optionally, the word segmentation and recognition module 3032 is specifically configured to:

perform word segmentation on the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text; perform word segmentation on the candidate question text based on the target word segmentation standard to obtain the segmentation result of the candidate question text; and perform named entity recognition on the segmentation result of the original question text and the segmentation result of the candidate question text, respectively, to obtain the word features of the original question text and the word features of the candidate question text, where the word features of the original question text include the labeled original words and their corresponding parts of speech, and the word features of the candidate question text include the labeled candidate words and their corresponding parts of speech.

Optionally, the word segmentation and recognition module 3032 is further specifically configured to:

perform word segmentation on the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text; obtain a preset segmentation result of the candidate question text, where the preset segmentation result is the result of segmenting the candidate question text offline in advance according to the target word segmentation standard; and perform named entity recognition on the segmentation result of the original question text and the preset segmentation result of the candidate question text, respectively, to obtain the word features of the original question text and the word features of the candidate question text, where the word features of the original question text include the labeled original words and their corresponding parts of speech, and the word features of the candidate question text include the labeled candidate words and their corresponding parts of speech.

Optionally, the calculation unit 304 is specifically configured to:

take the original question text as the reference question text and the candidate question text as the matching question text, and calculate the forward text similarity using a preset matching formula:

$$\mathrm{Score}(A, B) = \sum_{i=1}^{L_A} w_{A,i} \cdot \max_{j} \, \mathrm{jaccard}\left(\mathrm{token}_{A,i}, \mathrm{token}_{B,j}\right)$$

where $A$ denotes the reference question text, $B$ denotes the matching question text, $L_A$ is the number of word tokens in the reference question text $A$, $w_{A,i}$ is the normalized weight of a token of $A$ taken over the tokens of all levels, $\mathrm{token}_{A,i}$ is the token of the reference question text at index $i$, $\mathrm{token}_{B,j}$ is the token of the matching question text at index $j$, and jaccard is the similarity coefficient of two tokens:

$$\mathrm{jaccard}\left(\mathrm{token}_{A,i}, \mathrm{token}_{B,j}\right) = \frac{\left|\mathrm{token}_{A,i} \cap \mathrm{token}_{B,j}\right|}{\left|\mathrm{token}_{A,i} \cup \mathrm{token}_{B,j}\right|}$$

and take the candidate question text as the reference question text and the original question text as the matching question text, and calculate the reverse text similarity using the same preset matching formula.

Optionally, the generating unit 305 is specifically configured to:

fuse the forward text similarity and the reverse text similarity using a preset formula: score = w1*score(forward) + w2*score(reverse) + b, where b is a constant and w1 and w2 are weight constants; and calculate the similarity matching score, score, which indicates the degree of similarity between the original question text and the candidate question text.

In this embodiment of the present invention, forward matching based on the user question and reverse matching based on the standard question are used to calculate the forward text similarity and the reverse text similarity, respectively, and the two are fused to obtain the final similarity value. This reflects the word types that should be matched preferentially in a specific business scenario, captures the semantic containment relationship between texts, and improves the accuracy of text similarity calculation in that scenario.

FIG. 3 and FIG. 4 above describe the word feature-based similarity calculation apparatus in the embodiments of the present invention in detail from the perspective of modular functional entities. The word feature-based similarity calculation device in the embodiments of the present invention is described in detail below from the perspective of hardware processing.

FIG. 5 is a schematic structural diagram of a word feature-based similarity calculation device provided by an embodiment of the present invention. The word feature-based similarity calculation device 500 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 501, a memory 509, and one or more storage media 508 (for example, one or more mass storage devices) that store application programs 507 or data 506. The memory 509 and the storage medium 508 may provide transient or persistent storage. The program stored in the storage medium 508 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the word feature-based similarity calculation device. Further, the processor 501 may be configured to communicate with the storage medium 508 and to execute, on the word feature-based similarity calculation device 500, the series of instruction operations in the storage medium 508.

The word feature-based similarity calculation device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems 505, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art will understand that the device structure shown in FIG. 5 does not limit the word feature-based similarity calculation device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. The processor 501 can perform the functions of the obtaining unit 301, the determining unit 302, the selection and extraction unit 303, the calculation unit 304, the generating unit 305, and the comparison and selection unit 306 in the above embodiments.

The components of the word feature-based similarity calculation device are described in detail below with reference to FIG. 5:

The processor 501 is the control center of the word feature-based similarity calculation device and can perform processing according to the configured word feature-based similarity calculation method. The processor 501 uses various interfaces and lines to connect the parts of the entire device, and performs the device's functions and processes data by running or executing the software programs and/or modules stored in the memory 509 and by calling the data stored in the memory 509, thereby improving the recall rate of keywords, improving the comprehensive score of each keyword, and improving the accuracy of keyword extraction. The storage medium 508 and the memory 509 are both carriers for storing data. In this embodiment of the present invention, the storage medium 508 may refer to an internal memory with a small storage capacity but high speed, while the memory 509 may be an external memory with a large storage capacity but low speed.

The memory 509 can be used to store software programs and modules. The processor 501 executes the functional applications and data processing of the word feature-based similarity calculation device 500 by running the software programs and modules stored in the memory 509. The memory 509 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required for at least one function (for example, performing calculations according to the word features of the original question text and the word features of the candidate question text, respectively, to obtain the forward text similarity and the reverse text similarity). The data storage area may store data created through use of the word feature-based similarity calculation device (for example, similarity matching scores). In addition, the memory 509 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The program for the word feature-based similarity calculation method provided in the embodiments of the present invention and the received data stream are stored in the memory, and the processor 501 calls them from the memory 509 when needed.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A word feature-based similarity calculation method, comprising:
obtaining an original question text, the original question text being used to indicate a search for the answer corresponding to the original question text;
determining a target application scenario according to the original question text and preset application scenarios, and obtaining the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, the preset application scenarios comprising a plurality of preset candidate scenarios;
selecting any one of the plurality of semantically similar question texts as a candidate question text, and extracting the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard;
performing calculations according to the word features of the original question text and the word features of the candidate question text, respectively, to obtain a forward text similarity and a reverse text similarity;
performing feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text;
comparing the candidate similarity values corresponding to the plurality of candidate question texts, determining the largest candidate similarity value as the target similarity value, and selecting the candidate question text corresponding to the target similarity value as the standard question text.

2. The word feature-based similarity calculation method according to claim 1, wherein the determining a target application scenario according to the original question text and preset application scenarios, and obtaining the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, the preset application scenarios comprising a plurality of preset candidate scenarios, comprises:
selecting, according to the original question text, one of the preset application scenarios as the target application scenario, the preset application scenarios comprising a plurality of application scenarios set in advance;
obtaining the target word segmentation standard corresponding to the target application scenario;
searching, within the target application scenario, for similar question texts that are semantically similar to the original question text.

3. The word feature-based similarity calculation method according to claim 1, wherein the selecting any one of the plurality of semantically similar question texts as a candidate question text, and extracting the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard, comprises:
selecting any one of the plurality of semantically similar question texts as the candidate question text;
performing word segmentation and named entity recognition on the original question text and the candidate question text, respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text.

4. The word feature-based similarity calculation method according to claim 3, wherein the performing word segmentation and named entity recognition on the original question text and the candidate question text, respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text, comprises:
performing word segmentation on the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text;
performing word segmentation on the candidate question text based on the target word segmentation standard to obtain the segmentation result of the candidate question text;
performing named entity recognition on the segmentation result of the original question text and the segmentation result of the candidate question text, respectively, to obtain the word features of the original question text and the word features of the candidate question text, the word features of the original question text comprising the labeled original words and their corresponding parts of speech, and the word features of the candidate question text comprising the labeled candidate words and their corresponding parts of speech.

5. The word feature-based similarity calculation method according to claim 3, wherein the performing word segmentation and named entity recognition on the original question text and the candidate question text, respectively, based on the target word segmentation standard, to obtain the word features of the original question text and the word features of the candidate question text, comprises:
performing word segmentation on the original question text based on the target word segmentation standard to obtain the segmentation result of the original question text;
obtaining a preset segmentation result of the candidate question text, wherein the preset segmentation result is the result of segmenting the candidate question text offline in advance according to the target word segmentation standard;
performing named entity recognition on the segmentation result of the original question text and the preset segmentation result of the candidate question text, respectively, to obtain the word features of the original question text and the word features of the candidate question text, the word features of the original question text comprising the labeled original words and their corresponding parts of speech, and the word features of the candidate question text comprising the labeled candidate words and their corresponding parts of speech.

6. The word feature-based similarity calculation method according to claim 1, wherein the performing calculations according to the word features of the original question text and the word features of the candidate question text, respectively, to obtain a forward text similarity and a reverse text similarity, comprises:
The word feature-based similarity calculation method according to claim 1, wherein the calculation is performed according to the word feature of the original question text and the word feature of the candidate question text to obtain a forward text Similarity and reverse text similarity, including: 将原始问题文本确定为基准问题文本,将候选问题文本确定为匹配问题文本,并基于预置匹配公式计算得到正向文本相似度,预置匹配公式为
Figure FDA0002368232140000031
其中A表示基准问题文本,B表示匹配问题文本,LA表示基准问题文本A的词语token个数,wA,i表示基准问题文本A中所有层次的token归一化后的权重,tokenA,i表示基准问题文本对应下标的token、tokenB,j表示匹配问题文本对应下标的token,jaccard表示两个token的相似度系数,
Figure FDA0002368232140000032
The original question text is determined as the reference question text, the candidate question text is determined as the matching question text, and the forward text similarity is calculated based on the preset matching formula. The preset matching formula is
Figure FDA0002368232140000031
Among them, A represents the reference question text, B represents the matching question text, L A represents the number of word tokens in the benchmark question text A, w A, i represents the normalized weight of the tokens of all levels in the benchmark question text A, token A, i represents the subscript token corresponding to the reference question text, token B, j represents the subscript token corresponding to the matching question text, jaccard represents the similarity coefficient of the two tokens,
Figure FDA0002368232140000032
将候选问题文本确定为基准问题文本,将原始问题文本确定为匹配问题文本,并基于预置匹配公式计算得到反向文本相似度。The candidate question text is determined as the reference question text, the original question text is determined as the matching question text, and the reverse text similarity is calculated based on the preset matching formula.
7. The word feature-based similarity calculation method according to any one of claims 1-6, wherein performing feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text, comprises:
fusing the forward text similarity and the reverse text similarity by a preset formula, the preset formula being: score = w1 * score(forward) + w2 * score(reverse) + b, where b is a constant and w1 and w2 are weight constants;
calculating the similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text.

8. A word feature-based similarity calculation apparatus, comprising:
an obtaining unit, configured to obtain an original question text, the original question text being used to instruct a search for the answer corresponding to the original question text;
a determining unit, configured to determine a target application scenario according to the original question text and preset application scenarios, and to obtain the target word segmentation standard corresponding to the target application scenario and a plurality of semantically similar question texts, the preset application scenarios comprising a plurality of preset candidate scenarios;
a selection and extraction unit, configured to select any one similar question text from the plurality of semantically similar question texts as the candidate question text, and to extract the word features of the original question text and the word features of the candidate question text according to the target word segmentation standard;
a calculation unit, configured to calculate separately from the word features of the original question text and the word features of the candidate question text to obtain the forward text similarity and the reverse text similarity;
a generating unit, configured to perform feature fusion on the forward text similarity and the reverse text similarity to generate a similarity matching score, the similarity matching score indicating the degree of similarity between the original question text and the candidate question text;
a comparison and selection unit, configured to compare the candidate similarity values corresponding to the plurality of candidate question texts, determine the largest candidate similarity value as the target similarity value, and select the candidate question text corresponding to the target similarity value as the standard question text.

9. A word feature-based similarity calculation device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the word feature-based similarity calculation method according to any one of claims 1-7.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the word feature-based similarity calculation method according to any one of claims 1-7.
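
The scoring steps recited in claims 6 and 7, together with the final selection of the largest candidate similarity value, can be illustrated with a minimal Python sketch. It is not the claimed implementation. The preset matching formula survives here only as a figure reference, so the aggregation below (each reference token is scored by its best Jaccard match among the matching text's tokens, weighted by its normalized weight) is an assumption chosen to agree with the symbol definitions in claim 6. Computing the Jaccard coefficient over character sets, the default values of w1, w2 and b, and all function and variable names are likewise hypothetical.

```python
from typing import Dict, List, Tuple

# Word features of one question: (token, weight) pairs. The weights stand in
# for the per-token weights w_{A,i} of claim 6 (hypothetical values).
Features = List[Tuple[str, float]]


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of two tokens.

    ASSUMPTION: the sets compared are the tokens' character sets.
    """
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def directional_similarity(base: Features, match: Features) -> float:
    """Similarity of `base` (reference text A) against `match` (matching text B).

    ASSUMPTION: sum over reference tokens of normalized weight times the best
    Jaccard score against any token of the matching text.
    """
    total = sum(w for _, w in base)
    if total <= 0 or not match:
        return 0.0
    match_tokens = [tok for tok, _ in match]
    return sum(
        (w / total) * max(jaccard(tok, m) for m in match_tokens)
        for tok, w in base
    )


def fused_score(forward: float, reverse: float,
                w1: float = 0.5, w2: float = 0.5, b: float = 0.0) -> float:
    """score = w1 * score(forward) + w2 * score(reverse) + b, as in claim 7.

    The default constants are placeholders, not values from the source.
    """
    return w1 * forward + w2 * reverse + b


def select_standard_question(original: Features,
                             candidates: Dict[str, Features]) -> Tuple[str, float]:
    """Return the candidate with the largest fused score and that score."""
    if not candidates:
        raise ValueError("at least one candidate question is required")
    scores = {
        text: fused_score(
            directional_similarity(original, feats),   # forward: original is A
            directional_similarity(feats, original),   # reverse: candidate is A
        )
        for text, feats in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy usage with made-up tokens and equal weights.
original = [("ab", 1.0)]
candidates = {
    "q1": [("ab", 1.0), ("xy", 1.0)],
    "q2": [("ab", 1.0)],
    "q3": [("zz", 1.0)],
}
print(select_standard_question(original, candidates))
```

Scoring in both directions matters because the directional score is asymmetric: in the toy data the original text is fully covered by q1 (forward score 1.0), while q1 is only half covered by the original (reverse score 0.5), so the fused score ranks q1 below the exact match q2.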
CN202010042471.4A 2020-01-15 2020-01-15 Similarity calculation method, device, equipment and storage medium based on word features Active CN111259126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010042471.4A CN111259126B (en) 2020-01-15 2020-01-15 Similarity calculation method, device, equipment and storage medium based on word features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010042471.4A CN111259126B (en) 2020-01-15 2020-01-15 Similarity calculation method, device, equipment and storage medium based on word features

Publications (2)

Publication Number Publication Date
CN111259126A (en) 2020-06-09
CN111259126B (en) 2024-11-19

Family

ID=70950454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010042471.4A Active CN111259126B (en) 2020-01-15 2020-01-15 Similarity calculation method, device, equipment and storage medium based on word features

Country Status (1)

Country Link
CN (1) CN111259126B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962216A (en) * 2021-09-15 2022-01-21 北京三快在线科技有限公司 Text processing method and device, electronic equipment and readable storage medium
CN114372122A (en) * 2021-12-08 2022-04-19 阿里云计算有限公司 Information acquisition method, computing device and storage medium
CN116757189A (en) * 2023-08-11 2023-09-15 四川互慧软件有限公司 Patient name disambiguation method based on Chinese character features

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN108052509A (en) * 2018-01-31 2018-05-18 北京神州泰岳软件股份有限公司 A kind of Text similarity computing method, apparatus and server
CN108170749A (en) * 2017-12-21 2018-06-15 北京百度网讯科技有限公司 Dialogue method, device and computer-readable medium based on artificial intelligence

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN108170749A (en) * 2017-12-21 2018-06-15 北京百度网讯科技有限公司 Dialogue method, device and computer-readable medium based on artificial intelligence
CN108052509A (en) * 2018-01-31 2018-05-18 北京神州泰岳软件股份有限公司 A kind of Text similarity computing method, apparatus and server

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113962216A (en) * 2021-09-15 2022-01-21 北京三快在线科技有限公司 Text processing method and device, electronic equipment and readable storage medium
CN113962216B (en) * 2021-09-15 2024-12-24 北京三快在线科技有限公司 Text processing method, device, electronic device and readable storage medium
CN114372122A (en) * 2021-12-08 2022-04-19 阿里云计算有限公司 Information acquisition method, computing device and storage medium
CN116757189A (en) * 2023-08-11 2023-09-15 四川互慧软件有限公司 Patient name disambiguation method based on Chinese character features
CN116757189B (en) * 2023-08-11 2023-10-31 四川互慧软件有限公司 Patient name disambiguation method based on Chinese character features

Also Published As

Publication number Publication date
CN111259126B (en) 2024-11-19

Similar Documents

Publication Publication Date Title
CN108073568B (en) Keyword extraction method and device
CN110188168A (en) Semantic relationship recognition method and device
CN111767738B (en) A label verification method, device, equipment and storage medium
CN111382255A (en) Method, apparatus, device and medium for question and answer processing
CN106294505B (en) Answer feedback method and device
CN111259126A (en) Method, Apparatus, Equipment and Storage Medium for Similarity Calculation Based on Word Features
CN112860865A (en) Method, device, equipment and storage medium for realizing intelligent question answering
CN116882372A (en) Text generation method, device, electronic equipment and storage medium
CN113407677B (en) Method, apparatus, device and storage medium for evaluating consultation dialogue quality
CN115470338A (en) A multi-scene intelligent question answering method and system based on multi-channel recall
CN117076650A (en) An intelligent dialogue method, device, medium and equipment based on a large language model
TW202001621A (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
CN112560485A (en) Entity linking method and device, electronic equipment and storage medium
CN113656575B (en) Training data generation method and device, electronic equipment and readable medium
CN118467699A (en) Private domain document question and answer method, device, electronic device and storage medium
CN110287284B (en) Semantic matching method, device and equipment
CN111782762A (en) Method, device and electronic device for determining similar questions in question answering applications
CN116955559A (en) Question-answer matching method and device, electronic equipment and storage medium
WO2022053018A1 (en) Text clustering system, method and apparatus, and device and medium
CN115329129A (en) Conference summary file generation method and device, electronic equipment and storage medium
CN115269797A (en) Answer recommendation method and system for fuzzy questions in knowledge community
CN110866393B (en) Resume information extraction method and system based on domain knowledge base
CN115329083A (en) Document classification method and device, computer equipment and storage medium
CN112464101A (en) Electronic book sorting recommendation method, electronic device and storage medium
Pattar et al. Automated Mapping of Service Level Agreements to Application Programming Interfaces of Different Cloud Service Providers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant