WO2021051763A1 - Term matching method and apparatus, terminal, and computer readable storage medium - Google Patents

Term matching method and apparatus, terminal, and computer readable storage medium

Info

Publication number
WO2021051763A1
Authority
WO
WIPO (PCT)
Prior art keywords
term
similarity
value
phrase list
word
Prior art date
Application number
PCT/CN2020/079603
Other languages
French (fr)
Chinese (zh)
Inventor
王利
Original Assignee
深圳中兴网信科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳中兴网信科技有限公司
Publication of WO2021051763A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of medical informatization, for example, to a term matching method, a term matching device, a terminal, and a computer-readable storage medium.
  • Medical terms are professional terms in the medical field, used to refer to various things, phenomena, characteristics, relationships, and processes in the medical field, such as diseases, drugs, surgical operations, inspections, etc. These terms are essential components of the clinical information system to express medical information.
  • This application provides a term matching method, including: calculating multiple similarity values between a first term and a second term according to multiple similarity calculation algorithms; and assigning a weight to each similarity value, multiplying each similarity value by its corresponding weight, and adding the products to obtain a weighted sum similarity of the multiple similarity values, where the value of the weighted sum similarity indicates the degree of matching between the first term and the second term.
  • The present application also provides a term matching device, which includes a memory, a processor, and a program stored in the memory and capable of running on the processor. When the program is executed by the processor, the term matching method of the above technical solution is implemented.
  • This application also provides a terminal, including: the term matching device described in the above technical solution.
  • The present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed, the term matching method defined in the above technical solution is implemented.
  • Fig. 1 shows a schematic flowchart of a term matching method according to an embodiment of the present application
  • Fig. 2 shows a schematic block diagram of a term matching device according to an embodiment of the present application
  • Fig. 3 shows a schematic block diagram of a terminal according to an embodiment of the present application
  • Fig. 4 shows a schematic block diagram of a computer-readable storage medium according to an embodiment of the present application.
  • A term matching method provided by an embodiment of the present application includes:
  • Step 102: Calculate multiple similarity values between the first term and the second term according to multiple similarity calculation algorithms.
  • Step 104: Assign a weight to each similarity value, multiply each similarity value by its corresponding weight, and add the products to obtain a weighted sum similarity of the multiple similarity values. The value of the weighted sum similarity indicates the degree of matching between the first term and the second term.
  • Multiple similarity calculation methods are used to calculate the similarity of the two terms to be matched (the first term and the second term) from multiple dimensions. The multiple similarities are then integrated by weighted summation, and the weighted sum similarity expresses the degree of matching between the two terms.
  • Because multiple similarity values are generated, the weighting balances the influence of each similarity calculation method on the final weighted sum similarity and integrates the multiple methods.
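Steps 102 and 104 can be sketched in a few lines of Python. This is a minimal illustration rather than the applicant's implementation: the function name `weighted_sum_similarity` and the sample similarity values and weights are assumptions made up for the example.

```python
def weighted_sum_similarity(similarities, weights):
    """Multiply each similarity value by its weight and add the products."""
    if len(similarities) != len(weights):
        raise ValueError("need exactly one weight per similarity value")
    return sum(s * w for s, w in zip(similarities, weights))


# Hypothetical similarity values for one pair of terms, as produced by
# three different similarity algorithms (step 102).
sims = [0.8, 0.5, 0.9]
# Hypothetical weights, one per similarity value (step 104).
weights = [0.5, 0.3, 0.2]

score = weighted_sum_similarity(sims, weights)  # 0.4 + 0.15 + 0.18
```

A higher `score` would indicate a better match between the two terms; how the individual similarity values are computed is left open here.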
  • In an embodiment, step S102 and step S104 include: designating a term in the first terminology system as the first term, and taking any term in the second terminology system as the second term; calculating multiple similarity values between the first term and the second term according to multiple similarity calculation algorithms; assigning a weight to each similarity value, multiplying each similarity value by its corresponding weight, and adding the products to obtain the weighted sum similarity of the multiple similarity values; and changing the second term multiple times, performing one calculation each time the second term is changed, thereby generating multiple weighted sum similarities, where the maximum of the multiple weighted sum similarities indicates the degree of matching between the designated term in the first terminology system and a second term in the second terminology system.
  • A terminology system contains multiple terms, and each term consists of a string of characters. A term is selected from the first terminology system (the first term), and the terms in the second terminology system are traversed (the second term). Each time, one term is selected from the second terminology system and its weighted sum similarity with the selected term of the first terminology system is calculated. Multiple selections yield multiple weighted sum similarity values, and the term in the second terminology system that corresponds to the largest of these values is the matching result.
  • The accuracy of term matching is improved, and the mapping relationship between terms is established more efficiently. Compared with manual operation, speed is improved and the error rate is reduced.
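The traversal described above, one fixed first term compared against every term of the second terminology system with the maximum kept, might look like the following sketch. The helper names (`best_match`, `char_jaccard`, `exact`) and the toy similarity functions are illustrative assumptions that stand in for the real similarity algorithms.

```python
def best_match(first_term, second_system, similarity_fns, weights):
    """Return (term, score) of the best-matching term in second_system."""
    best_term, best_score = None, float("-inf")
    for candidate in second_system:
        # One similarity value per algorithm, then the weighted sum.
        sims = [fn(first_term, candidate) for fn in similarity_fns]
        score = sum(s * w for s, w in zip(sims, weights))
        if score > best_score:
            best_term, best_score = candidate, score
    return best_term, best_score


# Toy similarity functions, used only to make the sketch runnable.
def char_jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def exact(a, b):
    return 1.0 if a == b else 0.0


system_b = ["renal calculus", "kidney stones", "gallstone"]
term, score = best_match("kidney stone", system_b, [char_jaccard, exact], [0.7, 0.3])
```

With these toy functions the best match for "kidney stone" is "kidney stones"; ties are resolved in favor of the first candidate encountered, which is a choice of this sketch.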
  • In an embodiment, step S102 and step S104 include: taking a term in the first terminology system as the first term, and taking a term in the second terminology system as the second term; calculating multiple similarity values between the first term and the second term according to multiple similarity calculation algorithms; assigning a weight to each similarity value, multiplying each similarity value by its corresponding weight, and adding the products to obtain the weighted sum similarity of the multiple similarity values; performing the calculation multiple times by changing the first term and the second term, to generate multiple weighted sum similarities; and summing the multiple weighted sum similarities to generate a total matching degree value, where the total matching degree value indicates the degree of matching between the first terminology system and the second terminology system.
  • A terminology system contains multiple terms, and each term consists of a string of characters. One term is extracted from each of the first terminology system and the second terminology system, the similarity values of the two terms are calculated in multiple ways, and the weighted sum similarity is then calculated. Repeating the extraction and the weighted sum similarity calculation (that is, calculating the weighted sum similarity between two terms of the two terminology systems) yields multiple weighted sum similarity values. Accumulating these values gives the total matching degree value, which expresses the degree of matching between the first terminology system and the second terminology system.
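A rough sketch of the total matching degree follows. The text does not say which term pairs are drawn, so this example assumes every pair across the two systems is evaluated; `total_matching_degree` and `token_jaccard` are hypothetical names, and the single toy similarity function stands in for the multiple algorithms.

```python
from itertools import product


def total_matching_degree(first_system, second_system, similarity_fns, weights):
    """Accumulate the weighted sum similarity over term pairs.

    Assumption: all pairs (a, b) with a from the first system and b from
    the second system are evaluated.
    """
    total = 0.0
    for a, b in product(first_system, second_system):
        sims = [fn(a, b) for fn in similarity_fns]
        total += sum(s * w for s, w in zip(sims, weights))
    return total


def token_jaccard(a, b):
    """Toy similarity: Jaccard coefficient over whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


system_a = ["kidney stone", "fever"]
system_b = ["kidney stone", "high fever"]
total = total_matching_degree(system_a, system_b, [token_jaccard], [1.0])
```

Here the four pairs contribute 1, 0, 0 and 0.5, so `total` is 1.5.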
  • In an embodiment, the calculation process further includes: in the step of assigning weights, performing the weighted summation of the multiple similarity values with multiple weight combinations, so that each weight combination generates one total matching degree value and the multiple weight combinations generate multiple total matching degree values; and recording the maximum of the multiple total matching degree values, which represents the matching result of the first terminology system and the second terminology system.
  • When the weighted sum similarity between two terms is calculated, multiple different weight combinations are applied to the multiple similarity values of the same pair of terms, giving multiple weighted sum similarities. For each weight combination, the weighted sum similarities of multiple pairs of terms are accumulated to obtain a total matching degree between the terminology systems, so that different weight combinations yield multiple total matching degrees. The maximum of these total matching degrees indicates the matching result of the first terminology system and the second terminology system.
  • The weights in each weight combination sum to 1, so the weighted sum similarity obtained with such a combination reflects the weighted average of the similarities given by the multiple similarity calculation methods.
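The search over weight combinations can be illustrated as below. The candidate combinations, the per-pair similarity tuples, and the function name `best_weight_combination` are all assumptions for the example; the source only states that several combinations are tried, that each sums to 1, and that the maximum total is recorded.

```python
import math


def best_weight_combination(pair_similarities, candidate_weights):
    """Return (best_total, best_weights) over the candidate combinations.

    pair_similarities: one tuple of similarity values per term pair.
    candidate_weights: weight tuples, each expected to sum to 1.
    """
    best_total, best_weights = float("-inf"), None
    for weights in candidate_weights:
        if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
            raise ValueError(f"weights {weights} do not sum to 1")
        # Total matching degree for this weight combination.
        total = sum(
            sum(s * w for s, w in zip(sims, weights))
            for sims in pair_similarities
        )
        if total > best_total:
            best_total, best_weights = total, weights
    return best_total, best_weights


# Hypothetical similarity tuples for two term pairs (three algorithms each).
pairs = [(0.8, 0.5, 0.9), (0.6, 0.4, 0.7)]
candidates = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5), (1 / 3, 1 / 3, 1 / 3)]
best_total, best_weights = best_weight_combination(pairs, candidates)
```

For these made-up numbers the three candidates give totals of 1.29, 1.35 and 1.30, so the second combination is recorded.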
  • In an embodiment, calculating multiple similarity values between the first term and the second term according to multiple similarity calculation algorithms includes: separately calculating the cosine similarity value, the Jaccard similarity value, and the hash similarity value between the first term and the second term.
  • The multiple similarity calculation algorithms include: cosine similarity, Jaccard similarity, and hash similarity (Simhash similarity).
  • Cosine similarity measures the similarity between two short texts in the word frequency dimension: each term is converted (encoded) into a word frequency vector, and the similarity between the two terms is then calculated by the cosine similarity algorithm.
  • Jaccard similarity is also known as the Jaccard coefficient. The Jaccard similarity calculation algorithm is used for document data; in the case of binary attributes, the comparison of two terms reduces to the Jaccard coefficient, which gives the degree of similarity between the two terms.
  • The Simhash similarity calculation algorithm encodes the terms and reduces their dimensionality, calculates the Hamming distance between the reduced terms, and derives the degree of similarity from the Hamming distance.
  • The three similarity calculation algorithms above differ in how they calculate and in what they emphasize. Considering the three similarity values between terms together can improve the accuracy of term matching.
  • In an embodiment, calculating the cosine similarity value between the first term and the second term includes: segmenting the first term and the second term based on the word segmentation dictionary, and removing stop words from the first term and the second term based on the stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; encoding the first phrase list and the second phrase list to obtain a first word frequency vector corresponding to the first phrase list and a second word frequency vector corresponding to the second phrase list; and calculating the cosine value between the first word frequency vector and the second word frequency vector, where the cosine value is the cosine similarity value of the two word frequency vectors. The larger the cosine value, the higher the similarity.
  • Each term is segmented and its stop words are removed, so that the term is broken down into a phrase list; the phrase list is then encoded (for example, with one-hot encoding) to obtain the word frequency vector of the term.
  • From the word frequency vectors of two terms, the cosine similarity between the two terms can be calculated.
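The cosine similarity steps can be sketched as follows. The whitespace split and the tiny `STOP_WORDS` set are stand-ins (assumptions) for the dictionary-based word segmentation and stop word dictionary that the text refers to; the function names are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop word dictionary, for illustration only.
STOP_WORDS = {"of", "the", "with"}


def phrase_list(term):
    """Segment a term and drop stop words.

    Assumption: a whitespace split replaces dictionary-based segmentation.
    """
    return [w for w in term.lower().split() if w not in STOP_WORDS]


def cosine_similarity(term_a, term_b):
    """Cosine of the angle between the two word frequency vectors."""
    freq_a = Counter(phrase_list(term_a))
    freq_b = Counter(phrase_list(term_b))
    vocab = sorted(set(freq_a) | set(freq_b))
    vec_a = [freq_a[w] for w in vocab]  # word frequency vector of term_a
    vec_b = [freq_b[w] for w in vocab]  # word frequency vector of term_b
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0


same = cosine_similarity("stone of the kidney", "kidney stone")
partial = cosine_similarity("kidney stone", "ureteral stone")
```

After stop word removal the first pair has identical phrase lists, so `same` is 1 up to floating-point rounding, while `partial` is about 0.5 because only one of the two words is shared.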
  • In an embodiment, calculating the Jaccard similarity value between the first term and the second term includes: segmenting the first term and the second term based on the word segmentation dictionary, and removing stop words from the first term and the second term based on the stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; encoding the first phrase list and the second phrase list to obtain a first word frequency vector corresponding to the first phrase list and a second word frequency vector corresponding to the second phrase list; and calculating the ratio of the intersection to the union of the first word frequency vector and the second word frequency vector to obtain the Jaccard similarity value, where the greater the Jaccard similarity value, the higher the similarity.
  • Each term is segmented and its stop words are removed, so that the term is broken down into a phrase list; the phrase list is encoded to obtain the vector of the term, from which the degree of similarity between the terms can be evaluated.
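The Jaccard step reduces to a ratio of set sizes. The sketch below assumes the phrase lists have already been produced by segmentation and stop word removal, and treats each list as a set of words; `jaccard_similarity` is a name chosen for the example.

```python
def jaccard_similarity(words_a, words_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty phrase lists
    return len(set_a & set_b) / len(union)


# Hypothetical phrase lists for two diagnostic terms.
a1 = ["left", "kidney", "stone"]
b1 = ["kidney", "stone"]
value = jaccard_similarity(a1, b1)  # 2 shared words out of 3 distinct words
```

Unlike the cosine value, this measure ignores how often a word occurs and only looks at which words are present.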
  • In an embodiment, calculating the hash similarity value between the first term and the second term includes: segmenting the first term and the second term based on the word segmentation dictionary, and removing stop words from the first term and the second term based on the stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; converting each word in the first phrase list and the second phrase list into a hash value number string, and multiplying the hash value number string by the weight of the word to obtain the sequence string of each word; adding the sequence strings of the words in the first phrase list to obtain the first term sequence string corresponding to the first phrase list, and adding the sequence strings of the words in the second phrase list to obtain the second term sequence string corresponding to the second phrase list; converting the first term sequence string and the second term sequence string into binary strings; calculating the Hamming distance between the binary string of the first term sequence string and the binary string of the second term sequence string; and determining the hash similarity between the first term and the second term according to the Hamming distance, where the greater the hash similarity value, the higher the similarity.
  • Matching the diagnostic term systems of two hospitals, term system A and term system B, mainly includes the following processes:
  • Cosine similarity value: calculate the cosine value between the word frequency vectors a1 and b1. The larger the value, the higher the similarity.
  • Jaccard similarity value: given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of their union. The larger the Jaccard value, the higher the similarity. Here, set A corresponds to a1 and set B corresponds to b1.
  • Simhash similarity value: through the hash algorithm, each word is turned into a hash value number string. For example, “kidney” is calculated as 100101 through the hash algorithm, and “stone” is calculated as 101011 through the hash algorithm.
  • The phrase lists a1 and b1 are respectively weighted and summed to obtain {12, 27, -33, 5, -1, 7} and {23, -21, -6, 11, 8, 14}.
  • The number string obtained by the weighted summation is converted into a 01 string: a digit greater than 0 becomes 1, and a digit less than or equal to 0 becomes 0. For example, the 01 strings corresponding to the phrase lists a1 and b1 are 110101 and 100111, respectively.
  • T k = {2.212, 1.876, 2.436, 1.943, 2.113, 2.085}.
  • The Simhash similarity value is calculated by the Simhash similarity algorithm.
  • The above steps do not disclose every calculation step; that is, a Simhash similarity value calculated by conventional technical means can be used in the weighted sum similarity calculation proposed by this application to obtain the matching degree of the terms. As the algorithm evolves, some of its steps may change, but its final result can still be applied to the term matching method proposed in this application.
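The Simhash steps of the example can be sketched as follows. Several details are assumptions, since the text leaves them open: SHA-256 stands in for the unspecified hash algorithm, a hash bit of 1 contributes +weight and a bit of 0 contributes -weight (the conventional Simhash rule, consistent with the negative entries in the example), and the final similarity is taken as 1 minus the normalized Hamming distance.

```python
import hashlib


def word_hash_bits(word, n_bits=64):
    """First n_bits of a hash of the word, as 0/1 values (SHA-256 is an assumption)."""
    digest = int.from_bytes(hashlib.sha256(word.encode("utf-8")).digest(), "big")
    return [(digest >> i) & 1 for i in range(n_bits)]


def weighted_sum_vector(words, word_weights, n_bits=64):
    """Add +weight for each 1 bit and -weight for each 0 bit, per position."""
    totals = [0.0] * n_bits
    for word in words:
        weight = word_weights.get(word, 1.0)  # default weight is an assumption
        for i, bit in enumerate(word_hash_bits(word, n_bits)):
            totals[i] += weight if bit else -weight
    return totals


def binarize(vector):
    """Positions greater than 0 become '1'; all others become '0'."""
    return "".join("1" if v > 0 else "0" for v in vector)


def hamming_distance(bits_a, bits_b):
    if len(bits_a) != len(bits_b):
        raise ValueError("binary strings must have the same length")
    return sum(x != y for x, y in zip(bits_a, bits_b))


def hash_similarity(bits_a, bits_b):
    """Assumed mapping from Hamming distance to a similarity in [0, 1]."""
    return 1 - hamming_distance(bits_a, bits_b) / len(bits_a)


# The weighted sums given in the example above.
bits_a = binarize([12, 27, -33, 5, -1, 7])    # "110101"
bits_b = binarize([23, -21, -6, 11, 8, 14])   # "100111"
distance = hamming_distance(bits_a, bits_b)   # positions 2 and 5 differ
similarity = hash_similarity(bits_a, bits_b)
```

The two binary strings reproduce the 01 strings quoted in the example; under the assumed mapping their Hamming distance of 2 over 6 bits corresponds to a similarity of about 0.67.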
  • A term matching device 200 includes: a memory 202, a processor 204, and a program stored on the memory 202 and capable of running on the processor 204; when the program is executed by the processor 204, the term matching method of any of the above embodiments is implemented.
  • The term matching device 200 achieves the effects of the term matching method of any one of the above embodiments, which will not be repeated here.
  • A terminal 300 includes: the term matching device 200 described in the third embodiment.
  • The terminal 300 can: calculate multiple similarity values between the first term and the second term according to multiple similarity calculation algorithms; and assign a weight to each similarity value, multiply each similarity value by its corresponding weight, and add the products to obtain the weighted sum similarity of the multiple similarity values, where the value of the weighted sum similarity represents the degree of matching between the first term and the second term.
  • The terminal 300 achieves the effects of the term matching method of any of the foregoing embodiments, and details are not described herein again.
  • A computer-readable storage medium 400 is also provided, on which a computer program 402 is stored. When the computer program 402 is executed, the term matching method defined in any of the above embodiments is implemented.
  • When the computer program 402 is executed, the following is realized: multiple similarity values between the first term and the second term are calculated according to multiple similarity calculation algorithms; a weight is assigned to each similarity value, each similarity value is multiplied by its corresponding weight, and the products are added to obtain a weighted sum similarity of the multiple similarity values, where the value of the weighted sum similarity indicates the degree of matching between the first term and the second term.
  • In an embodiment, a term in the first terminology system is designated as the first term, and any term in the second terminology system is selected as the second term; multiple similarity values between the first term and the second term are calculated according to multiple similarity calculation algorithms; a weight is assigned to each similarity value, each similarity value is multiplied by its corresponding weight, and the products are added to obtain the weighted sum similarity of the multiple similarity values; the second term is changed multiple times, with one calculation performed each time it is changed, to generate multiple weighted sum similarities. The maximum of the multiple weighted sum similarities indicates the degree of matching between the designated term in the first terminology system and the second term in the second terminology system.
  • In an embodiment, a term in the first terminology system is taken as the first term, and a term in the second terminology system is taken as the second term; multiple similarity values between the first term and the second term are calculated according to multiple similarity calculation algorithms; a weight is assigned to each similarity value, each similarity value is multiplied by its corresponding weight, and the products are added to obtain the weighted sum similarity of the multiple similarity values; the calculation is performed multiple times by changing the first term and the second term, to generate multiple weighted sum similarities; and the multiple weighted sum similarities are summed to generate a total matching degree value, where the total matching degree value indicates the degree of matching between the first terminology system and the second terminology system.
  • In an embodiment, the calculation process further includes: in the step of assigning weights, performing the weighted summation of the multiple similarity values with multiple weight combinations, so that each weight combination generates a corresponding total matching degree value and the multiple weight combinations generate multiple total matching degree values; and recording the maximum of the multiple total matching degree values, which represents the matching result of the first terminology system and the second terminology system.
  • In an embodiment, calculating multiple similarity values between the first term and the second term according to multiple similarity calculation algorithms includes: separately calculating the cosine similarity value, the Jaccard similarity value, and the hash similarity value between the first term and the second term.
  • In an embodiment, calculating the cosine similarity value between the first term and the second term includes: segmenting the first term and the second term based on the word segmentation dictionary, and removing stop words from the first term and the second term based on the stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; encoding the first phrase list and the second phrase list to obtain a first word frequency vector corresponding to the first phrase list and a second word frequency vector corresponding to the second phrase list; and calculating the cosine value between the first word frequency vector and the second word frequency vector, where the cosine value is the cosine similarity value of the two word frequency vectors. The larger the cosine value, the higher the similarity.
  • In an embodiment, calculating the Jaccard similarity value between the first term and the second term includes: segmenting the first term and the second term based on the word segmentation dictionary, and removing stop words from the first term and the second term based on the stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; encoding the first phrase list and the second phrase list to obtain a first word frequency vector corresponding to the first phrase list and a second word frequency vector corresponding to the second phrase list; and calculating the ratio of the intersection to the union of the first word frequency vector and the second word frequency vector to obtain the Jaccard similarity value, where the greater the Jaccard similarity value, the higher the similarity.
  • In an embodiment, calculating the hash similarity value between the first term and the second term includes: segmenting the first term and the second term based on the word segmentation dictionary, and removing stop words from the first term and the second term based on the stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; converting each word in the first phrase list and the second phrase list into a hash value number string, and multiplying the hash value number string by the weight of the word to obtain the sequence string of each word; adding the sequence strings of the words in the first phrase list to obtain the first term sequence string corresponding to the first phrase list, and adding the sequence strings of the words in the second phrase list to obtain the second term sequence string corresponding to the second phrase list; converting the first term sequence string and the second term sequence string into binary strings; calculating the Hamming distance between the binary string of the first term sequence string and the binary string of the second term sequence string; and determining the hash similarity between the first term and the second term according to the Hamming distance, where the greater the hash similarity value, the higher the similarity.
  • This application can realize automatic matching of terms between terminology systems (term dictionaries), replace manual operation, reduce the error rate, and help promote the integration, analysis and reuse of medical data.
  • The embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, Compact Disc-Read Only Memory (CD-ROM), optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable term matching equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device. The instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable term matching equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • The word “include” does not exclude the presence of unlisted parts or steps.
  • The word “a” or “an” preceding a component does not exclude the presence of multiple such components.
  • The application can be implemented by means of hardware including different components and by means of a suitably programmed computer. Several of the listed devices may be embodied by the same hardware item.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed herein are a term matching method and apparatus, a terminal, and a computer readable storage medium. The term matching method comprises: according to multiple similarity calculation algorithms, respectively calculating multiple similarity values between a first term and a second term; and assigning a weight to each similarity value, respectively multiplying the multiple similarity values by the corresponding weights, and adding the multiplication results to obtain a weighted sum similarity for the multiple similarity values, the value of the weighted sum similarity being used to represent the degree to which the first term and the second term match.

Description

Term matching method, device, terminal and computer readable storage medium
This application claims priority to Chinese patent application No. 201910869178.2, filed with the Chinese Patent Office on September 16, 2019, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of medical informatization, for example, to a term matching method, a term matching device, a terminal, and a computer-readable storage medium.
Background
Medical terms (hereinafter referred to as terms) are professional terms in the medical field, used to refer to various things, phenomena, characteristics, relationships, and processes in the medical field, such as diseases, drugs, surgical operations, and examinations and tests. These terms are essential components with which a clinical information system expresses medical information.
Standards for medical terminology are scarce, and the system of standards is not yet complete. The terms in these terminology standards differ greatly in granularity and expression from the terms used in actual clinical scenarios, so they are difficult to apply directly in clinical information systems. As a result, the medical information systems of most medical institutions have created their own private term dictionaries. Because there are many medical information system vendors, even the term dictionaries of the same kind in different systems of one institution differ from each other; for example, drug term dictionaries differ. For these reasons, term names and codes are highly heterogeneous across clinical information systems, so that medical information systems cannot interoperate and medical data is difficult to share. Exchanging information between different medical information systems therefore requires mapping and matching the term dictionaries of the different systems. This work is generally performed manually and has a relatively high error rate, which has made it a bottleneck in the integration, analysis and reuse of medical data.
Summary
This application provides a term matching method, including: calculating multiple similarity values between a first term and a second term using multiple similarity calculation algorithms; and assigning a weight to each similarity value, multiplying each similarity value by its corresponding weight, and adding the products to obtain a weighted sum similarity of the multiple similarity values, where the value of the weighted sum similarity indicates the degree of matching between the first term and the second term.
This application also provides a term matching device, including a memory, a processor, and a program that is stored in the memory and can run on the processor; when executed by the processor, the program implements the term matching method of the above technical solution.
This application also provides a terminal, including the term matching device described in the above technical solution.
This application also provides a computer-readable storage medium on which a computer program is stored; when executed, the computer program implements the term matching method defined in the above technical solution.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a term matching method according to an embodiment of this application;
Fig. 2 is a schematic block diagram of a term matching device according to an embodiment of this application;
Fig. 3 is a schematic block diagram of a terminal according to an embodiment of this application;
Fig. 4 is a schematic block diagram of a computer-readable storage medium according to an embodiment of this application.
Detailed Description
This application is described below with reference to the drawings and specific embodiments.
Multiple embodiments are set forth in the following description to facilitate understanding of this application. However, this application may also be implemented in ways other than those described here, so its scope of protection is not limited by the specific embodiments disclosed below.
Embodiment 1
As shown in Fig. 1, a term matching method provided by an embodiment of this application includes:
Step 102: calculate multiple similarity values between a first term and a second term using multiple similarity calculation algorithms.
Step 104: assign a weight to each similarity value, multiply each similarity value by its corresponding weight, and add the products to obtain a weighted sum similarity of the multiple similarity values. The value of the weighted sum similarity indicates the degree of matching between the first term and the second term.
In this embodiment, given the complexity of how terms are composed, multiple similarity calculation methods are used to calculate the similarity of the two terms to be matched (the first term and the second term) from multiple dimensions, and the resulting similarities are combined by weighted summation, so that the weighted sum similarity expresses how well the two terms match. Each similarity calculation method produces one similarity value, and the weights balance the influence of each method on the final weighted sum similarity, so the characteristics of the different methods are combined to represent the degree of matching accurately. This improves the accuracy of term matching, addresses the low efficiency and high error rate of manual operation, and helps promote the sharing of medical information.
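Step 104 can be illustrated with a minimal Python sketch. The function name `weighted_sum_similarity` and the sample numbers are assumptions chosen for illustration; the application itself does not prescribe an implementation or particular weights.

```python
from typing import Sequence


def weighted_sum_similarity(similarities: Sequence[float],
                            weights: Sequence[float]) -> float:
    """Multiply each similarity value by its weight and add the products.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if len(similarities) != len(weights):
        raise ValueError("each similarity value needs exactly one weight")
    return sum(s * w for s, w in zip(similarities, weights))


# Illustrative values only: three similarity values and three weights.
score = weighted_sum_similarity([0.5, 0.25, 1.0], [0.5, 0.25, 0.25])
```

With weights that sum to 1, as in this sample, the result is a weighted average of the individual similarity values.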
In the term matching method of the above embodiment, in some application scenarios, step 102 and step 104 include: designating a term in a first term system as the first term, and taking any term in a second term system as the second term; calculating multiple similarity values between the first term and the second term using multiple similarity calculation algorithms; assigning a weight to each similarity value, multiplying each similarity value by its corresponding weight, and adding the products to obtain a weighted sum similarity of the multiple similarity values; and changing the second term multiple times and performing one calculation each time the second term is changed, thereby generating multiple weighted sum similarities, where the maximum of the multiple weighted sum similarities indicates the degree of matching between the designated term in the first term system and the second term in the second term system.
In this embodiment, a term system contains multiple terms, and each term consists of a string of characters. One term is selected in the first term system (the first term), and the terms in the second term system (second terms) are traversed: each time, one term is selected from the second term system and its weighted sum similarity with the selected term of the first term system is calculated. Repeating the selection yields multiple weighted sum similarity values, and the term in the second term system that corresponds to the largest of these values is the matching result. This improves the accuracy of term matching and establishes term mapping relationships efficiently; compared with manual operation, it is faster and less error-prone.
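The traversal described above can be sketched as follows. This is an illustration under stated assumptions: `best_match` is a hypothetical name, and the scoring function is passed in as a parameter because the text leaves the concrete similarity algorithms and weights to later paragraphs.

```python
from typing import Callable, Iterable, Optional, Tuple


def best_match(first_term: str,
               second_system: Iterable[str],
               score: Callable[[str, str], float]) -> Tuple[str, float]:
    """Return the term of the second system with the highest score.

    `score` stands in for the weighted sum similarity of two terms.
    """
    best_term: Optional[str] = None
    best_score = float("-inf")
    for candidate in second_system:  # traverse the second term system
        value = score(first_term, candidate)
        if value > best_score:
            best_term, best_score = candidate, value
    if best_term is None:
        raise ValueError("the second term system is empty")
    return best_term, best_score
```

Ties are resolved in favor of the first candidate encountered; the text does not say how ties should be handled, so this is a design choice of the sketch.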
In the term matching method of the above embodiment, in some application scenarios, step 102 and step 104 include: taking a term in a first term system as the first term, and taking a term in a second term system as the second term; calculating multiple similarity values between the first term and the second term using multiple similarity calculation algorithms; assigning a weight to each similarity value, multiplying each similarity value by its corresponding weight, and adding the products to obtain a weighted sum similarity of the multiple similarity values; changing the first term and the second term multiple times and repeating the calculation, thereby generating multiple weighted sum similarities; and summing the multiple weighted sum similarities to generate a total matching degree value, where the total matching degree value indicates the degree of matching between the first term system and the second term system.
In this embodiment, a term system contains multiple terms, and each term consists of a string of characters. One term is drawn from the first term system and one from the second term system, the similarity values of the two terms are calculated with multiple methods, and the weighted sum similarity is then obtained. Repeating the draw and the calculation (so that the weighted sum similarity is computed for every pair of terms across the two term systems) yields multiple weighted sum similarity values. These values are accumulated to obtain the total matching degree value, which indicates the degree of matching between the first term system and the second term system.
In the term matching method of the above embodiment, optionally, the calculation further includes: in the step of assigning weights, performing the weighted summation of the multiple similarity values with multiple weight combinations, so that each weight combination generates one total matching degree value and the multiple weight combinations generate multiple total matching degree values; and recording the maximum of the multiple total matching degree values, which represents the matching result of the first term system and the second term system.
In this embodiment, when the weighted sum similarity between two terms is calculated, multiple different weight combinations are applied to the multiple similarity values of the same pair of terms, yielding multiple weighted sum similarities. Accumulating the weighted sum similarities of the term pairs gives the total matching degree between the term systems, so one total matching degree is obtained per weight combination; the maximum of these total matching degrees represents the matching result of the first term system and the second term system. Optionally, the weights in each group sum to 1; the weighted sum similarity obtained with such a weight combination reflects the weighted average of the similarities produced by the multiple similarity calculation methods.
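A minimal sketch of the total matching degree and of the selection among weight combinations follows. It assumes that the similarity values of every term pair have already been computed and are supplied as tuples; the function names are hypothetical, and summing over all supplied pairs follows the description above.

```python
from typing import Sequence, Tuple

Similarities = Tuple[float, ...]  # one similarity value per algorithm
Weights = Tuple[float, ...]       # one weight per algorithm


def total_matching_degree(pair_similarities: Sequence[Similarities],
                          weights: Weights) -> float:
    """Sum the weighted sum similarities of all term pairs."""
    return sum(sum(s * w for s, w in zip(sims, weights))
               for sims in pair_similarities)


def best_weight_combination(pair_similarities: Sequence[Similarities],
                            weight_combinations: Sequence[Weights]
                            ) -> Tuple[Weights, float]:
    """Return the weight combination with the largest total matching degree."""
    if not weight_combinations:
        raise ValueError("at least one weight combination is required")
    totals = [total_matching_degree(pair_similarities, w)
              for w in weight_combinations]
    k = max(range(len(totals)), key=lambda i: totals[i])
    return weight_combinations[k], totals[k]
```

Each candidate weight combination here is expected to sum to 1, in line with the optional constraint in the text; the sketch does not enforce it.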
In the term matching method of the above embodiment, optionally, calculating multiple similarity values between the first term and the second term using multiple similarity calculation algorithms includes: calculating the cosine similarity value, the Jaccard similarity value, and the hash similarity value of the first term and the second term.
In this embodiment, the multiple similarity calculation algorithms include cosine similarity, Jaccard similarity, and hash similarity (Simhash similarity). Cosine similarity measures how similar two short texts are in terms of word frequency: each term is converted (encoded) into a word frequency vector, and the cosine similarity algorithm then computes the degree of similarity between the two terms. Jaccard similarity, also known as the Jaccard coefficient, is used for document data; in the case of binary attributes, the two terms are reduced to a Jaccard coefficient that gives their degree of similarity. The Simhash similarity algorithm encodes the terms and reduces their dimensionality, calculates the Hamming distance between the reduced terms, and derives the degree of similarity from the Hamming distance. The three algorithms compute similarity differently and emphasize different aspects, so considering all three similarity values together improves the accuracy of term matching.
In the term matching method of the above embodiment, optionally, calculating the cosine similarity value of the first term and the second term includes: segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; encoding the first phrase list and the second phrase list to obtain a first word frequency vector corresponding to the first phrase list and a second word frequency vector corresponding to the second phrase list; and calculating the cosine between the first word frequency vector and the second word frequency vector, where the cosine is the cosine similarity value of the two word frequency vectors, and a larger cosine indicates a higher similarity.
In this embodiment, the terms are first segmented and stripped of stop words, which breaks each term into a phrase list. The phrase list is then encoded (for example, with one-hot (oneHot) encoding) to obtain the word frequency vector of the term. The word frequency vectors serve as the input of the cosine similarity algorithm, which calculates the cosine similarity between any two terms.
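The cosine calculation can be sketched as below. The sketch assumes the terms have already been segmented and stripped of stop words, since the segmentation and stop word dictionaries are not given in the text. The function names, and the use of word counts over a sorted shared vocabulary as the encoding, are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def term_frequency_vectors(tokens_a: Sequence[str],
                           tokens_b: Sequence[str]
                           ) -> Tuple[List[int], List[int]]:
    """Encode two phrase lists as count vectors over a shared vocabulary."""
    vocabulary = sorted(set(tokens_a) | set(tokens_b))
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    return ([counts_a[w] for w in vocabulary],
            [counts_b[w] for w in vocabulary])


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vec_a, vec_b = term_frequency_vectors(
    ["kidney", "and", "ureter", "stone"],
    ["kidney", "stone", "with", "ureter", "stone"])
similarity = cosine_similarity(vec_a, vec_b)
```

The order of the vocabulary does not affect the cosine, so sorting is only a convenience for reproducibility.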
In the term matching method of the above embodiment, optionally, calculating the Jaccard similarity value of the first term and the second term includes: segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; encoding the first phrase list and the second phrase list to obtain a first word frequency vector corresponding to the first phrase list and a second word frequency vector corresponding to the second phrase list; and calculating the ratio of the intersection to the union of the first word frequency vector and the second word frequency vector to obtain the Jaccard similarity value, where a larger Jaccard similarity value indicates a higher similarity.
In this embodiment, the terms are first segmented and stripped of stop words, which breaks each term into a phrase list. The phrase list is then encoded to obtain the vector of the term, and the Jaccard similarity algorithm evaluates the degree of similarity between the terms.
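A minimal sketch of the Jaccard coefficient follows. It treats each phrase list as a set of distinct words, which is one common reading of "intersection over union" for binary attributes; the text does not spell out how repeated words are handled, so this choice and the function name are assumptions.

```python
from typing import Sequence


def jaccard_similarity(tokens_a: Sequence[str],
                       tokens_b: Sequence[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty lists score 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


jaccard = jaccard_similarity(
    ["kidney", "and", "ureter", "stone"],
    ["kidney", "stone", "with", "ureter", "stone"])
```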
In the term matching method of the above embodiment, optionally, calculating the hash similarity value of the first term and the second term includes: segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop word dictionary, to generate a first phrase list corresponding to the first term and a second phrase list corresponding to the second term; converting each word in the first phrase list and the second phrase list into a hash value digit string, and multiplying the hash value digit string by the weight of the word to obtain the sequence string of each word; adding the sequence strings of the words in the first phrase list to obtain a first term sequence string corresponding to the first phrase list, and adding the sequence strings of the words in the second phrase list to obtain a second term sequence string corresponding to the second phrase list; converting the first term sequence string and the second term sequence string into binary strings; calculating the Hamming distance between the binary string of the first term sequence string and the binary string of the second term sequence string; and determining the hash similarity between the first term and the second term from the Hamming distance, where a larger hash similarity value indicates a higher similarity. The hash similarity is calculated as S = 1/(h+1), where S is the hash similarity and h is the Hamming distance.
In this embodiment, each term is first broken into a phrase list, and each word in the phrase list is hashed (the hash value of the word is calculated). Each word is weighted according to its importance, and the weighted hash digit strings are accumulated to obtain the sequence value of the term. After the sequence values are reduced in dimensionality, the Hamming distance between the terms can be calculated, and the formula S = 1/(h+1) gives the hash similarity value, which expresses the degree of similarity between the terms.
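The hash similarity steps can be sketched end to end as follows. Several choices here are assumptions not stated in the text: MD5 as the word hash, a 64-bit fingerprint, and the occurrence count of a word as its weight. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Sequence

BITS = 64  # fingerprint length; an assumption of this sketch


def simhash(tokens: Sequence[str]) -> int:
    """Fold weighted word hashes into one BITS-bit fingerprint."""
    totals = [0] * BITS
    for word, weight in Counter(tokens).items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[:BITS // 8], "big")
        for i in range(BITS):
            # A 1 bit adds the weight, a 0 bit counts as -1 times the weight.
            totals[i] += weight if (word_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # positive sums become 1, all others become 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(x ^ y).count("1")


def simhash_similarity(tokens_a: Sequence[str],
                       tokens_b: Sequence[str]) -> float:
    """Hash similarity S = 1 / (h + 1), with h the Hamming distance."""
    h = hamming_distance(simhash(tokens_a), simhash(tokens_b))
    return 1.0 / (h + 1)
```

Identical phrase lists produce identical fingerprints, so h = 0 and S = 1; larger Hamming distances push S toward 0.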
Embodiment 2
Using the term matching method provided in Embodiment 1, the diagnostic term systems from two hospitals, term system A and term system B, are matched. The process mainly includes the following:
Take a term a_1, "肾和输尿管结石" (kidney and ureteral stones), from term system A, and a term b_1, "肾结石伴有输尿管结石" (kidney stones with ureteral stones), from term system B.
Preprocess term a_1 and term b_1 in the same way:
First perform word segmentation based on a word segmentation dictionary, then remove stop words based on a stop word dictionary, obtaining two phrase lists: a_1 = ['肾', '和', '输尿管', '结石'] (kidney, and, ureter, stone) and b_1 = ['肾', '结石', '伴有', '输尿管', '结石'] (kidney, stone, with, ureter, stone).
Perform oneHot encoding on the phrase lists a_1 and b_1 to obtain the word frequency vectors a_1 = [1,1,1,1,0] and b_1 = [1,2,1,0,1].
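The two word frequency vectors can be reproduced from the phrase lists with a short sketch. The vocabulary order used here, ['肾', '结石', '输尿管', '和', '伴有'], is inferred from the vectors themselves and is an assumption; the text does not state the order. The helper name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def count_vector(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary word occurs in the phrase list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Assumed vocabulary order (kidney, stone, ureter, and, with).
vocabulary = ["肾", "结石", "输尿管", "和", "伴有"]
phrases_a1 = ["肾", "和", "输尿管", "结石"]
phrases_b1 = ["肾", "结石", "伴有", "输尿管", "结石"]

vector_a1 = count_vector(phrases_a1, vocabulary)
vector_b1 = count_vector(phrases_b1, vocabulary)
```

The value 2 in the second vector comes from '结石' (stone) occurring twice in b_1, so the encoding in this example records word counts rather than only presence or absence.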
Calculate the cosine similarity value (Figure PCTCN2020079603-appb-000001), the Jaccard similarity value (Figure PCTCN2020079603-appb-000002), and the Simhash similarity value (Figure PCTCN2020079603-appb-000003) of the word frequency vectors a_1 and b_1. All three similarity calculation methods must be executed, and the similarity values they produce jointly participate in the weighted sum calculation.
Cosine similarity value: calculate the cosine of the angle between the word frequency vectors a_1 and b_1; the larger the value, the higher the similarity.
The cosine similarity value (Figure PCTCN2020079603-appb-000004) is calculated according to the following formula:
Figure PCTCN2020079603-appb-000005
Jaccard similarity value: given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of their union; the larger the Jaccard value, the higher the similarity. Here, set A corresponds to a_1 and set B corresponds to b_1.
The Jaccard similarity value (Figure PCTCN2020079603-appb-000006) is calculated according to the following formula:
Figure PCTCN2020079603-appb-000007
Simhash similarity value: a hash algorithm converts each word into a hash value digit string. For example, the hash algorithm gives 100101 for "肾" (kidney) and 101011 for "结石" (stone).
Each digit string is multiplied by a weight equal to the number of occurrences of the word, and all digit strings are added position by position; a digit 0 is counted as -1. For example, the weighted sums for the phrase lists a_1 and b_1 are {12, 27, -33, 5, -1, 7} and {23, -21, -6, 11, 8, 14}, respectively.
Each weighted sum is converted into a string of 0s and 1s: a position whose value is greater than 0 becomes 1, and a position whose value is less than or equal to 0 becomes 0. For example, the 0/1 strings corresponding to the phrase lists a_1 and b_1 are 110101 and 100111, respectively.
Calculate the Hamming distance h, that is, the number of positions at which the two strings differ. The Hamming distance between terms a_1 and b_1 is 2.
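The position-wise arithmetic of this example can be checked with a short sketch. The helper names are hypothetical. Only the hash strings of two words are given in the text, so the sketch demonstrates the accumulation rule on those two words with illustrative weights and then starts from the stated weighted sums rather than re-deriving them.

```python
from typing import List, Sequence, Tuple


def accumulate(word_hashes: Sequence[Tuple[str, int]]) -> List[int]:
    """Add weighted hash strings position by position; a 0 digit counts as -1."""
    width = len(word_hashes[0][0])
    totals = [0] * width
    for bits, weight in word_hashes:
        for i, bit in enumerate(bits):
            totals[i] += weight if bit == "1" else -weight
    return totals


def to_bit_string(totals: Sequence[int]) -> str:
    """Positions greater than 0 become 1; all other positions become 0."""
    return "".join("1" if value > 0 else "0" for value in totals)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Accumulation rule on the two hash strings given in the text
# (weights 1 and 2 are illustrative occurrence counts).
partial = accumulate([("100101", 1), ("101011", 2)])

# Weighted sums stated in the text for a_1 and b_1.
bits_a1 = to_bit_string([12, 27, -33, 5, -1, 7])
bits_b1 = to_bit_string([23, -21, -6, 11, 8, 14])
h = hamming(bits_a1, bits_b1)
```

With the formula S = 1/(h+1) from Embodiment 1, a Hamming distance of 2 corresponds to a hash similarity of 1/3.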
The Simhash similarity value (Figure PCTCN2020079603-appb-000008) is calculated according to the following formula:
Figure PCTCN2020079603-appb-000009
Assign weights: three weights (Figure PCTCN2020079603-appb-000010) are used to calculate the weighted sum similarity s_11, where (Figure PCTCN2020079603-appb-000011):
Figure PCTCN2020079603-appb-000012
For term a_1 in term system A and any term b_j in term system B, the same process and the same weights (Figure PCTCN2020079603-appb-000013) are used to calculate the weighted sum similarity s_1j of a_1 and b_j, and the maximum value of s_1j is recorded as (Figure PCTCN2020079603-appb-000014). For example, if s_1j = {0.456, 0.538, 0.324, 0.647, 0.489}, then (Figure PCTCN2020079603-appb-000015).
For any term a_i in term system A and any term b_j in term system B, the same process and the same weights (Figure PCTCN2020079603-appb-000016) are used to calculate the weighted sum similarity s_ij of a_i and b_j (Figure PCTCN2020079603-appb-000017). For example, (Figure PCTCN2020079603-appb-000018).
Calculate the total matching degree T_1 of term system A and term system B under the weights (Figure PCTCN2020079603-appb-000019):
Figure PCTCN2020079603-appb-000020
Select multiple groups of weights (Figure PCTCN2020079603-appb-000021) and calculate multiple total matching degrees T_k of term system A and term system B, for example, T_k = {2.212, 1.876, 2.436, 1.943, 2.113, 2.085}.
The maximum of the total matching degrees of term system A and term system B is taken as the result of term matching between the two systems:
The total matching degree corresponding to the third group of weights is 2.436; that is, the matching result between term system A and term system B at k = 3, namely 2.436, is the final matching result.
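Selecting the final result amounts to finding the largest total matching degree and the 1-based index k of its weight group. A minimal sketch follows; the helper name is hypothetical, and the list is the example value set for T_k given above.

```python
from typing import Sequence, Tuple


def argmax_1_based(values: Sequence[float]) -> Tuple[int, float]:
    """Return the 1-based position and value of the largest element."""
    if not values:
        raise ValueError("values must not be empty")
    index = max(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]


total_matching_degrees = [2.212, 1.876, 2.436, 1.943, 2.113, 2.085]
k, best_total = argmax_1_based(total_matching_degrees)
```

The same helper applies to the example list of weighted sum similarities s_1j, where it picks out the best-matching term b_j for a_1.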
In the above steps, the Simhash similarity value is calculated by the Simhash similarity algorithm. Not every calculation step is disclosed above; any Simhash similarity value calculated by conventional technical means can participate in the weighted sum similarity calculation proposed in this application to obtain the degree of matching between terms. Some steps of the algorithm may change as the algorithm evolves, but its final result can still be used in the term matching method proposed in this application.
Embodiment 3
As shown in Fig. 2, a term matching device 200 according to an embodiment of this application includes a memory 202, a processor 204, and a program that is stored in the memory 202 and can run on the processor 204; when executed by the processor 204, the program implements the term matching method of any of the above embodiments. The term matching device 200 achieves the effects of the term matching method of any of the above embodiments, which are not repeated here.
Embodiment 4
As shown in Fig. 3, a terminal 300 according to an embodiment of this application includes the term matching device 200 described in Embodiment 3. When running, the terminal 300 can: calculate multiple similarity values between a first term and a second term using multiple similarity calculation algorithms; and assign a weight to each similarity value, multiply each similarity value by its corresponding weight, and add the products to obtain a weighted sum similarity of the multiple similarity values, where the value of the weighted sum similarity indicates the degree of matching between the first term and the second term. The terminal 300 achieves the effects of the term matching method of any of the above embodiments, which are not repeated here.
Embodiment 5
As shown in Fig. 4, an embodiment of this application also provides a computer-readable storage medium 400 on which a computer program 402 is stored; when executed, the computer program 402 implements the term matching method defined in any of the above embodiments.
其中,计算机程序402被执行时实现:根据多种相似度计算算法分别计算出第一术语与第二术语的多个相似度值;为每个相似度值赋予权重,多个相似 度值分别与对应的权重相乘,乘积结果相加,得到多个相似度值的加权求和相似度,其中,加权求和相似度的值用于表示第一术语与第二术语的匹配度。When the computer program 402 is executed, it is realized: according to multiple similarity calculation algorithms, multiple similarity values of the first term and the second term are respectively calculated; weights are assigned to each similarity value, and the multiple similarity values are compared with each other. The corresponding weights are multiplied, and the product results are added to obtain a weighted summation similarity of multiple similarity values, where the weighted summation similarity value is used to indicate the degree of matching between the first term and the second term.
根据上述技术方案的计算机程序402,可选地,在第一术语系统中指定一个术语,作为第一术语,在第二术语系统中任取一个术语,作为第二术语;根据多种相似度计算算法分别计算出第一术语与第二术语的多个相似度值;为每个相似度值赋予权重,多个相似度值分别与对应的权重相乘,乘积结果相加,得到多个相似度值的加权求和相似度;通过多次改变第二术语的取值,每改变一次第二术语则进行一次计算,从而生成多个加权求和相似度,其中,多个加权求和相似度中的最大值用于表示第一术语系统中指定一个术语与第二术语系统中的第二术语的匹配度。According to the computer program 402 of the above technical solution, optionally, a term is designated in the first terminology system as the first term, and any term in the second terminology system is selected as the second term; calculation based on multiple similarities The algorithm calculates multiple similarity values between the first term and the second term; assigns a weight to each similarity value, and the multiple similarity values are respectively multiplied by the corresponding weights, and the product results are added to obtain multiple similarities The weighted summation similarity of the value; by changing the value of the second term multiple times, a calculation is performed every time the second term is changed to generate multiple weighted summation similarities. Among them, the multiple weighted summation similarities The maximum value of is used to indicate the matching degree between a specified term in the first terminology system and the second term in the second terminology system.
根据上述技术方案的计算机程序402,可选地,在第一术语系统中取一个术语,作为第一术语,在第二术语系统中取一个术语,作为第二术语;根据多种相似度计算算法分别计算出第一术语与第二术语的多个相似度值;为每个相似度值赋予权重,多个相似度值分别与对应的权重相乘,乘积结果相加,得到多个相似度值的加权求和相似度;通过多次改变第一术语的取值与第二术语的取值,进行计算,从而生成多个加权求和相似度;对多个加权求和相似度进行求和运算,生成总匹配度值,其中,总匹配度值用于表示第一术语系统和第二术语系统的匹配度。According to the computer program 402 of the above technical solution, optionally, a term is taken in the first terminology system as the first term, and a term in the second terminology system is taken as the second term; according to multiple similarity calculation algorithms Calculate multiple similarity values of the first term and the second term respectively; assign weights to each similarity value, multiple similarity values are respectively multiplied by the corresponding weights, and the product results are added to obtain multiple similarity values The weighted summation similarity degree; calculates by changing the value of the first term and the value of the second term multiple times to generate multiple weighted summation similarities; performs a summation operation on multiple weighted summation similarities To generate a total matching degree value, where the total matching degree value is used to indicate the matching degree between the first terminology system and the second terminology system.
According to the computer program 402 of the above technical solution, optionally, the calculation process further includes: in the weight-assignment step, performing weighted summation of the multiple similarity values using multiple weight combinations, so that each weight combination yields one total matching degree value and the multiple weight combinations yield multiple total matching degree values; and recording the maximum of the multiple total matching degree values, which represents the matching result of the first terminology system and the second terminology system.
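The search over weight combinations can be sketched as a small grid search. The source does not state how the combinations are chosen; this sketch assumes non-negative weights on a fixed grid that sum to 1, and it takes precomputed per-pair similarity values as input. `weight_grid`, `best_weights`, and `steps` are hypothetical names.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def weight_grid(n: int, steps: int) -> Iterator[Tuple[float, ...]]:
    """Yield all n-tuples of multiples of 1/steps that sum to 1 (assumed grid)."""
    for combo in product(range(steps + 1), repeat=n):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)


def best_weights(pair_scores: Sequence[Sequence[float]],
                 steps: int = 10) -> Tuple[float, Tuple[float, ...]]:
    """Return (largest total matching degree, weights that produced it).

    pair_scores holds one row per term pair; each row lists that pair's
    similarity values in a fixed measure order.
    """
    n = len(pair_scores[0])
    best = None
    for weights in weight_grid(n, steps):
        total = sum(sum(w * s for w, s in zip(weights, row))
                    for row in pair_scores)
        if best is None or total > best[0]:
            best = (total, weights)
    return best
```

Under the sum-to-one assumption the total is linear in the weights, so the maximum falls on a grid corner that places all weight on a single measure. The source does not say what constraint it places on the combinations, so a practical variant might restrict the grid or score against known matches instead.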
According to the computer program 402 of the above technical solution, optionally, calculating the multiple similarity values between the first term and the second term using multiple similarity calculation algorithms includes: separately calculating a cosine similarity value, a Jaccard similarity value, and a hash similarity value between the first term and the second term.
According to the computer program 402 of the above technical solution, optionally, calculating the cosine similarity value between the first term and the second term includes: segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop-word dictionary, to generate a first word list corresponding to the first term and a second word list corresponding to the second term; encoding the first word list and the second word list to obtain a first term frequency vector corresponding to the first word list and a second term frequency vector corresponding to the second word list; and calculating the cosine of the angle between the first term frequency vector and the second term frequency vector, wherein the cosine value is the cosine similarity value of the two term frequency vectors, and a larger cosine value indicates a higher similarity.
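The cosine step can be sketched as below. Dictionary-based Chinese word segmentation is outside the scope of a short example, so whitespace splitting and a tiny hypothetical stop-word set (`STOP_WORDS`) stand in for the segmentation and stop-word dictionaries; both are assumptions.

```python
import math
from collections import Counter
from typing import List

# Hypothetical stop-word dictionary, for illustration only.
STOP_WORDS = {"of", "the"}


def tokenize(term: str) -> List[str]:
    """Stand-in for dictionary-based segmentation: split on whitespace,
    then drop stop words."""
    return [w for w in term.lower().split() if w not in STOP_WORDS]


def cosine_similarity(first_term: str, second_term: str) -> float:
    """Cosine of the angle between the two term frequency vectors."""
    ca, cb = Counter(tokenize(first_term)), Counter(tokenize(second_term))
    vocab = sorted(set(ca) | set(cb))      # shared encoding for both word lists
    va = [ca[w] for w in vocab]            # first term frequency vector
    vb = [cb[w] for w in vocab]            # second term frequency vector
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0 or norm_b == 0:
        return 0.0                         # convention chosen here for empty word lists
    return dot / (norm_a * norm_b)
```

With these stand-ins, "fracture of the femur" and "femur fracture" produce identical term frequency vectors, so their cosine similarity is 1 up to floating-point rounding.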
According to the computer program 402 of the above technical solution, optionally, calculating the Jaccard similarity value between the first term and the second term includes: segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop-word dictionary, to generate a first word list corresponding to the first term and a second word list corresponding to the second term; encoding the first word list and the second word list to obtain a first term frequency vector corresponding to the first word list and a second term frequency vector corresponding to the second word list; and calculating the ratio of the intersection to the union of the first term frequency vector and the second term frequency vector to obtain the Jaccard similarity value, wherein a larger Jaccard similarity value indicates a higher similarity.
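A minimal sketch of the Jaccard step follows. The text speaks of the intersection and union of term frequency vectors without defining them further; this sketch assumes the common set-based reading (distinct words shared, divided by distinct words overall). As before, whitespace splitting and the hypothetical `STOP_WORDS` set stand in for the two dictionaries.

```python
from typing import List

# Hypothetical stop-word dictionary, for illustration only.
STOP_WORDS = {"of", "the"}


def tokenize(term: str) -> List[str]:
    """Stand-in for dictionary-based segmentation plus stop-word removal."""
    return [w for w in term.lower().split() if w not in STOP_WORDS]


def jaccard_similarity(first_term: str, second_term: str) -> float:
    """|A intersect B| / |A union B| over the distinct words of each term."""
    a, b = set(tokenize(first_term)), set(tokenize(second_term))
    union = a | b
    if not union:
        return 0.0  # convention chosen here when both word lists are empty
    return len(a & b) / len(union)
```

A multiset variant (sum of element-wise minima over sum of element-wise maxima of the term frequency vectors) would also fit the wording; the two agree whenever no word repeats within a term.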
According to the computer program 402 of the above technical solution, optionally, calculating the hash similarity value between the first term and the second term includes: segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop-word dictionary, to generate a first word list corresponding to the first term and a second word list corresponding to the second term; converting each word in the first word list and the second word list into a hash value digit string, and multiplying the hash value digit string by the weight of the word to obtain a sequence string for each word; adding the sequence strings of the words in the first word list to obtain a first term sequence string corresponding to the first word list, and adding the sequence strings of the words in the second word list to obtain a second term sequence string corresponding to the second word list; converting the first term sequence string and the second term sequence string into binary strings; calculating the Hamming distance between the binary string of the first term sequence string and the binary string of the second term sequence string; and determining the hash similarity between the first term and the second term according to the Hamming distance, wherein a larger hash similarity value indicates a higher similarity. The hash similarity is calculated as S = 1/(h + 1), where S is the hash similarity and h is the Hamming distance.
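The hash similarity steps resemble the well-known SimHash scheme, and a sketch under that reading is given below. The choice of MD5 truncated to 64 bits, the use of word frequency as the word weight, and all function names are assumptions; the source does not specify the hash function or how word weights are obtained.

```python
import hashlib
from collections import Counter
from typing import Sequence

BITS = 64  # fingerprint width (assumption)


def word_hash(word: str) -> int:
    """64-bit hash of a word, taken from the first 8 bytes of its MD5 digest."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")


def simhash(words: Sequence[str]) -> int:
    """Weighted per-bit sum of word hashes, reduced to a binary fingerprint."""
    totals = [0] * BITS
    for word, weight in Counter(words).items():  # weight = word frequency (assumption)
        h = word_hash(word)
        for i in range(BITS):
            # A 1 bit adds the weight, a 0 bit subtracts it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, t in enumerate(totals):
        if t > 0:                                # positive sum -> bit 1, otherwise 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(x ^ y).count("1")


def hash_similarity(first_words: Sequence[str], second_words: Sequence[str]) -> float:
    """S = 1 / (h + 1), with h the Hamming distance between the fingerprints."""
    h = hamming_distance(simhash(first_words), simhash(second_words))
    return 1.0 / (h + 1)
```

Identical word lists give h = 0 and therefore S = 1; each additional differing bit lowers S, so the value always lies in the interval (0, 1].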
The term matching method, apparatus, terminal, and computer-readable storage medium disclosed in the above embodiments of this application enable automatic matching of terms between terminology systems (terminology dictionaries), replace manual operation, reduce the error rate, and help promote the integration, analysis, and reuse of medical data.
The embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, Compact Disc-Read Only Memory (CD-ROM), and optical storage) that contain computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. Each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable term matching device to produce a machine, such that the instructions executed by the processor of the computer or other programmable term matching device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable term matching device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable term matching device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The word "comprising" does not exclude the presence of components or steps that are not listed. The word "a" or "an" preceding a component does not exclude the presence of a plurality of such components. This application can be implemented by means of hardware comprising several distinct components and by means of a suitably programmed computer. Several of the enumerated means may be embodied by one and the same item of hardware.

Claims (12)

  1. A term matching method, comprising:
    calculating multiple similarity values between a first term and a second term using multiple similarity calculation algorithms; and
    assigning a weight to each similarity value, multiplying each of the multiple similarity values by its corresponding weight, and adding the products to obtain a weighted-sum similarity of the multiple similarity values, wherein the value of the weighted-sum similarity indicates the degree of matching between the first term and the second term.
  2. The method according to claim 1, further comprising:
    designating a term in a first terminology system as the first term, and taking a term in a second terminology system as the second term.
  3. The method according to claim 2, further comprising, after obtaining the weighted-sum similarity of the multiple similarity values:
    changing the second term multiple times, and performing one weighted-sum similarity calculation each time the second term is changed, to generate multiple weighted-sum similarities, wherein the maximum of the multiple weighted-sum similarities indicates the degree of matching between the first term and the second term in the second terminology system.
  4. The method according to claim 2, further comprising, after obtaining the weighted-sum similarity of the multiple similarity values:
    changing the first term and the second term multiple times, and performing one weighted-sum similarity calculation each time the first term and the second term are changed, to generate multiple weighted-sum similarities; and
    summing the multiple weighted-sum similarities to generate a total matching degree value, wherein the total matching degree value indicates the degree of matching between the first terminology system and the second terminology system.
  5. The method according to claim 4, wherein assigning a weight to each similarity value, multiplying each of the multiple similarity values by its corresponding weight, and adding the products to obtain the weighted-sum similarity of the multiple similarity values comprises:
    performing weighted summation of the multiple similarity values using multiple weight combinations, to obtain multiple weighted-sum similarities corresponding to each weight combination;
    and wherein summing the multiple weighted-sum similarities to generate the total matching degree value comprises:
    for each of the multiple weight combinations, summing the corresponding multiple weighted-sum similarities, to generate multiple total matching degree values; and
    recording the maximum of the multiple total matching degree values, wherein the maximum of the multiple total matching degree values represents the matching result of the first terminology system and the second terminology system.
  6. The method according to any one of claims 1 to 5, wherein calculating the multiple similarity values between the first term and the second term using multiple similarity calculation algorithms comprises:
    separately calculating a cosine similarity value, a Jaccard similarity value, and a hash similarity value between the first term and the second term.
  7. The method according to claim 6, wherein calculating the cosine similarity value between the first term and the second term comprises:
    segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop-word dictionary, to generate a first word list corresponding to the first term and a second word list corresponding to the second term;
    encoding the first word list and the second word list to obtain a first term frequency vector corresponding to the first word list and a second term frequency vector corresponding to the second word list; and
    calculating the cosine of the angle between the first term frequency vector and the second term frequency vector, wherein the cosine value is the cosine similarity value of the first term frequency vector and the second term frequency vector.
  8. The method according to claim 6, wherein calculating the Jaccard similarity value between the first term and the second term comprises:
    segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop-word dictionary, to generate a first word list corresponding to the first term and a second word list corresponding to the second term;
    encoding the first word list and the second word list to obtain a first term frequency vector corresponding to the first word list and a second term frequency vector corresponding to the second word list; and
    calculating the ratio of the intersection to the union of the first term frequency vector and the second term frequency vector to obtain the Jaccard similarity value.
  9. The method according to claim 6, wherein calculating the hash similarity value between the first term and the second term comprises:
    segmenting the first term and the second term into words based on a word segmentation dictionary, and removing stop words from the first term and the second term based on a stop-word dictionary, to generate a first word list corresponding to the first term and a second word list corresponding to the second term;
    converting each word in the first word list and the second word list into a hash value digit string, and multiplying the hash value digit string by the weight of the word to obtain a sequence string for each word;
    adding the sequence strings of the words in the first word list to obtain a first term sequence string corresponding to the first word list, and adding the sequence strings of the words in the second word list to obtain a second term sequence string corresponding to the second word list;
    converting the first term sequence string and the second term sequence string into binary strings;
    calculating the Hamming distance between the binary string of the first term sequence string and the binary string of the second term sequence string; and
    determining the hash similarity value between the first term and the second term according to the Hamming distance.
  10. A term matching apparatus, comprising: a memory, a processor, and a program stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the term matching method according to any one of claims 1 to 9.
  11. A terminal, comprising:
    the term matching apparatus according to claim 10.
  12. A computer-readable storage medium storing a computer program, wherein the computer program, when executed, implements the term matching method according to any one of claims 1 to 9.
PCT/CN2020/079603 2019-09-16 2020-03-17 Term matching method and apparatus, terminal, and computer readable storage medium WO2021051763A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910869178.2 2019-09-16
CN201910869178.2A CN112507107A (en) 2019-09-16 2019-09-16 Term matching method, device, terminal and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021051763A1 (en) 2021-03-25

Family

ID=74883421

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/079603 WO2021051763A1 (en) 2019-09-16 2020-03-17 Term matching method and apparatus, terminal, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112507107A (en)
WO (1) WO2021051763A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470829A (en) * 2021-07-23 2021-10-01 平安科技(深圳)有限公司 User portrait generation method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10214271A (en) * 1996-11-28 1998-08-11 Nippon Telegr & Teleph Corp <Ntt> Method and device for word association and storage medium storing word association program
US20080294457A1 (en) * 2007-05-25 2008-11-27 Cordery Robert A Real-time medical records
US20130254220A1 (en) * 2009-01-30 2013-09-26 Lexisnexis Methods and systems for creating and using an adaptive thesaurus
CN109192258A (en) * 2018-08-14 2019-01-11 平安医疗健康管理股份有限公司 Medical data method for transformation, device, computer equipment and storage medium
CN109582961A (en) * 2018-11-28 2019-04-05 重庆邮电大学 A kind of efficient robot data similarity calculation algorithm
CN109753555A (en) * 2018-11-30 2019-05-14 平安科技(深圳)有限公司 Word match method, apparatus, equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021553A (en) * 2017-09-30 2018-05-11 北京颐圣智能科技有限公司 Word treatment method, device and the computer equipment of disease term
CN107977347B (en) * 2017-12-04 2021-12-21 海南云江科技有限公司 Topic duplication removing method and computing equipment
CN109255021A (en) * 2018-11-01 2019-01-22 北京京航计算通讯研究所 Data query method based on quality text similarity

Also Published As

Publication number Publication date
CN112507107A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN110032728B (en) Conversion method and device for disease name standardization
Ding et al. A Hybrid Feature Selection Algorithm Based on Information Gain and Sequential Forward Floating Search①
WO2019208070A1 (en) Question/answer device, question/answer method, and program
WO2021213160A1 (en) Medical query method and apparatus based on graph neural network, and computer device and storage medium
CN113470664B (en) Voice conversion method, device, equipment and storage medium
WO2021051763A1 (en) Term matching method and apparatus, terminal, and computer readable storage medium
WO2022016995A1 (en) Question and answer library construction method and apparatus, and electronic device and storage medium
CN113539408A (en) Medical report generation method, training device and training equipment of model
CN110991182A (en) Word segmentation method and device for professional field, storage medium and electronic equipment
CN113409827B (en) Voice endpoint detection method and system based on local convolution block attention network
CN114820450A (en) CT angiography image classification method suitable for Li's artificial liver treatment
Teixeira et al. Patient Privacy in Paralinguistic Tasks.
CN111897982B (en) Medical CT image storage and retrieval method
Moree et al. A computational history of prime numbers and Riemann zeros
Brown et al. Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage
CN116743349A (en) Paillier ciphertext summation method, system, device and storage medium
CN116757783A (en) Product recommendation method and device
WO2022166676A1 (en) Method and apparatus for estimating segmented word frequency in differential privacy protection data
CN115374251A (en) Dense retrieval method based on syntax comparison learning
CN110175220B (en) Document similarity measurement method and system based on keyword position structure distribution
Starikovskaya et al. $ L_p $ Pattern Matching in a Stream
Voges et al. A two-level scheme for quality score compression
WO2021072892A1 (en) Legal provision search method based on neural network hybrid model, and related device
CN113554053B (en) Method for comparing similarity of traditional Chinese medicine prescriptions
CN113688119B (en) Medical database construction method based on artificial intelligence and related equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20865313

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 250822)

122 Ep: pct application non-entry in european phase

Ref document number: 20865313

Country of ref document: EP

Kind code of ref document: A1