CN111931477B - Text matching method and device, electronic equipment and storage medium - Google Patents

Text matching method and device, electronic equipment and storage medium

Info

Publication number
CN111931477B
Authority
CN
China
Prior art keywords
word
text
candidate
matching
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011045975.8A
Other languages
Chinese (zh)
Other versions
CN111931477A (en
Inventor
陈曦
向玥佳
刘博�
林镇溪
文瑞
管冲
孙继超
高文龙
张子恒
许祈馨
徐超
杨奕凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011045975.8A
Publication of CN111931477A
Application granted
Publication of CN111931477B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Abstract

The application discloses a text matching method, a text matching device, electronic equipment and a storage medium. The text matching method comprises the following steps: acquiring a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched, wherein the reference dictionary is a dictionary in the field to which the content of the text to be matched belongs and comprises at least one reference word; combining the plurality of text single words to obtain candidate words semantically associated with at least one reference word; generating a matching degree between the candidate word and a target reference word according to the editing distance between the candidate word and at least one reference word under a target matching type; fusing the matching degrees between the candidate word and the target reference words under each matching type; and selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result, and outputting the reference text.

Description

Text matching method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text matching method, apparatus, electronic device, and storage medium.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language people use every day, and is closely related to linguistics. Natural language processing techniques generally include text processing, semantic understanding, machine translation, robot question answering, and knowledge graphs.
Text matching is one application direction of text processing and plays an important role in real life, for example in duplicate checking of papers or online disease queries in medical scenarios. Current text matching algorithms typically determine whether two texts match based on edit distance, using either characters or words as the minimum unit. An edit distance algorithm with characters as the minimum unit avoids word segmentation errors and copes better with non-standard expressions and wrongly written characters, but the minimum meaningful unit of natural language is the word, so this approach has difficulty exploiting the large amount of word-based prior knowledge. An edit distance algorithm with words as the minimum unit can exploit that prior knowledge, but its effect is often degraded by word segmentation errors. As a result, the accuracy of current text matching methods is low.
Disclosure of Invention
The application provides a text matching method, a text matching device, electronic equipment and a storage medium, which can improve the accuracy of text matching.
The application provides a text matching method, which comprises the following steps:
acquiring a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched, wherein the reference dictionary is a dictionary in the field to which the content of the text to be matched belongs, and comprises at least one reference word;
combining a plurality of text single words to obtain candidate words associated with at least one reference word meaning;
generating a matching degree between the candidate word and a target reference word according to the editing distance between the candidate word and the reference word under at least one matching type;
fusing the matching degree between the candidate words and the target reference words under each matching type;
and selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result, and outputting the reference text.
Correspondingly, this application still provides a text matching device, includes:
the acquisition module is used for acquiring a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched, wherein the reference dictionary is a dictionary in the field to which the content of the text to be matched belongs, and the reference dictionary comprises at least one reference word;
the combination module is used for combining a plurality of text single words to obtain candidate words associated with at least one reference word meaning;
the generating module is used for generating the matching degree between the candidate word and the target reference word according to the editing distance between the candidate word and the reference word under at least one matching type;
the fusion module is used for fusing the matching degree between the candidate words and the target reference words under each matching type;
and the output module is used for selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result and outputting the reference text.
Optionally, in some embodiments of the present application, the generating module includes:
the determining submodule is used for determining a target matching type corresponding to each reference word according to the semantic association relation between the candidate words and the reference words;
the calculation sub-module is used for calculating the editing distance between the candidate word and at least one reference word in the determined target matching type based on the determined target matching type, and determining the reference word with the minimum editing distance as a target reference word;
and the generating sub-module is used for generating the matching degree between the candidate word and the target reference word according to the editing distance between the candidate word and the target reference word under at least one matching type.
Optionally, in some embodiments of the present application, the calculation submodule includes:
the first calculating unit is used for calculating a first editing distance between the candidate word and at least one reference word under the synonym matching type, and determining the reference word with the minimum first editing distance as a first target reference word;
the second calculating unit is used for calculating a second editing distance between the candidate word and at least one reference word under the hypernym matching type, and determining the reference word with the minimum second editing distance as a second target reference word;
and the third calculating unit is used for calculating a third editing distance between the candidate word and at least one reference word under the matching type of the weighted words, and determining the reference word with the minimum third editing distance as a third target reference word.
Optionally, in some embodiments of the present application, the first computing unit is specifically configured to:
selecting a synonym cluster set in the reference dictionary, wherein the synonym cluster set comprises a plurality of synonym clusters, and each synonym cluster comprises at least two reference words with the same word sense;
determining a synonym cluster with the same semantic meaning as the candidate word to obtain a target synonym cluster;
calculating a first editing distance between the candidate word and each reference word in the target synonym cluster, and determining a first target reference word from the reference word with the minimum first editing distance;
the generation submodule is specifically configured to: and generating a first matching degree between the candidate word and the first target reference word according to a first editing distance between the candidate word and the first target reference word.
Optionally, in some embodiments of the present application, the second calculating unit is specifically configured to:
determining the superior-inferior relation between the candidate word and at least one reference word according to the semantics of the candidate word and the semantics of each reference word;
calculating a second editing distance between the candidate word and the corresponding superior reference word based on the determined superior-inferior relation, and determining the reference word with the minimum second editing distance as a second target reference word;
the generation submodule is specifically configured to: and generating a second matching degree between the candidate word and a second target reference word according to a second editing distance between the candidate word and the second target reference word.
Optionally, in some embodiments of the present application, the third computing unit includes:
the acquisition subunit is used for acquiring a weight value pre-established by each reference word;
the calculating subunit is used for calculating the similarity between the candidate word and each reference word and determining the reference word with the similarity larger than a preset value as a word to be selected;
the determining subunit is configured to calculate a third editing distance between the candidate word and each determined word to be selected according to the weight of that word to be selected, and determine the word to be selected with the minimum third editing distance as a third target reference word;
the generation submodule is specifically configured to: and generating a third matching degree between the candidate word and a third target reference word according to a third editing distance between the candidate word and the third target reference word.
Optionally, in some embodiments of the present application, the determining subunit is specifically configured to:
and calculating a third editing distance between the candidate word and each word to be selected whose weight is smaller than the preset weight, and determining the word to be selected with the minimum third editing distance as a third target reference word.
Optionally, in some embodiments of the present application, the fusion module is specifically configured to:
acquiring a preset weight coefficient corresponding to each matching type;
calculating the product of the obtained weight coefficient and the matching degree between the candidate word and the target reference word under the corresponding matching type to obtain the weighted matching degree corresponding to each matching type;
and fusing the weighted matching degrees corresponding to the matching types.
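The weighting described above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the text does not state here how the weighted matching degrees are fused, so summation is an assumption, and the function name, type names and coefficient values are all hypothetical.

```python
from typing import Dict


def fuse_matching_degrees(
    degrees: Dict[str, float],
    coefficients: Dict[str, float],
) -> float:
    """Weight each matching degree by its type's coefficient, then fuse.

    Assumption: fusion is a plain sum of the weighted matching degrees.
    """
    weighted = {
        match_type: coefficients[match_type] * degree
        for match_type, degree in degrees.items()
    }
    return sum(weighted.values())


# Hypothetical matching degrees and preset weight coefficients.
degrees = {"synonym": 0.8, "hypernym": 0.5, "weighted_word": 0.2}
coefficients = {"synonym": 0.5, "hypernym": 0.3, "weighted_word": 0.2}
fused = fuse_matching_degrees(degrees, coefficients)
```

With these illustrative values the fused result is approximately 0.59. A matching type that has a degree but no coefficient raises a KeyError, which makes a missing configuration visible instead of silently ignoring it.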
Optionally, in some embodiments of the present application, the combination module is specifically configured to:
identifying the part of speech of each text single character;
and removing the text single words with parts of speech as auxiliary words, and arranging and combining the reserved text single words to obtain candidate words associated with at least one reference word meaning.
In the present application, a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched are first acquired, the reference dictionary being a dictionary in the field to which the content of the text to be matched belongs and comprising at least one reference word. The plurality of text single words are then combined to obtain a candidate word semantically associated with at least one reference word, and a matching degree between the candidate word and a target reference word is generated according to the editing distance between the candidate word and the target reference word under at least one matching type. Next, the matching degrees between the candidate word and the target reference words under each matching type are fused. Finally, a reference text matched with the text to be matched is selected from a preset reference text library according to the fusion result and output, so that the accuracy of text matching can be improved.
Drawings
In order to illustrate the technical solutions in the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1a is a schematic view of a scene of a text matching method provided in the present application;
FIG. 1b is a schematic flow chart of a text matching method provided herein;
FIG. 2a is another schematic flow chart of a text matching method provided in the present application;
FIG. 2b is a schematic diagram of another scenario of the text matching method provided in the present application;
FIG. 3 is a schematic structural diagram of a text matching apparatus provided in the present application;
FIG. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings in the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides a text matching method and device, electronic equipment and a storage medium.
The text matching device can be integrated in a server. The server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (content delivery network), and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
For example, referring to FIG. 1a, the present application provides a text matching apparatus, hereinafter referred to as a matching apparatus, which is integrated in a server. Suppose a user queries "what is a fever symptom" through a terminal. When the server receives the text matching request sent by the terminal, it acquires a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched, the reference dictionary being a dictionary in the field to which the content of the text to be matched belongs and comprising at least one reference word. The server then combines the plurality of text single words to obtain a candidate word semantically associated with at least one reference word, and generates a matching degree between the candidate word and a target reference word according to the edit distance between the candidate word and the target reference word under at least one matching type. Next, the server fuses the matching degrees between the candidate word and the target reference words under each matching type. Finally, the server selects a reference text matched with the text to be matched from a preset reference text library according to the fusion result and outputs the reference text.
In the text matching method of the present application, after the matching degree between the candidate word and the target reference word is generated according to the editing distance between them under at least one matching type, the matching degrees under each matching type are fused. This makes it possible to use a large amount of word-based prior knowledge, so that more reference words are obtained and the accuracy of subsequent text matching is improved. In addition, because characters are taken as the unit, errors that would occur when segmenting the text to be matched into words are avoided, which further improves the accuracy of text matching.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
The terms appearing in this application are explained first:
Dictionary: a word stock, that is, a collection of word data that is stored in a database and retrieved by lookup.
Editing distance: the minimum number of editing operations required to convert one character string into another, which quantifies the degree of difference between the two character strings. The editing operations are adding a character, deleting a character, and replacing a character.
Superior-subordinate relation: when one of two words includes the other in meaning, the two words are said to have a superior-subordinate relation. The word representing the superior concept is the hypernym, and the word representing the subordinate concept is the hyponym. The superior-subordinate relation between words is hierarchical and transitive.
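The editing distance defined above is commonly computed with the Levenshtein dynamic-programming recurrence. The Python sketch below is one standard way to do so, with every operation costing 1. The function name is illustrative, and the patent text does not say which implementation it uses.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character additions, deletions and
    replacements needed to turn `source` into `target`."""
    # previous[j] is the distance between the prefix of `source` processed
    # so far and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, src_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, tgt_char in enumerate(target, start=1):
            replace_cost = 0 if src_char == tgt_char else 1
            current.append(min(
                previous[j] + 1,                 # delete src_char
                current[j - 1] + 1,              # add tgt_char
                previous[j - 1] + replace_cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```

Because Python strings are sequences of Unicode code points, the same function works character by character on Chinese text, which matches the character-level unit used in this application.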
A text matching method, comprising: the method comprises the steps of obtaining a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched, combining the text single words to obtain a candidate word associated with at least one reference word meaning, generating a matching degree between the candidate word and a target reference word according to an editing distance between the candidate word and the target reference word in at least one matching type, fusing the matching degree between the candidate word and the target reference word in each matching type, selecting a reference text matched with the text to be matched from a preset reference text library according to a fusion result, and outputting the reference text.
Referring to fig. 1b, fig. 1b is a schematic flow chart of a text matching method provided in the present application. The specific process of the text matching method can be as follows:
101. Acquire a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched.
The reference dictionary is a dictionary belonging to the field of the content of the text to be matched, and includes at least one reference word, for example, if the field of the content of the text to be matched is a medical field, the corresponding reference dictionary is a medical dictionary, and specifically, the text to be matched and the reference dictionary corresponding to the text to be matched may be stored in a local database or pulled through accessing a network interface, and are specifically determined according to actual conditions.
102. Combine the plurality of text single words to obtain a candidate word semantically associated with at least one reference word.
It can be understood that characters and words are text information of different dimensions. The present application builds on a text matching algorithm that uses characters as the minimum unit and fuses a dictionary into it to obtain the text that matches the text to be matched, so the text single words need to be combined. The text single words could be combined based on their phonemes. A phoneme is the minimum unit, or minimum speech segment, that forms a syllable; here it refers to each unit of Chinese pinyin, such as a, o, e, b, p, m, and the like. Pinyin contains 23 initials and 24 finals, and each final can carry 5 tones (tone 1, tone 2, tone 3, tone 4, and the neutral tone), so that pronunciations of 23 + 24 × 5 = 143 different phonemes can be collected.
However, combining text single words based on phonemes can go wrong. For example, the Chinese words for "simple" and "quarantine" sound alike: their first characters are homophones, and their second characters (meaning "easy" and "epidemic") are homophones as well. Combining by phoneme may therefore produce meaningless phrases that mix the characters of the two words, which degrades subsequent text matching. The present application therefore combines the text single words based on their parts of speech. That is, optionally, the step "combining a plurality of text single words to obtain a candidate word associated with at least one reference word meaning" may specifically include:
(11) identifying the part of speech of each text single character;
(12) and removing the text single words with parts of speech as auxiliary words, and arranging and combining the reserved text single words to obtain candidate words associated with at least one reference word meaning.
Auxiliary words are also called particles. As a grammatical term, an auxiliary word is a function word that attaches to other words, phrases or sentences and plays an auxiliary role. Auxiliary words are generally used before, within or after a sentence to express various moods, or within a sentence to express a structural relationship. For example, in the text "quarantine of people entering the area in the building of the simple building", the "I" in the text is an auxiliary word. In the scheme of the application, text single words whose part of speech is auxiliary word are removed. Removing auxiliary words does not affect the semantics of the other words in the text, and it reduces the workload of combining text single words, which improves the efficiency of subsequent text matching.
For example, in a text of "quarantine is performed on people entering a region in a building of a simple building", text words from which auxiliary words are removed include "in", "simple", "easy", "building", "house", "middle", "pair", "enter", "region", "in", "person", "enter", "go", "check", and "quarantine", so that the remaining text words are arranged and combined to obtain a candidate word "in the building", and certainly, other candidate words are also included.
It should be noted that, in the present application, the candidate words obtained by combining are words semantically related to at least one reference word, such as hypernyms, synonyms, or the same words.
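Steps (11) and (12) can be sketched as follows. This is a rough illustration under stated assumptions: a fixed set of particle characters stands in for a real part-of-speech tagger, "arranging and combining" is simplified to contiguous runs of the retained characters, and semantic association is approximated by membership in a supplied word set. The names AUXILIARY_CHARS, candidate_words, related_words and max_len are hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical stand-in for part-of-speech tagging: characters that are
# treated as auxiliary words (particles) and removed before combination.
AUXILIARY_CHARS: Set[str] = {"的", "了", "着"}


def candidate_words(
    chars: Iterable[str],
    related_words: Set[str],
    max_len: int = 4,
) -> List[str]:
    """Drop auxiliary characters, then join contiguous runs of the retained
    characters and keep the runs found in `related_words`."""
    kept = [c for c in chars if c not in AUXILIARY_CHARS]
    found: List[str] = []
    for start in range(len(kept)):
        longest_end = min(start + max_len, len(kept))
        for end in range(start + 1, longest_end + 1):
            word = "".join(kept[start:end])
            if word in related_words and word not in found:
                found.append(word)
    return found


print(candidate_words("发烧的症状", {"发烧", "症状"}))  # prints ['发烧', '症状']
```

Removing the particle first shrinks the number of runs that have to be checked, which is the efficiency gain the description attributes to this step.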
103. Generate the matching degree between the candidate word and the target reference word according to the editing distance between the candidate word and at least one reference word under the target matching type.
In this application, the matching type refers to the way in which the candidate word is matched against the target reference word. For example, under the synonym matching type, synonym matching is performed on the candidate word and the target reference word, that is, it is judged whether the candidate word and the target reference word are synonyms. Specifically, a target matching type corresponding to each reference word may be determined according to the semantic association relationship between the candidate word and the reference word, and the matching degree between the candidate word and the target reference word is then generated according to the determined target matching type. That is, optionally, in some embodiments, the step "generating the matching degree between the candidate word and the target reference word according to an edit distance between the candidate word and at least one reference word in the target matching type" may specifically include:
(21) determining a target matching type corresponding to each reference word according to the semantic association relationship between the candidate words and the reference words;
(22) calculating the editing distance between the candidate word and at least one reference word in the determined target matching type based on the determined target matching type, and determining the reference word with the minimum editing distance as a target reference word;
(23) and generating the matching degree between the candidate words and the target reference words according to the editing distance between the candidate words and the target reference words under at least one matching type.
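Steps (22) and (23) can be sketched as follows for a single matching type. The passage above does not give the formula that turns an editing distance into a matching degree, so the normalization used here (1 minus the distance divided by the longer word length) is an assumption made only for illustration, and the name best_match is hypothetical.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance with unit costs."""
    row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        new_row = [i]
        for j, char_b in enumerate(b, start=1):
            new_row.append(min(
                row[j] + 1,                      # delete char_a
                new_row[j - 1] + 1,              # add char_b
                row[j - 1] + (char_a != char_b), # replace, or keep if equal
            ))
        row = new_row
    return row[-1]


def best_match(candidate: str, reference_words: Iterable[str]) -> Tuple[str, float]:
    """Return the target reference word (minimum editing distance) and an
    assumed matching degree in the range [0, 1]."""
    words = list(reference_words)
    target = min(words, key=lambda word: edit_distance(candidate, word))
    distance = edit_distance(candidate, target)
    longest = max(len(candidate), len(target)) or 1  # guard against "" vs ""
    return target, 1.0 - distance / longest


print(best_match("发烧", ["发热", "咳嗽"]))  # prints ('发热', 0.5)
```

When several reference words tie on distance, min returns the first one in iteration order; an empty reference list raises ValueError.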
For example, if the candidate word "rose" is a hyponym of the reference word "flower", the semantic association relationship between the candidate word and the reference word is a superior-subordinate relationship, and the target matching type corresponding to the reference word "flower" is determined to be the hypernym matching type. The edit distance between the candidate word and the reference word "flower" under the hypernym matching type is then calculated. If "flower" is the only reference word that has a superior-subordinate relationship with the candidate word "rose", the matching degree between "rose" and "flower" is generated according to that edit distance. It can be understood that when the candidate word has a superior-subordinate relationship with several reference words, the edit distance between the candidate word and each of them is calculated under the hypernym matching type, and the reference word with the minimum edit distance is determined as the target reference word.
For another example, if the candidate word "salt" is a synonym of the reference word "sodium chloride", the semantic association relationship between the candidate word and the reference word is a synonym relationship, and the target matching type corresponding to the reference word "sodium chloride" is determined to be the synonym matching type. The edit distance between the candidate word and the reference word "sodium chloride" under the synonym matching type is then calculated. If "sodium chloride" is the only reference word that has a synonym relationship with the candidate word "salt", the matching degree between "salt" and "sodium chloride" is generated according to that edit distance. Likewise, when the candidate word has a synonym relationship with several reference words, the edit distance between the candidate word and each of them is calculated under the synonym matching type, and the reference word with the minimum edit distance is determined as the target reference word.
For another example, a weight may be assigned to each reference word in advance, for example 0.5 to "severity" and 0.1 to "seem". Under the weighted-word matching type, the relationship between weight and edit distance may be set according to the actual situation. For example, the edit distance corresponding to the weight 0.5 may be set to 0.1 and the edit distance corresponding to the weight 0.1 to 0.7; alternatively, the edit distance corresponding to the weight 0.5 may be set to 0.7 and the edit distance corresponding to the weight 0.1 to 0.1. The choice depends on the actual situation.
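As a small illustration of how the hypernym reference words for a candidate could be looked up, the sketch below walks a child-to-parent table and relies on the transitivity of the superior-subordinate relation. The table contents, the "plant" entry and the function name are hypothetical; the application does not describe how the relation is stored.

```python
from typing import Dict, List

# Hypothetical superior-subordinate table: each word maps to its direct hypernym.
HYPERNYM_OF: Dict[str, str] = {"rose": "flower", "flower": "plant"}


def hypernyms(word: str, hypernym_of: Dict[str, str]) -> List[str]:
    """Collect every hypernym of `word`, nearest first, by following the
    table transitively; the `seen` set stops on cyclic data."""
    chain: List[str] = []
    seen = {word}
    while word in hypernym_of and hypernym_of[word] not in seen:
        word = hypernym_of[word]
        chain.append(word)
        seen.add(word)
    return chain


print(hypernyms("rose", HYPERNYM_OF))  # prints ['flower', 'plant']
```

Each word returned is a candidate target under the hypernym matching type; the one with the minimum editing distance to the candidate word would then be chosen as the target reference word.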
Optionally, in some embodiments, the three matching types may be combined to obtain a reference text matched with the text to be matched in the following step, that is, optionally, the step "based on the determined target matching type, calculating an edit distance between the candidate word and at least one reference word in the determined target matching type, and determining the reference word with the minimum edit distance as the target reference word" may specifically include:
(a) calculating a first editing distance between the candidate word and at least one reference word under the synonym matching type, and determining the reference word with the minimum first editing distance as a first target reference word;
(b) calculating a second editing distance between the candidate word and at least one reference word under the superior word matching type, and determining the reference word with the minimum second editing distance as a second target reference word;
(c) and calculating a third editing distance between the candidate word and at least one reference word under the matching type of the weighted words, and determining the reference word with the minimum third editing distance as a third target reference word.
In the present application, the order of execution of steps (a), (b), and (c) is not limited.
For step (a), that is, under a synonym matching type, firstly, a synonym cluster having the same semantic as that of the candidate word may be obtained, then, a first edit distance between the candidate word and each reference word in the target synonym cluster is calculated, and a first target reference word is determined for the reference word having the smallest first edit distance, and finally, according to the first edit distance between the candidate word and the first target reference word, a first matching degree between the candidate word and the first target reference word is generated, that is, optionally, in some embodiments, the step "calculating a first edit distance between the candidate word and at least one reference word under the synonym matching type, and determining the reference word having the smallest first edit distance as the first target reference word" may specifically include:
(31) selecting a set of synonym clusters in a reference dictionary;
(32) determining synonym clusters with the same semantics as the candidate words to obtain target synonym clusters;
(33) and calculating a first editing distance between the candidate word and each reference word in the target synonym cluster, and determining the reference word with the minimum first editing distance as a first target reference word.
The synonym cluster set comprises a plurality of synonym clusters, and each synonym cluster comprises at least two reference words with the same meaning. When the synonym dictionary is built, reference words with the same meaning can be clustered in advance, thereby forming the synonym cluster set.
For step (b), that is, under the hypernym matching type, the hypernym-hyponym relationship between the candidate word and the reference words is first determined according to the semantics of the candidate word and the semantics of each reference word. Then, based on the determined relationship, a second edit distance between the candidate word and each corresponding hypernym reference word is calculated, and the reference word with the smallest second edit distance is determined as the second target reference word. Finally, a second matching degree between the candidate word and the second target reference word is generated according to the second edit distance between them. That is, optionally, in some embodiments, the step "calculating a second edit distance between the candidate word and at least one reference word under the hypernym matching type, and determining a reference word with the smallest second edit distance as the second target reference word" may specifically include:
(41) determining the hypernym-hyponym relationship between the candidate word and at least one reference word according to the semantics of the candidate word and the semantics of each reference word;
(42) and calculating a second editing distance between the candidate word and the corresponding hypernym reference word based on the determined hypernym-hyponym relationship, and determining the reference word with the minimum second editing distance as a second target reference word.
It should be noted that the candidate word may be a hypernym of some reference words or a hyponym of others. Unlike synonyms, hypernyms and hyponyms are ordered: in text matching it is reasonable to replace a hyponym with its hypernym, but unreasonable to replace a hypernym with its hyponym. For example, suppose the candidate word is "flower" and the reference word is "rose". Replacing the candidate word with the reference word may make the subsequent text matching wrong: for the text to be matched "this flower belongs to the class of monocotyledons", replacing "flower" with "rose" yields "this rose belongs to the class of monocotyledons", whereas roses are dicotyledons, so the meaning clearly departs from the original text. Therefore, in the present application, reference words that are hyponyms of the candidate word are removed. That is, the hypernym-hyponym relationship between the candidate word and at least one reference word is determined according to the semantics of the candidate word and the semantics of each reference word; then, based on the determined relationship, the second edit distance between the candidate word and the corresponding hypernym reference word is calculated, and the reference word with the smallest second edit distance is determined as the second target reference word.
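The direction constraint described above can be sketched as a filter applied before the distance is computed. This is an illustrative sketch, assuming a hypothetical table `hypernyms_of` that maps a word to the set of its hypernyms and a plain Levenshtein distance; how the relation is actually derived from semantics is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def second_target(candidate, references, hypernyms_of):
    """Keep only reference words that are hypernyms of the candidate.

    Hyponym reference words (e.g. "rose" for the candidate "flower") are
    discarded, so a hypernym is never replaced by one of its hyponyms.
    Returns None when no hypernym reference word exists.
    """
    allowed = [r for r in references if r in hypernyms_of.get(candidate, set())]
    if not allowed:
        return None
    return min(allowed, key=lambda r: levenshtein(candidate, r))


# Hypothetical relation table for illustration only.
hypernyms_of = {"rose": {"flower", "plant"}, "flower": {"plant"}}
print(second_target("flower", ["rose", "plant"], hypernyms_of))  # "rose" is filtered out
```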
For step (c), that is, under the weight word matching type, the weight value of each reference word may first be acquired. Then, the similarity between the candidate word and each reference word is calculated, and the reference words whose similarity is larger than a preset value are determined as words to be selected. Next, a third edit distance between the candidate word and each word to be selected is calculated according to the weight of that word to be selected, and the word to be selected with the minimum third edit distance is determined as the third target reference word. Finally, a third matching degree between the candidate word and the third target reference word is generated according to the third edit distance between them. The preset value may be adjusted for different tasks according to actual conditions: to improve the accuracy of subsequent text matching, it may be set to 100%, and it may of course take another value; for example, in a duplicate-checking scenario it may be set to 80%. That is, optionally, in some embodiments, the step "calculating a third edit distance between the candidate word and at least one reference word in the weighted word matching type, and determining the reference word with the smallest third edit distance as a third target reference word" may specifically include:
(51) acquiring a weight value pre-established for each reference word;
(52) calculating the similarity between the candidate word and each reference word, and determining the reference words with the similarity larger than a preset value as the words to be selected;
(53) and calculating a third editing distance between the candidate word and each determined word to be selected according to the weight of the determined word to be selected, and determining the word to be selected with the minimum third editing distance as a third target reference word.
Optionally, in some embodiments, during the dynamic programming algorithm, the higher the weight of the reference word corresponding to the candidate word, the higher the cost of operating on the candidate word, that is, the larger the edit distance; the lower the weight of the reference word corresponding to the candidate word, the lower the cost of operating on the candidate word, that is, the smaller the edit distance. That is, the step "calculating a third edit distance between the candidate word and the determined word to be selected according to the weight of the determined word to be selected, and determining the word to be selected with the minimum third edit distance as a third target reference word" may specifically include: calculating a third editing distance between the candidate word and each word to be selected whose weight is smaller than a preset weight, and determining the word to be selected with the minimum third editing distance as a third target reference word.
104. And fusing the matching degree between the candidate words and the target reference words under each matching type.
In actual application scenarios, synonyms, hypernyms and hyponyms, and words with higher weight values often do not appear alone. Therefore, to improve the accuracy of text matching, the present application fuses the matching degrees between the candidate word and the target reference words under the various matching types. The proportion occupied by reference words of each type differs between application scenarios and may be chosen according to the actual situation. That is, optionally, in some embodiments, the step "fusing matching degrees between candidate words and target reference words under various matching types" may specifically include:
(61) acquiring a preset weight coefficient corresponding to each matching type;
(62) calculating the product of the obtained weight coefficient and the matching degree between the candidate word and the target reference word under the corresponding matching type to obtain the weighted matching degree corresponding to each matching type;
(63) and fusing the weighted matching degrees corresponding to the matching types.
The weighting coefficients are pre-constructed according to different tasks, and in actual application, the matching degrees between the candidate words and the target reference words under each matching type are weighted according to the determined weighting coefficients to finally obtain a fusion result, and then step 105 is executed.
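Steps (61) to (63) can be sketched as a weighted combination. The sketch below assumes that "fusing" means summing the weighted matching degrees, which the text does not state explicitly; the function name `fuse` and the numeric values are hypothetical.

```python
def fuse(match_degrees, coefficients):
    """Multiply each matching degree by the coefficient of its matching type
    and add the products (summation is an assumption of this sketch)."""
    return sum(coefficients[t] * degree for t, degree in match_degrees.items())


# Hypothetical matching degrees and task-specific coefficients.
degrees = {"synonym": 0.5, "hypernym": 0.25, "weight": 1.0}
coeffs = {"synonym": 0.5, "hypernym": 0.25, "weight": 0.25}
print(fuse(degrees, coeffs))  # 0.5*0.5 + 0.25*0.25 + 0.25*1.0
```

Keeping the coefficients in a per-task table mirrors the statement that they are pre-constructed according to different tasks.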
105. And selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result, and outputting the reference text.
The reference text library includes a plurality of reference texts, the reference texts may be a phrase, a sentence or a paragraph, and specifically, the reference texts matched with the text to be matched may be determined in a preset reference text library according to the fusion result, and the determined reference texts are output.
In the text matching method of the present application, a text to be matched containing a plurality of text words and a reference dictionary corresponding to the text to be matched are obtained; the text words are combined to obtain a candidate word semantically associated with at least one reference word; a matching degree between the candidate word and a target reference word is generated according to the edit distance between the candidate word and the reference words under at least one matching type; the matching degrees under the respective matching types are fused; and finally a reference text matching the text to be matched is selected from a preset reference text library according to the fusion result and output. Because the matching degrees under the respective matching types are fused, a large amount of word-based prior knowledge can be utilized, so that more reference words are obtained and the accuracy of subsequent text matching is improved. In addition, taking the single text word (character) as the unit avoids errors that arise when the text to be matched is segmented into words, which further improves the accuracy of text matching.
The method according to the examples is further described in detail below by way of example.
In this embodiment, the text matching apparatus will be described by taking an example in which it is specifically integrated in a terminal.
Referring to fig. 2a, a text matching method may specifically include the following steps:
201. the terminal obtains a text to be matched containing a plurality of text single characters and a reference dictionary corresponding to the text to be matched.
The reference dictionary is a dictionary belonging to the field of the content of the text to be matched and includes at least one reference word. For example, if the content of the text to be matched belongs to the medical field, the corresponding reference dictionary is a medical dictionary. Specifically, the text to be matched and its reference dictionary may be stored in a local database, or may be pulled through a network interface, as determined by the actual situation.
202. And the terminal combines the plurality of single text words to obtain a candidate word associated with at least one reference word meaning.
For example, specifically, the terminal may identify the part of speech of each text word, remove the text words whose part of speech is an auxiliary word, and permute and combine the remaining text words to obtain a candidate word semantically associated with at least one reference word.
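One way to picture step 202 is the sketch below. It rests on several assumptions that are not in the text: auxiliary words are supplied as a ready-made set instead of being found by part-of-speech tagging, "combining" is limited to contiguous spans of the remaining characters, and "associated with a reference word" is simplified to membership in a vocabulary. The Chinese strings are illustrative.

```python
def candidate_words(chars, auxiliary, vocabulary, max_len=4):
    """Drop auxiliary characters, then enumerate contiguous spans of up to
    max_len characters and keep those found in the vocabulary."""
    kept = [c for c in chars if c not in auxiliary]
    found = []
    for start in range(len(kept)):
        for end in range(start + 1, min(start + max_len, len(kept)) + 1):
            word = "".join(kept[start:end])
            if word in vocabulary and word not in found:
                found.append(word)
    return found


# Hypothetical input: "尿路的感染" with the auxiliary character "的".
print(candidate_words(list("尿路的感染"), {"的"}, {"尿路", "感染", "尿路感染"}))
```

Because the spans are built from single characters, no word segmentation of the input is required.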
203. And the terminal generates a first matching degree between the candidate word and the first target reference word according to a first editing distance between the candidate word and the first target reference word.
For example, specifically, under the synonym matching type, the terminal may first obtain a synonym cluster having the same semantics as the candidate word as the target synonym cluster, then calculate a first edit distance between the candidate word and each reference word in the target synonym cluster and determine the reference word having the smallest first edit distance as the first target reference word, and finally generate a first matching degree between the candidate word and the first target reference word according to the first edit distance between them.
204. And the terminal generates a second matching degree between the candidate word and the second target reference word according to a second editing distance between the candidate word and the second target reference word.
For example, specifically, under the hypernym matching type, the terminal first determines the hypernym-hyponym relationship between the candidate word and the reference words according to the semantics of the candidate word and the semantics of each reference word. The terminal then calculates, based on the determined relationship, a second edit distance between the candidate word and each corresponding hypernym, and determines the reference word with the smallest second edit distance as the second target reference word. Finally, the terminal generates a second matching degree between the candidate word and the second target reference word according to the second edit distance between them.
205. And the terminal generates a third matching degree between the candidate word and the third target reference word according to a third editing distance between the candidate word and the third target reference word.
For example, specifically, under the weight word matching type, the terminal may acquire the weight value of each reference word, then calculate the similarity between the candidate word and each reference word and determine the reference words whose similarity is larger than a preset value as words to be selected. The terminal then calculates a third edit distance between the candidate word and each word to be selected according to the weight of that word to be selected, determines the word to be selected with the minimum third edit distance as the third target reference word, and finally generates a third matching degree between the candidate word and the third target reference word according to the third edit distance between them.
It should be noted that, in the present application, the order of steps 203, 204, and 205 is not limited.
206. And the terminal fuses the matching degrees between the candidate word and the target reference words under the synonym matching type, the hypernym matching type and the weight word matching type.
For example, specifically, the terminal may obtain the preset weight coefficients corresponding to the synonym matching type, the hypernym matching type and the weight word matching type. The terminal then calculates the product of each obtained weight coefficient and the matching degree between the candidate word and the target reference word under the corresponding matching type to obtain the weighted matching degree of each matching type, and finally fuses the weighted matching degrees of the matching types.
207. And the terminal selects a reference text matched with the text to be matched from a preset reference text library according to the fusion result and outputs the reference text.
For example, specifically, the terminal may determine, according to the fusion result, a reference text matched with the text to be matched in a preset reference text library, and output the determined reference text.
After the terminal acquires a text to be matched containing a plurality of text words and a reference dictionary corresponding to the text to be matched, the terminal combines the text words to obtain a candidate word semantically associated with at least one reference word. The terminal then generates a first matching degree between the candidate word and a first target reference word according to a first edit distance between them, a second matching degree between the candidate word and a second target reference word according to a second edit distance between them, and a third matching degree between the candidate word and a third target reference word according to a third edit distance between them. Next, the terminal fuses the matching degrees between the candidate word and the target reference words under the synonym matching type, the hypernym matching type and the weight word matching type, and finally selects, according to the fusion result, a reference text matching the text to be matched from a preset reference text library and outputs it. Because the text matching method provided by the present application fuses the matching degrees of the candidate word with the corresponding target reference words under the three matching types, a large amount of word-based prior knowledge can be utilized, so that more reference words are obtained and the accuracy of subsequent text matching is improved. In addition, taking the single text word (character) as the unit avoids errors that arise when the text to be matched is segmented into words, which further improves the accuracy of text matching.
In order to further understand the text matching scheme of the present application, an online diagnosis scenario is taken as an example. A disease often has multiple expressions, and doctors often word medical records freely, so the multiple expressions of the same disease need to be unified before statistical analysis of medical data, medical insurance data and case data; this is the disease standardization task. For example, as shown in fig. 2b, for the description "there is a foreign object at the vocal cords edge" in a medical record, the corresponding normalization result is "code: T17.900, standard expression: foreign matter in the respiratory tract".
Taking the ICD10 standard as an example, the standard contains more than thirty thousand standard disease expressions. For a non-standard disease input text (the text to be matched), the corresponding reference text needs to be selected from these more than thirty thousand standard expressions (reference texts). Specifically, the input text is compared with the standard expressions one by one, and the closest one is selected as the output of the model.
In each comparison, two texts need to be processed. One of them is the input non-standard text (text to be matched) and the other is the standard expression (reference text), and the output is a numerical value representing the degree of correlation between the two texts.
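The one-by-one comparison can be sketched as a linear scan over the reference library. This is a minimal sketch assuming a plain Levenshtein distance as the correlation value (smaller means closer); the codes "T-A" and "T-B" and both reference expressions are hypothetical placeholders, not real ICD10 entries.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def best_reference(text, library):
    """Compare the input with every standard expression and return
    (code, distance) of the closest one."""
    code = min(library, key=lambda c: levenshtein(text, library[c]))
    return code, levenshtein(text, library[code])


# Hypothetical two-entry library.
library = {"T-A": "foreign body in airway", "T-B": "fracture"}
print(best_reference("foreign body in the airway", library))
```

A scan over roughly thirty thousand expressions is linear in the library size, which is why each individual comparison should stay cheap.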
First, the non-standard text input by the user is defined as A = (c1 c2 ... cn) and a standard expression is defined as B = (c1' c2' ... cn'), where c represents a single text word (character); that is, both A and B are ordered sequences of single text words. For convenience of expression, an ordered sequence of single text words is called a word w = (c1 c2 ... cn), with word length len(w) = n.
Then, the plurality of text words in the text to be matched are combined to obtain a candidate word semantically associated with at least one reference word. In the present application, text matching is performed on the text to be matched based on edit distances in three dimensions: the edit distance of synonyms, the edit distance of hypernyms and the edit distance of weight words.
For the edit distance of synonyms, a synonym dictionary S is first obtained, and a group of mutually synonymous words is defined as a synonym cluster (the notation for the dictionary and the cluster is given in formula images 001 and 002 of the original publication). For example, three synonymous terms for the urinary tract and urethra are synonyms of one another and constitute one synonym cluster.
The minimum word length and the maximum word length over all words w in the synonym dictionary are then calculated (formula images 003 and 004 of the original publication).
On this basis, the edit distance fused with the synonym dictionary is calculated by a dynamic programming recurrence (formula images 005 to 007 of the original publication), where:
the quantity shown in formula image 008 represents the minimum edit distance between the text to be matched and the reference word under the synonym dictionary S, i = 0 represents the initialization of the first row of the synonym relation matrix, and j = 0 represents the initialization of its first column;
formula image 009 is the transfer equation of the synonym relation matrix, in which w is a word in the text to be matched and w' is a word in the synonym dictionary S;
formula image 010 states that if the word w in the text to be matched and the word w' of the reference text are synonyms, the edit distance between them is a; otherwise, the minimum number of edit operations required to convert w into w' is calculated. Here a represents the weight of synonyms and generally takes the value 0.1.
During the dynamic programming algorithm, it is judged whether the words ending with the character at the current position can form synonyms; if so, a smaller replacement cost is used. For example, the input non-standard text is "urinary tract infection" and the current candidate standard expression is "urinary tract infection" (the two expressions use different synonymous terms in the original language, which this translation renders identically). With the conventional edit distance algorithm, the calculated distance is 2; with the algorithm of the present application, the calculated distance is 0.1. The smaller the distance, the better the two texts match, so the algorithm of the present application performs better in this case.
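Since the recurrence itself survives only as images, the following is one possible reading of the description, not the formula of the original publication. It assumes the usual Levenshtein initialization and adds one extra transition: when a word ending at position i of the input and a word ending at position j of the reference belong to the same synonym cluster, the two words may be exchanged as a whole at cost a = 0.1. The Chinese strings are assumed stand-ins for the translated example.

```python
def synonym_edit_distance(a, b, clusters, syn_cost=0.1):
    """Edit distance in which two words of the same synonym cluster can be
    swapped as a unit for syn_cost (a sketch under stated assumptions)."""
    cluster_of = {w: k for k, cluster in enumerate(clusters) for w in cluster}
    lengths = sorted({len(w) for w in cluster_of})   # only these lengths can match
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)                            # j = 0: first column
    for j in range(len(b) + 1):
        d[0][j] = float(j)                            # i = 0: first row
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            best = min(d[i - 1][j] + 1,
                       d[i][j - 1] + 1,
                       d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
            for p in lengths:                         # word of a ending at i
                if p > i:
                    break
                ka = cluster_of.get(a[i - p:i])
                if ka is None:
                    continue
                for q in lengths:                     # word of b ending at j
                    if q > j:
                        break
                    if cluster_of.get(b[j - q:j]) == ka:
                        best = min(best, d[i - p][j - q] + syn_cost)
            d[i][j] = best
    return d[len(a)][len(b)]


cluster = [{"尿路", "泌尿道", "尿道"}]   # hypothetical synonym cluster
print(synonym_edit_distance("尿路感染", "泌尿道感染", cluster))
print(synonym_edit_distance("尿路感染", "泌尿道感染", []))   # no dictionary: plain distance
```

With these assumed strings the sketch reproduces the contrast in the text: a distance of 2 without the synonym dictionary and 0.1 with it.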
For the edit distance of hypernyms, for example, "nodule" is a hyponym of "placeholder", and "placeholder" is a hypernym of "nodule". Unlike synonyms, hypernyms and hyponyms are ordered; therefore, an ordered word pair (formula images 011 and 012 of the original publication) is used to define the hypernym-hyponym relation, and the hypernym-hyponym dictionary H is a set of multiple word pairs having this relation (formula image 013 of the original publication).
It should be noted that, during disease normalization, it is reasonable to replace a hyponym with its hypernym, and it is not reasonable to replace a hypernym with its hyponym. As for synonyms, the minimum word length and the maximum word length in the dictionary are counted (formula images 014 and 015 of the original publication).
The edit distance fused with the hypernym-hyponym dictionary is calculated by a dynamic programming recurrence (formula images 016 to 018 of the original publication), where:
the quantity shown in formula image 019 represents the minimum edit distance between the text to be matched and the reference word under the hypernym-hyponym dictionary H, i = 0 represents the initialization of the first row of the hypernym-hyponym relation matrix, and j = 0 represents the initialization of its first column;
formula image 020 is the transfer equation of the hypernym-hyponym relation matrix, in which w is a word in the text to be matched and w' is a word in the hypernym-hyponym dictionary H;
formula image 021 states that if the word w in the text to be matched and the word w' of the reference text are in a hypernym-hyponym relation, the edit distance between them is b; otherwise, the minimum number of edit operations required to convert w into w' is calculated. Here b represents the weight of hypernyms and generally takes the value 0.13.
Specifically, it is judged whether the words ending with the character at the current position can form a hypernym relation; if so, a smaller replacement cost is used. For example, the input non-standard text is "nodule" and the current candidate standard expression is "placeholder", where "placeholder" is a hypernym of "nodule". With the conventional edit distance algorithm, the calculated distance is 3; with the algorithm of the present application, the calculated distance is 0.13. The smaller the distance, the better the two texts match.
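The directional variant can be sketched in the same way. The block below is an assumed reading, not the recurrence of the original publication: the discounted transition at cost b = 0.13 applies only when the word in the text to be matched is a hyponym and the word in the reference text is one of its hypernyms. The table `hypernyms_of` is hypothetical, and the English words are used literally, so the undiscounted distances differ from the figure quoted for the original-language example.

```python
def hypernym_edit_distance(a, b, hypernyms_of, hyper_cost=0.13):
    """Edit distance in which a hyponym in `a` may be replaced by one of its
    hypernyms in `b` for hyper_cost; the reverse direction gets no discount."""
    hypo_lengths = sorted({len(w) for w in hypernyms_of})
    hyper_lengths = sorted({len(h) for hs in hypernyms_of.values() for h in hs})
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            best = min(d[i - 1][j] + 1,
                       d[i][j - 1] + 1,
                       d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
            for p in hypo_lengths:                    # hyponym ending at i in a
                if p > i:
                    break
                hypers = hypernyms_of.get(a[i - p:i])
                if not hypers:
                    continue
                for q in hyper_lengths:               # hypernym ending at j in b
                    if q > j:
                        break
                    if b[j - q:j] in hypers:
                        best = min(best, d[i - p][j - q] + hyper_cost)
            d[i][j] = best
    return d[len(a)][len(b)]


relation = {"nodule": {"placeholder"}}   # hypothetical hyponym -> hypernyms table
print(hypernym_edit_distance("nodule", "placeholder", relation))   # discounted direction
print(hypernym_edit_distance("placeholder", "nodule", relation))   # no discount
```

Storing the relation as hyponym to hypernyms makes the asymmetry explicit: a lookup succeeds only in the direction that the text calls reasonable.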
Similarly, regarding the edit distance of weight words, words with high weight and words with low weight are processed by the same logic at the algorithm level; the two differ only in their weights (formula image 022 of the original publication).
The edit distance fused with the weight dictionary is calculated by a dynamic programming recurrence (formula images 023 to 025 of the original publication), where:
the quantity shown in formula image 026 represents the minimum edit distance between the text to be matched and the reference word under the weight dictionary I, i = 0 represents the initialization of the first row of the weight word relation matrix, and j = 0 represents the initialization of its first column;
formula image 027 is the transfer equation of the weight relation matrix, in which w is a word in the text to be matched and w' is a word in the weight dictionary I;
formula image 028 states that if the word w in the text to be matched and the word w' of the reference text are weight words, that is, words recorded in the weight dictionary I, the edit distance between them is K(w); otherwise, the minimum number of edit operations required to convert w into w' is calculated.
During the dynamic programming algorithm, it is judged whether the word ending with the character at the current position is an important word or an unimportant word. If it is a word with higher weight, the cost of operating on it is higher; if it is a word with lower weight, the cost of operating on it is lower. If it is neither, the edit distance is calculated with the default operation cost, that is, the length of the word.
For example, the input non-standard text is "bacterial pneumonia" and the current candidate standard expression is "bacterial disease". With the conventional edit distance algorithm, the calculated distance is 3. However, "disease" is a commonly occurring word in the medical corpus, while "bacterial pneumonia" and "bacterial disease" are different diseases. Assume that the weight of "pneumonia" is 4 and the weight of "disease" is 0.5. In the present application, the calculated distance is then 4.5. The larger distance highlights the difference between the two texts and prevents misjudgment.
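The arithmetic of this example (4 + 0.5 = 4.5) can be reproduced with a simplified sketch. The block below is an assumption-laden simplification, not the recurrence of the original publication: it works on already separated words, charges the weight K(w) for deleting or inserting a word found in the weight dictionary, and falls back to the word length otherwise, so that replacing one word by another costs one deletion plus one insertion.

```python
def weighted_edit_distance(a_words, b_words, weights):
    """Word-level edit distance whose delete/insert cost is the word's weight,
    or the word's length when it has no entry (the default cost)."""
    def cost(word):
        return float(weights.get(word, len(word)))

    n, m = len(a_words), len(b_words)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(a_words[i - 1])      # delete every word of a
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost(b_words[j - 1])      # insert every word of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            keep = d[i - 1][j - 1] if a_words[i - 1] == b_words[j - 1] else float("inf")
            d[i][j] = min(keep,
                          d[i - 1][j] + cost(a_words[i - 1]),   # delete from a
                          d[i][j - 1] + cost(b_words[j - 1]))   # insert from b
    return d[n][m]


weights = {"pneumonia": 4, "disease": 0.5}   # weights taken from the example
print(weighted_edit_distance(["bacterial", "pneumonia"],
                             ["bacterial", "disease"], weights))
```

A high weight on "pneumonia" keeps a specific disease name from being traded cheaply for the generic word "disease".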
Finally, the matching degrees of the three modes are fused, a reference text matching the text to be matched is selected from the preset reference text library according to the fusion result, and the reference text is output. The fusion formula is given in formula image 029 of the original publication.
It should be noted that, on specific medical insurance ICD10/ICD9-CM3 data sets, the accuracy after targeted tuning reaches 96%, the accuracy of the general version is 91.1%/92.56%, and the performance can continue to improve as business data accumulates. The accuracy of manual labeling is 95%. The method therefore approaches the accuracy of manual labeling in the general setting and exceeds it after targeted tuning. That is, a large amount of word-based prior knowledge is utilized, so that more reference words are obtained and the accuracy of subsequent text matching is improved; in addition, taking the single text word (character) as the unit avoids errors that arise when the text to be matched is segmented into words, which further improves the accuracy of text matching.
For another example, in a robot online question answering scenario, the text input by the user is "how to unpick and wash an air conditioner of xx model". In the solution of the present application, a reference dictionary corresponding to the text, such as a dictionary in the home field, may first be obtained. Then, the plurality of text words in the input text are combined to obtain candidate words semantically associated with reference words in the reference dictionary, and matching degrees between the candidate words and the target reference words are generated according to the edit distances between the candidate words and the reference words under the target matching types. Next, the matching degrees between the candidate words and the target reference words under the respective matching types are fused; the method for calculating the edit distance is described in the foregoing embodiment and is not repeated here. Then, according to the fusion result, a reference text matching the input text is selected from a preset reference text library, that is, the standard question corresponding to "how to unpick and wash an air conditioner of xx model" in the robot online question answering scenario is selected; for example, the reference text may be "how the xx model of the air conditioner is disassembled and cleaned". Finally, the answer corresponding to the reference text is output.
In order to better implement the text matching method of the present application, the present application further provides a text matching device (matching device for short) based on the above. The meanings of the nouns are the same as those in the text matching method, and specific implementation details can refer to the description in the method embodiment.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text matching apparatus provided in the present application, where the text matching apparatus may include an obtaining module 301, a combining module 302, a generating module 303, a fusing module 304, and an output module 305, specifically as follows:
the obtaining module 301 is configured to obtain a text to be matched including a plurality of text words and a reference dictionary corresponding to the text to be matched.
For example, if the content of the text to be matched belongs to the medical field, the corresponding reference dictionary is a medical dictionary. Specifically, the text to be matched and its reference dictionary may be stored in a local database, or may be pulled by the obtaining module 301 through a network interface, as determined by the actual situation.
And the combination module 302 is used for combining the plurality of text single words to obtain a candidate word associated with at least one reference word meaning.
Optionally, in some embodiments, the combining module 302 may be specifically configured to: identify the part of speech of each text single word, remove the text single words whose part of speech is an auxiliary word, and permute and combine the retained text single words to obtain a candidate word semantically associated with at least one reference word.
The generating module 303 is configured to generate a matching degree between the candidate word and the target reference word according to an edit distance between the candidate word and the reference word in at least one matching type.
For example, a target matching type corresponding to each reference word may be determined according to the semantic association relationship between the candidate word and the reference word, and the matching degree between the candidate word and the target reference word is then generated according to the determined target matching type. That is, optionally, in some embodiments, the generating module 303 may specifically include:
the determining submodule is used for determining a target matching type corresponding to each reference word according to the semantic association relation between the candidate words and the reference words;
the calculation sub-module is used for calculating the editing distance between the candidate word and at least one reference word in the determined target matching type based on the determined target matching type, and determining the reference word with the minimum editing distance as the target reference word;
and the generating sub-module is used for generating the matching degree between the candidate words and the target reference words according to the editing distance between the candidate words and the target reference words under at least one matching type.
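The calculation and generation sub-modules above can be sketched as follows. The edit distance shown is the standard Levenshtein distance; the normalization used in `matching_degree` (one minus the distance divided by the longer length) is an assumption of this sketch, since the formula of the foregoing embodiment is not reproduced in this section, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


def target_reference_word(candidate: str, reference_words):
    """Pick the reference word with the minimum edit distance to the candidate."""
    return min(reference_words, key=lambda ref: edit_distance(candidate, ref))


def matching_degree(candidate: str, reference: str) -> float:
    """Assumed normalization: 1.0 for identical strings, lower for larger distances."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(candidate, reference) / longest


# Usage: select the closest reference word, then score the pair.
target = target_reference_word("cold", ["cough", "colds", "fever"])
print(target, matching_degree("cold", target))
```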
Optionally, in some embodiments, the calculation sub-module may specifically include:
the first calculation unit is used for calculating a first editing distance between the candidate word and at least one reference word under the synonym matching type, and determining the reference word with the minimum first editing distance as a first target reference word;
the second calculating unit is used for calculating a second editing distance between the candidate word and at least one reference word under the hypernym matching type, and determining the reference word with the minimum second editing distance as a second target reference word;
and the third calculating unit is used for calculating a third editing distance between the candidate word and at least one reference word under the weighted word matching type, and determining the reference word with the minimum third editing distance as a third target reference word.
Optionally, in some embodiments, the first computing unit may specifically be configured to: select a synonym cluster set in the reference dictionary, wherein the synonym cluster set comprises a plurality of synonym clusters, and each synonym cluster comprises at least two reference words with the same word sense; determine the synonym cluster with the same semantics as the candidate word to obtain a target synonym cluster; and calculate a first editing distance between the candidate word and each reference word in the target synonym cluster, and determine the reference word with the minimum first editing distance as a first target reference word. The generation submodule is specifically configured to: generate a first matching degree between the candidate word and the first target reference word according to the first editing distance between the candidate word and the first target reference word.
Optionally, in some embodiments, the second computing unit may specifically be configured to: determine the hypernym-hyponym relation between the candidate word and at least one reference word according to the semantics of the candidate word and the semantics of each reference word; and calculate, based on the determined hypernym-hyponym relation, a second editing distance between the candidate word and each corresponding hypernym reference word, and determine the reference word with the minimum second editing distance as a second target reference word. The generation submodule may specifically be configured to: generate a second matching degree between the candidate word and the second target reference word according to the second editing distance between the candidate word and the second target reference word.
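A minimal sketch of the synonym and hypernym computing units follows. How the semantics of a candidate word are determined is not specified in this section, so the sketch assumes two precomputed lookups: `sense_of`, which maps a candidate word to a sense label, and `hypernyms_of`, which maps a candidate word to its hypernym reference words. These names, and the example vocabulary, are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Recursive Levenshtein distance with memoization."""
    if not a or not b:
        return len(a) + len(b)
    return min(edit_distance(a[1:], b) + 1,                      # delete a[0]
               edit_distance(a, b[1:]) + 1,                      # insert b[0]
               edit_distance(a[1:], b[1:]) + (a[0] != b[0]))     # substitute or keep


def closest(candidate, words):
    """Return (word, distance) for the minimum edit distance, or None if words is empty."""
    if not words:
        return None
    best = min(words, key=lambda w: edit_distance(candidate, w))
    return best, edit_distance(candidate, best)


def first_target(candidate, synonym_clusters, sense_of):
    """Synonym type: search only the cluster that shares the candidate's sense."""
    return closest(candidate, synonym_clusters.get(sense_of(candidate), []))


def second_target(candidate, hypernyms_of):
    """Hypernym type: search only the reference words that are hypernyms of the candidate."""
    return closest(candidate, hypernyms_of.get(candidate, []))


# Usage with hypothetical dictionary data.
clusters = {"fever": ["fever", "pyrexia"]}
senses = {"feverish": "fever"}
print(first_target("feverish", clusters, senses.get))
print(second_target("type 2 diabetes", {"type 2 diabetes": ["diabetes", "disease"]}))
```

Restricting the search to one cluster or one hypernym list keeps the edit distance comparison among words that are already semantically related to the candidate, which is the point of determining the matching type first.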
Optionally, in some embodiments, the third computing unit may specifically include:
the acquisition subunit is used for acquiring a weight value pre-established by each reference word;
the calculating subunit is used for calculating the similarity between the candidate word and each reference word, and determining each reference word whose similarity is larger than a preset value as a word to be selected;
the determining subunit is configured to calculate a third editing distance between the candidate word and each determined word to be selected according to the weight of the determined word to be selected, and determine the word to be selected with the minimum third editing distance as a third target reference word;
the generation submodule is specifically configured to: generate a third matching degree between the candidate word and the third target reference word according to the third editing distance between the candidate word and the third target reference word.
Optionally, in some embodiments, the determining subunit may specifically be configured to: calculate a third editing distance between the candidate word and each word to be selected whose weight is smaller than the preset weight, and determine the word to be selected with the minimum third editing distance as a third target reference word.
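The weighted word flow above (similarity filter, weight filter, minimum edit distance) can be sketched as below. The similarity measure is not specified in this section, so the sketch substitutes the ratio from Python's `difflib.SequenceMatcher` as a stand-in; the threshold values, the function name `third_target`, and the example weights are likewise assumptions.

```python
from difflib import SequenceMatcher
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Recursive Levenshtein distance with memoization."""
    if not a or not b:
        return len(a) + len(b)
    return min(edit_distance(a[1:], b) + 1,
               edit_distance(a, b[1:]) + 1,
               edit_distance(a[1:], b[1:]) + (a[0] != b[0]))


def third_target(candidate, weights, min_similarity=0.6, max_weight=0.8):
    """weights maps each reference word to its pre-established weight value."""
    # Words to be selected: reference words whose similarity exceeds the preset value.
    to_select = [w for w in weights
                 if SequenceMatcher(None, candidate, w).ratio() > min_similarity]
    # Keep only the words to be selected whose weight is below the preset weight.
    to_select = [w for w in to_select if weights[w] < max_weight]
    if not to_select:
        return None
    best = min(to_select, key=lambda w: edit_distance(candidate, w))
    return best, edit_distance(candidate, best)


# Usage with hypothetical weights.
weights = {"headaches": 0.2, "head": 0.9, "toothache": 0.4}
print(third_target("headache", weights))
```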
The fusion module 304 is configured to fuse the matching degrees between the candidate words and the target reference words under each matching type.
Optionally, in some embodiments, the fusion module 304 may be specifically configured to: obtain a preset weight coefficient corresponding to each matching type; calculate the product of each obtained weight coefficient and the matching degree between the candidate word and the target reference word under the corresponding matching type, to obtain the weighted matching degree corresponding to each matching type; and fuse the weighted matching degrees corresponding to the matching types.
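The weighting described above can be illustrated with a short sketch. Fusing by summation is an assumption here (the text only states that the weighted matching degrees are fused), and the matching type labels and coefficient values are hypothetical.

```python
def fuse_matching_degrees(degrees, coefficients):
    """Multiply each matching degree by its type's coefficient, then sum the products.

    degrees: matching type -> matching degree for that type.
    coefficients: matching type -> preset weight coefficient.
    """
    weighted = {t: coefficients[t] * d for t, d in degrees.items()}
    return sum(weighted.values())


# Usage with hypothetical values for the three matching types.
degrees = {"synonym": 0.8, "hypernym": 0.5, "weighted": 0.9}
coefficients = {"synonym": 0.5, "hypernym": 0.2, "weighted": 0.3}
print(fuse_matching_degrees(degrees, coefficients))
```

With these values the three products are 0.4, 0.1 and 0.27, so the fused result is approximately 0.77.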
The output module 305 is configured to select a reference text matched with the text to be matched from a preset reference text library according to the fusion result, and output the reference text.
For example, the output module 305 may determine, according to the fusion result, a reference text matched with the text to be matched in the preset reference text library, and output the determined reference text.
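One way the selection step might look is sketched below. This section does not state how the fusion result is mapped onto entries of the reference text library, so the sketch assumes that each reference text already carries a fused score and that the highest score wins; the `threshold` parameter and the function name are also assumptions.

```python
def select_reference_text(fused_scores, threshold=0.0):
    """Return the reference text with the highest fused score, or None if none qualifies.

    fused_scores: reference text -> fused matching degree (assumed to be precomputed).
    """
    if not fused_scores:
        return None
    best = max(fused_scores, key=fused_scores.get)
    return best if fused_scores[best] >= threshold else None


# Usage with hypothetical scores for two standard questions.
scores = {
    "how the xx model of the air conditioner is disassembled and cleaned": 0.77,
    "how to reset the xx model air conditioner": 0.31,
}
print(select_reference_text(scores, threshold=0.5))
```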
In the present application, after the obtaining module 301 obtains a text to be matched containing a plurality of text words and a reference dictionary corresponding to the text to be matched, the combining module 302 combines the plurality of text words to obtain a candidate word associated with at least one reference word meaning. The generating module 303 generates a matching degree between the candidate word and a target reference word according to the edit distance between the candidate word and the target reference word in at least one matching type, and the fusion module 304 fuses the matching degrees between the candidate word and the target reference word in each matching type. Finally, the output module 305 selects a reference text matched with the text to be matched from a preset reference text library according to the fusion result and outputs the reference text. Because the text matching apparatus provided in the present application generates the matching degrees from edit distances in at least one matching type and then fuses the matching degrees across the matching types, a large amount of word-based prior knowledge can be utilized, so that more reference words are obtained and the accuracy of subsequent text matching is improved. In addition, because the text is processed with the single character as the unit, errors in word segmentation of the text to be matched can be avoided, which further improves the accuracy of text matching.
In addition, the present application also provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device related to the present application, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
obtaining a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched; combining the plurality of text single words to obtain a candidate word associated with at least one reference word meaning; generating a matching degree between the candidate word and a target reference word according to an editing distance between the candidate word and the target reference word in at least one matching type; fusing the matching degrees between the candidate word and the target reference word in each matching type; and selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result, and outputting the reference text.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
In the present application, a text to be matched containing a plurality of text words and a reference dictionary corresponding to the text to be matched are obtained, and the text words are combined to obtain a candidate word associated with at least one reference word meaning. A matching degree between the candidate word and a target reference word is generated according to the editing distance between the candidate word and the target reference word in at least one matching type, and the matching degrees between the candidate word and the target reference word in each matching type are fused. A reference text matched with the text to be matched is then selected from a preset reference text library according to the fusion result, and the reference text is output. Because the matching degrees are generated from editing distances in at least one matching type and then fused across the matching types, a large amount of word-based prior knowledge can be utilized, so that more reference words are obtained and the accuracy of subsequent text matching is improved. In addition, because the text is processed with the single character as the unit, errors in word segmentation of the text to be matched can be avoided, which further improves the accuracy of text matching.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a storage medium having stored therein a plurality of instructions that can be loaded by a processor to perform the steps of any of the text matching methods provided herein. For example, the instructions may perform the steps of:
obtaining a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched; combining the plurality of text single words to obtain a candidate word associated with at least one reference word meaning; generating a matching degree between the candidate word and a target reference word according to an editing distance between the candidate word and the target reference word in at least one matching type; fusing the matching degrees between the candidate word and the target reference word in each matching type; and selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result, and outputting the reference text.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Since the instructions stored in the storage medium can execute the steps of any text matching method provided by the present application, they can achieve the beneficial effects of any such method; for details, see the foregoing embodiments, which are not described herein again.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
The text matching method, the text matching device, the electronic device, and the storage medium provided by the present application are described in detail above, and a specific example is applied in the text to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (12)

1. A text matching method, comprising:
acquiring a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched, wherein the reference dictionary is a dictionary in the field to which the content of the text to be matched belongs, and comprises at least one reference word;
combining a plurality of text single words to obtain candidate words associated with at least one reference word meaning;
generating a matching degree between the candidate word and a target reference word according to the editing distance between the candidate word and at least one reference word in a target matching type;
fusing the matching degree between the candidate words and the target reference words under each matching type;
and selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result, and outputting the reference text.
2. The method of claim 1, wherein generating the matching degree between the candidate word and the target reference word according to the edit distance between the candidate word and the at least one reference word in the target matching type comprises:
determining a target matching type corresponding to each reference word according to the semantic association relationship between the candidate words and the reference words;
based on the determined target matching type, calculating the editing distance between the candidate word and at least one reference word in the determined target matching type, and determining the reference word with the minimum editing distance as a target reference word;
and generating the matching degree between the candidate word and the target reference word according to the editing distance between the candidate word and the target reference word under at least one matching type.
3. The method according to claim 2, wherein the step of calculating an edit distance between the candidate word and at least one reference word in the determined target matching type based on the determined target matching type, and determining the reference word with the smallest edit distance as the target reference word comprises:
calculating a first editing distance between the candidate word and at least one reference word under the synonym matching type, and determining the reference word with the minimum first editing distance as a first target reference word;
calculating a second editing distance between the candidate word and at least one reference word under the hypernym matching type, and determining the reference word with the minimum second editing distance as a second target reference word;
and calculating a third editing distance between the candidate word and at least one reference word under the weighted word matching type, and determining the reference word with the minimum third editing distance as a third target reference word.
4. The method according to claim 3, wherein the calculating a first edit distance between the candidate word and at least one reference word in a synonym matching type, and determining the reference word with the smallest first edit distance as the first target reference word comprises:
selecting a synonym cluster set in the reference dictionary, wherein the synonym cluster set comprises a plurality of synonym clusters, and each synonym cluster comprises at least two reference words with the same word sense;
determining a synonym cluster with the same semantic meaning as the candidate word to obtain a target synonym cluster;
calculating a first editing distance between the candidate word and each reference word in the target synonym cluster, and determining the reference word with the minimum first editing distance as a first target reference word;
the generating the matching degree between the candidate word and the target reference word according to the edit distance between the candidate word and the target reference word under at least one matching type includes: and generating a first matching degree between the candidate word and the first target reference word according to a first editing distance between the candidate word and the first target reference word.
5. The method according to claim 3, wherein the calculating a second edit distance between the candidate word and at least one reference word in the hypernym matching type, and determining the reference word with the smallest second edit distance as a second target reference word comprises:
determining the hypernym-hyponym relation between the candidate word and at least one reference word according to the semantics of the candidate word and the semantics of each reference word;
calculating a second editing distance between the candidate word and the corresponding hypernym reference word based on the determined hypernym-hyponym relation, and determining the reference word with the minimum second editing distance as a second target reference word;
the generating the matching degree between the candidate word and the target reference word according to the edit distance between the candidate word and the target reference word under at least one matching type includes: and generating a second matching degree between the candidate word and a second target reference word according to a second editing distance between the candidate word and the second target reference word.
6. The method according to claim 3, wherein the calculating a third edit distance between the candidate word and at least one reference word in the weighted word matching type, and determining the reference word with the smallest third edit distance as a third target reference word comprises:
acquiring a weight value pre-established for each reference word;
calculating the similarity between the candidate words and each reference word, and determining the reference words with the similarity larger than a preset value as the words to be selected;
calculating a third editing distance between the candidate word and the determined word to be selected according to the weight of the determined word to be selected, and determining the word to be selected with the minimum third editing distance as a third target reference word;
the generating the matching degree between the candidate word and the target reference word according to the edit distance between the candidate word and the target reference word under at least one matching type includes: and generating a third matching degree between the candidate word and a third target reference word according to a third editing distance between the candidate word and the third target reference word.
7. The method according to claim 6, wherein the calculating a third edit distance between the candidate word and the determined word to be selected according to the weight of the determined word to be selected, and determining the word to be selected with the minimum third edit distance as a third target reference word comprises:
and calculating a third editing distance between the candidate word and the word to be selected with the weight smaller than the preset weight, and determining the word to be selected with the minimum third editing distance as a third target reference word.
8. The method according to any one of claims 1 to 7, wherein the fusing the matching degrees between the candidate words and the target reference words under the matching types comprises:
acquiring a preset weight coefficient corresponding to each matching type;
calculating the product of the obtained weight coefficient and the matching degree between the candidate word and the target reference word under the corresponding matching type to obtain the weighted matching degree corresponding to each matching type;
and fusing the weighted matching degrees corresponding to the matching types.
9. The method of any one of claims 1 to 7, wherein combining a plurality of text single words to obtain a candidate word associated with at least one reference word meaning comprises:
identifying the part of speech of each text single word;
and removing the text single words with parts of speech as auxiliary words, and arranging and combining the reserved text single words to obtain candidate words associated with at least one reference word meaning.
10. A text matching apparatus, comprising:
the acquisition module is used for acquiring a text to be matched containing a plurality of text single words and a reference dictionary corresponding to the text to be matched, wherein the reference dictionary is a dictionary in the field to which the content of the text to be matched belongs, and the reference dictionary comprises at least one reference word;
the combination module is used for combining a plurality of text single words to obtain candidate words associated with at least one reference word meaning;
the generating module is used for generating the matching degree between the candidate word and the target reference word according to the editing distance between the candidate word and the reference word under at least one matching type;
the fusion module is used for fusing the matching degree between the candidate words and the target reference words under each matching type;
and the output module is used for selecting a reference text matched with the text to be matched from a preset reference text library according to the fusion result and outputting the reference text.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the text matching method according to any of claims 1-9 are implemented when the program is executed by the processor.
12. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, carries out the steps of the text matching method according to any one of claims 1 to 9.
CN202011045975.8A 2020-09-29 2020-09-29 Text matching method and device, electronic equipment and storage medium Active CN111931477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011045975.8A CN111931477B (en) 2020-09-29 2020-09-29 Text matching method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111931477A CN111931477A (en) 2020-11-13
CN111931477B true CN111931477B (en) 2021-01-05

Family

ID=73334752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011045975.8A Active CN111931477B (en) 2020-09-29 2020-09-29 Text matching method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111931477B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant