CN111581329A - Short text matching method and device based on inverted index - Google Patents
- Publication number
- CN111581329A (application CN202010328205.8A)
- Authority
- CN
- China
- Prior art keywords
- matching
- rule
- template
- inverted index
- phrase
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of natural language processing and provides a short text matching method and device based on an inverted index. The method extracts features from the input text, matches the extracted features one by one against the rule templates in a knowledge base, and finds the most appropriate template. In particular, after the features are extracted, an inverted index is built for the input text, which improves computational efficiency during matching and greatly speeds up the one-by-one matching against the templates in the knowledge base. The device comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher. The invention can be applied to question matching in intelligent customer service and question-answering systems, or to user-input matching in other information retrieval scenarios; it allows flexible and complex text matching rules to be defined while keeping the matching process efficient.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a short text matching method and device based on an inverted index.
Background
Natural language processing studies interaction between humans and computers through natural language, and text matching is one of its important tasks. In a question-answering system, a user's question can be answered by matching its text against all questions in a pre-built knowledge base and returning the answer to the matched question. Text matching generally covers matching between text and text, and matching between text and rule templates. An inverted index is a technique for looking up records by the value of an attribute; it is widely used in information retrieval to speed up full-text search in search engines.
At present, text matching is generally done in one of two ways: matching between text and text, or matching between text and rule templates. Text-to-text matching is simple to use, but its semantic matching is often not accurate enough. Deep-learning-based techniques have made some breakthroughs in accuracy, but they require large amounts of data and are not accurate enough when data is scarce. Regular expressions require a certain amount of specialist knowledge, are unintuitive and error-prone, and, in particular when operators occur frequently, their fuzzy matching mechanism degrades performance.
In addition, the method proposed in CN201811241976 is simpler to use than regular expressions and runs fast, but its matching capability is not strong enough: it cannot support rules based on which phrase comes before or after another.
Disclosure of Invention
The invention provides a short text matching method and device based on an inverted index. It is mainly intended for intelligent customer service question answering, where it accurately matches users' questions so that they can be answered correctly.
The invention is realized as follows. The short text matching method based on the inverted index comprises the following steps:
S1, feature extraction: extracting features from an input text, wherein the features consist of a plurality of phrases contained in the text and the positions of the phrases in the text;
S2, feature expansion: expanding the features extracted in step S1 by pairing the synonyms or category names of the extracted phrases with the positions of those phrases in the text to form new features;
S3, generating an inverted index: building an inverted index over all the features;
S4, rule matching: matching the inverted index against the preset rule templates in sequence, and outputting the matching results;
S5, outputting a result: selecting the rule template with the highest priority as the output, according to the matching results and the preset priority relations among the rule templates.
Preferably, the feature extraction specifically comprises:
presetting a phrase dictionary, performing phrase matching on the input text using a trie, and extracting the phrases that occur in both the phrase dictionary and the input text;
if two phrases overlap, selecting the longer phrase and discarding the shorter one; if their lengths are the same, selecting the phrase that appears earlier in the text.
Preferably, the feature expansion specifically comprises:
presetting a phrase mapping table, wherein the phrase mapping table is used for mapping the phrases in the extracted features, and the mapped values are added to the feature table as new features.
Preferably, the rule matching specifically comprises:
presetting a rule template knowledge base comprising a plurality of rule templates, and then matching the inverted index against each rule template in the rule template knowledge base, wherein each matching result is either success or failure.
Preferably, the outputting of the result specifically comprises:
for all successfully matched rule templates, if their number exceeds one, determining that the matching results conflict;
when a conflict exists, discarding the successfully matched templates with lower priority according to the relative priorities among the templates preset in the rule template knowledge base;
if no conflict remains, outputting the number of the successfully matched rule template as the output result; if the conflict persists, outputting a result indicating that all matching is judged to have failed.
The invention also provides a short text matching device based on the inverted index, characterized in that the device comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher;
the rule template knowledge base comprises a plurality of predefined rule templates and information on the relative priorities among the rule templates;
the feature extractor comprises a preset phrase dictionary and, in operation, extracts the phrases that occur in both the phrase dictionary and the input text;
the feature expander comprises a predefined phrase mapping table and, in operation, expands the features extracted by the feature extractor;
the inverted index generator is used for generating an inverted index for the features expanded by the feature expander;
the template compiler is used for compiling the predefined rule templates in the rule template knowledge base;
the template matcher is used for matching the generated inverted index one by one against the objects compiled from the rule templates in the knowledge base and, if several rule templates match successfully, for screening them according to the priority rules in the rule template knowledge base and outputting the final matching result.
Preferably, the device further comprises a template matching cache, which provides a caching service during template matching and improves overall matching efficiency.
Compared with the prior art, the invention has the following beneficial effects: the short text matching method and device based on the inverted index extract features from the input text, match the extracted features one by one against the rule templates in the knowledge base, and find the most appropriate template. After the features are extracted, an inverted index is built for the input text, which improves computational efficiency during matching and greatly speeds up the one-by-one matching against the templates in the knowledge base. Flexible and complex text matching rules can therefore be defined while the matching process remains efficient.
Drawings
Fig. 1 is a flowchart illustrating a short text matching method based on inverted indexes according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
Referring to Fig. 1, the present embodiment provides a technical solution: the short text matching method based on the inverted index comprises the following steps:
S1, feature extraction: extracting features from the input text, wherein the features consist of a plurality of phrases contained in the text and the positions of the phrases in the text.
A phrase dictionary is preset, phrase matching is performed on the input text using a trie, and the phrases that occur in both the phrase dictionary and the input text are extracted. The predefined dictionary may be a multi-line text with one phrase per line.
If two phrases overlap, the longer phrase is selected and the shorter one is discarded. If their lengths are the same, the phrase that appears earlier in the text is selected.
A feature Fx consists of an extracted phrase Fxs and the position Fxp of that phrase:
Fx = (Fxs, Fxp).
All the extracted features {F1, F2, F3, …, Fn} constitute the feature table extracted from the input text.
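The extraction step can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the function names, the nested-dict trie, the 1-based character positions and the greedy way of resolving overlaps (longest phrase first, then the earliest one) are all assumptions made for the example.

```python
from typing import List, Tuple

Feature = Tuple[str, int]  # (phrase Fxs, 1-based start position Fxp); the convention is an assumption
END = "<end>"              # multi-character marker, so it can never collide with a single text character


def build_trie(phrases: List[str]) -> dict:
    """Build a character trie from the phrase dictionary."""
    root: dict = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def extract_features(text: str, trie: dict) -> List[Feature]:
    """Return the non-overlapping dictionary phrases found in text, with positions."""
    candidates = []  # (start, end) span of every dictionary phrase found at any offset
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if END in node:
                candidates.append((start, end + 1))
    # Longer phrases win; among phrases of equal length the earlier one wins.
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    chosen: List[Tuple[int, int]] = []
    for start, end in candidates:
        if all(end <= s or start >= e for s, e in chosen):
            chosen.append((start, end))
    chosen.sort()
    return [(text[s:e], s + 1) for s, e in chosen]
```

With the hypothetical dictionary ["ab", "bcd", "de"] and the text "abcde", the sketch keeps only the longest phrase "bcd" at position 2, because both shorter phrases overlap it.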
S2, feature expansion: expanding the features extracted in step S1 by pairing the synonyms or category names of the extracted phrases with the positions of those phrases in the text to form new features. A phrase mapping table is preset; it is used for mapping the phrases in the extracted features, and the mapped values are added to the feature table as new features. A mapped value may be a synonym of the phrase or a category name of the phrase, and the same phrase may be mapped to several different values. The predefined phrase mapping table may be a multi-line text in which each line is split by a tab into two segments, the first being a phrase and the second being the value that the phrase maps to.
That is, for any extracted feature Fx = (Fxs, Fxp), if the phrase Fxs maps to a series of values Fxsm1, Fxsm2, Fxsm3, …, Fxsmn, then (Fxsm1, Fxp), (Fxsm2, Fxp), (Fxsm3, Fxp), …, (Fxsmn, Fxp) are each added to the feature table as new features.
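A minimal Python sketch of this expansion, assuming the tab-separated mapping format described above. The function names and the sample words used in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int]  # (phrase or mapped value, position)


def load_mapping(lines: List[str]) -> Dict[str, List[str]]:
    """Parse "phrase<TAB>value" lines; the same phrase may appear on several lines."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        phrase, value = line.split("\t", 1)
        mapping.setdefault(phrase, []).append(value)
    return mapping


def expand_features(features: List[Feature],
                    mapping: Dict[str, List[str]]) -> List[Feature]:
    """Append (mapped value, same position) for every mapped value of each phrase."""
    expanded = list(features)
    for phrase, position in features:
        for value in mapping.get(phrase, []):
            expanded.append((value, position))
    return expanded
```

For instance, if "visa" is mapped to both "card_brand" and "payment_card", a feature ("visa", 4) contributes two extra features, ("card_brand", 4) and ("payment_card", 4), while the original feature is kept.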
S3, generating an inverted index: building an inverted index over all the features. An inverted index table is generated from the expanded feature table: for every feature Fx = (Fxs, Fxp) in the expanded feature table, an index entry from Fxs to Fxp is created, so that the position of a phrase in the text can be looked up by the phrase itself or by a mapped value of the phrase.
The same phrase, or the same mapped value, may occur several times in the text, so a lookup may return several positions.
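The index construction can be illustrated with a short Python sketch. The function name is hypothetical, and keeping each position list sorted and free of duplicates is a design choice of this example rather than something the text prescribes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Feature = Tuple[str, int]  # (phrase or mapped value, position)


def build_inverted_index(features: List[Feature]) -> Dict[str, List[int]]:
    """Map each phrase or mapped value to the sorted list of its positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for value, position in features:
        index[value].append(position)
    # Sorted, de-duplicated position lists make later ordered lookups simple.
    return {value: sorted(set(positions)) for value, positions in index.items()}
```

A phrase that occurs twice, such as ("card", 3) and ("card", 8), ends up as a single key whose position list is [3, 8].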
S4, rule matching: matching the inverted index against the preset rule templates in sequence, and outputting the matching results. A rule template knowledge base comprising a plurality of rule templates is preset, and the inverted index is then matched against each rule template in the knowledge base; each matching result is either success or failure. A rule template is one line of text consisting of a series of values together with the ordering relations and the AND/OR/NOT logical relations between those values. At run time, all rule templates defined in text form can be compiled into in-memory objects to improve efficiency.
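The source does not specify the template syntax, so the following Python sketch covers only the ordering relation and invents a comma-separated syntax for it. The function names and the syntax are assumptions, and the logical relations between values are deliberately left out.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

CompiledTemplate = Tuple[str, ...]


def compile_template(line: str) -> CompiledTemplate:
    """Compile a hypothetical "value1,value2,..." line into a tuple of values."""
    return tuple(part.strip() for part in line.split(",") if part.strip())


def match_template(template: CompiledTemplate,
                   index: Dict[str, List[int]]) -> bool:
    """Succeed if the values occur in the text in the order the template lists them.

    Each position list in the index must be sorted in ascending order.
    """
    last = 0  # positions are assumed to be 1-based, so 0 means "before the text"
    for value in template:
        positions = index.get(value, [])
        i = bisect_right(positions, last)  # first occurrence strictly after `last`
        if i == len(positions):
            return False
        last = positions[i]
    return bool(template)  # an empty template never matches
```

Taking the earliest usable occurrence of each value is sufficient for ordered matching, and each lookup is a binary search in a position list, so the text itself is never rescanned.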
S5, outputting a result: selecting the rule template with the highest priority as the output, according to the matching results and the preset priority relations among the rule templates.
If more than one rule template matches successfully, the matching results are considered to conflict.
When a conflict exists, the successfully matched templates with lower priority are discarded according to the relative priorities among the templates preset in the rule template knowledge base.
If no conflict remains, the output result is the number of the successfully matched rule template. If the conflict persists, the output result is null, indicating that all matching is judged to have failed, and a rule-conflict warning is recorded in the log.
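The conflict handling can be sketched in Python as follows. Representing the relative priorities as a set of (higher, lower) template-number pairs is an assumption of this example, as are the function and logger names.

```python
import logging
from typing import List, Optional, Set, Tuple

logger = logging.getLogger("short_text_matching")  # hypothetical logger name


def select_result(matched: List[int],
                  higher_than: Set[Tuple[int, int]]) -> Optional[int]:
    """Pick the final template number, or None when nothing usable remains.

    `higher_than` holds pairs (a, b) meaning template a outranks template b.
    """
    # Discard every matched template that is outranked by another matched template.
    survivors = [
        t for t in matched
        if not any((other, t) in higher_than for other in matched)
    ]
    if len(survivors) == 1:
        return survivors[0]
    if len(survivors) > 1:
        # The conflict could not be resolved by the configured priorities.
        logger.warning("rule conflict among templates %s", survivors)
    return None
```

If templates 1 and 2 both match and 2 outranks 1, the result is 2. If templates 1 and 3 both match and no priority relates them, the function returns None and logs a warning.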
This embodiment also provides a short text matching device based on the inverted index, which applies the short text matching method described above and comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler, a template matcher and a template matching cache.
The rule template knowledge base contains a plurality of predefined rule templates and information on the relative priorities among them.
The feature extractor includes a preset phrase dictionary and, in operation, extracts the phrases that occur in both the phrase dictionary and the input text.
The feature expander includes a predefined phrase mapping table and, in operation, expands the features extracted by the feature extractor.
The inverted index generator generates an inverted index for the features expanded by the feature expander.
The template compiler compiles the predefined rule templates in the rule template knowledge base.
The template matcher matches the generated inverted index one by one against the objects compiled from the rule templates in the knowledge base and, if several rule templates match successfully, screens them according to the priority rules in the rule template knowledge base and outputs the final matching result.
The template matching cache provides a caching service during template matching and improves overall matching efficiency.
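The source does not say what is cached during template matching. One possible reading, shown here purely as an assumption, is to memoize the final result per input text with Python's `functools.lru_cache`. The names `make_cached_matcher` and `slow_match` are hypothetical, and `slow_match` is only a stand-in for the full S1 to S5 pipeline.

```python
from functools import lru_cache
from typing import Callable, List, Optional


def make_cached_matcher(match_text: Callable[[str], Optional[int]],
                        maxsize: int = 1024) -> Callable[[str], Optional[int]]:
    """Wrap a text -> template-number function so repeated inputs are not recomputed."""
    return lru_cache(maxsize=maxsize)(match_text)


calls: List[str] = []  # records how often the uncached function actually runs


def slow_match(text: str) -> Optional[int]:
    """Stand-in for the real pipeline; returns a template number or None."""
    calls.append(text)
    return 2 if "how to use" in text else None


cached_match = make_cached_matcher(slow_match)
cached_match("how to use it")
cached_match("how to use it")  # served from the cache; slow_match is not called again
```

A cache of this kind only helps when identical inputs recur, which is plausible for frequently asked customer service questions; caching at a finer granularity, for example per template, would be an equally valid reading.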
In summary, the short text matching method and device based on the inverted index offer part of the advanced expressive power of regular expressions while using the inverted index to speed up text matching. They can be applied in intelligent customer service and question-answering systems to match users' questions efficiently and output the result. Compared with deep learning models, the method places no requirement on data volume, but it requires practitioners to maintain some rule tables and vocabularies manually. With the method and device, flexible and complex text matching rules can be defined while the matching process remains efficient.
Example two
This embodiment carries out one short text match using the inverted-index-based short text matching method and device of Example one, wherein:
the predefined rule templates are as follows:
1. What is Huabei
2. How to use Huabei
3. Hello
and template 2 is defined to have higher priority than template 1.
The predefined phrase dictionary is as follows:
1. Huabei
2. is
3. what(a)
4. how to use
The predefined phrase mapping table is as follows:
1. what(a) → what(b)
Here what(a) and what(b) stand for two different words of the original Chinese text that both translate to "what"; what(b) is the synonym that what(a) is mapped to.
The input text is as follows:
1. Excuse me, what is Huabei? How should Huabei be used?
The following steps are then carried out (positions refer to characters of the original Chinese input):
1) The feature extractor extracts from the input text the phrases that are in the dictionary, together with their positions. In this example the extracted features are:
1. (Huabei, 3)
2. (is, 5)
3. (what(a), 6)
4. (Huabei, 8)
5. (how to use, 13)
2) The extracted features are expanded by the feature expander. In this example what(a) is mapped to what(b), so the expanded features are:
1. (Huabei, 3)
2. (is, 5)
3. (what(a), 6)
4. (Huabei, 8)
5. (how to use, 13)
6. (what(b), 6)
3) The inverted index generator builds an inverted index from the expanded features. In this example the generated inverted index is:
1. Huabei [3,8]
2. is [5]
3. what(a) [6]
4. how to use [13]
5. what(b) [6]
4) The generated inverted index is matched against all rule templates one by one. In this example the templates that match successfully are:
1. What is Huabei
2. How to use Huabei
5) The matching results are processed according to priority. In this embodiment template 2 has higher priority than template 1, so the final output is:
2. How to use Huabei
In summary, with the short text matching method and device based on the inverted index provided by the embodiments, the short text "Excuse me, what is Huabei? How should Huabei be used?" is matched with an accurate output result and high matching efficiency.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (7)
1. A short text matching method based on an inverted index, characterized in that the method comprises the following steps:
S1, feature extraction: extracting features from an input text, wherein the features consist of a plurality of phrases contained in the text and the positions of the phrases in the text;
S2, feature expansion: expanding the features extracted in step S1 by pairing the synonyms or category names of the extracted phrases with the positions of those phrases in the text to form new features;
S3, generating an inverted index: building an inverted index over all the features;
S4, rule matching: matching the inverted index against the preset rule templates in sequence, and outputting the matching results;
S5, outputting a result: selecting the rule template with the highest priority as the output, according to the matching results and the preset priority relations among the rule templates.
2. The inverted index-based short text matching method as claimed in claim 1, characterized in that the feature extraction specifically comprises:
presetting a phrase dictionary, performing phrase matching on the input text using a trie, and extracting the phrases that occur in both the phrase dictionary and the input text;
if two phrases overlap, selecting the longer phrase and discarding the shorter one; if their lengths are the same, selecting the phrase that appears earlier in the text.
3. The inverted index-based short text matching method as claimed in claim 2, characterized in that the feature expansion specifically comprises:
presetting a phrase mapping table, wherein the phrase mapping table is used for mapping the phrases in the extracted features, and the mapped values are added to the feature table as new features.
4. The inverted index-based short text matching method as claimed in claim 3, characterized in that the rule matching specifically comprises:
presetting a rule template knowledge base comprising a plurality of rule templates, and then matching the inverted index against each rule template in the rule template knowledge base, wherein each matching result is either success or failure.
5. The inverted index-based short text matching method as claimed in claim 4, characterized in that the outputting of the result specifically comprises:
for all successfully matched rule templates, if their number exceeds one, determining that the matching results conflict;
when a conflict exists, discarding the successfully matched templates with lower priority according to the relative priorities among the templates preset in the rule template knowledge base;
if no conflict remains, outputting the number of the successfully matched rule template as the output result; if the conflict persists, outputting a result indicating that all matching is judged to have failed.
6. A short text matching device based on an inverted index, characterized in that the device comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher;
the rule template knowledge base comprises a plurality of predefined rule templates and information on the relative priorities among the rule templates;
the feature extractor comprises a preset phrase dictionary and, in operation, extracts the phrases that occur in both the phrase dictionary and the input text;
the feature expander comprises a predefined phrase mapping table and, in operation, expands the features extracted by the feature extractor;
the inverted index generator is used for generating an inverted index for the features expanded by the feature expander;
the template compiler is used for compiling the predefined rule templates in the rule template knowledge base;
the template matcher is used for matching the generated inverted index one by one against the objects compiled from the rule templates in the knowledge base and, if several rule templates match successfully, for screening them according to the priority rules in the rule template knowledge base and outputting the final matching result.
7. The inverted index-based short text matching device as claimed in claim 6, characterized in that the device further comprises a template matching cache, which is used for providing a caching service during template matching and improving overall matching efficiency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010328205.8A CN111581329A (en) | 2020-04-23 | 2020-04-23 | Short text matching method and device based on inverted index |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010328205.8A CN111581329A (en) | 2020-04-23 | 2020-04-23 | Short text matching method and device based on inverted index |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111581329A (en) | 2020-08-25 |
Family
ID=72114965
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010328205.8A (Pending) CN111581329A (en) | 2020-04-23 | 2020-04-23 | Short text matching method and device based on inverted index |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111581329A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040078190A1 (en) * | 2000-09-29 | 2004-04-22 | Fass Daniel C | Method and system for describing and identifying concepts in natural language text for information retrieval and processing |
CN103902652A (en) * | 2014-02-27 | 2014-07-02 | 深圳市智搜信息技术有限公司 | Automatic question-answering system |
CN105868313A (en) * | 2016-03-25 | 2016-08-17 | 浙江大学 | Mapping knowledge domain questioning and answering system and method based on template matching technique |
Non-Patent Citations (2)
Title |
---|
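江有福等: "自然语言网络答疑系统中倒排索引技术的研究与实现" (Research and implementation of inverted index technology in a natural-language online question-answering system) * |
齐翌辰;王森淼;赵亚慧;: "基于倒排索引的问答系统的设计与实现" (Design and implementation of a question-answering system based on inverted index) * |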
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112395885A (en) * | 2020-11-27 | 2021-02-23 | 安徽迪科数金科技有限公司 | Short text semantic understanding template generation method, semantic understanding processing method and device |
CN112395885B (en) * | 2020-11-27 | 2024-01-26 | 安徽迪科数金科技有限公司 | Short text semantic understanding template generation method, semantic understanding processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649742B (en) | Database maintenance method and device | |
CN108304375B (en) | Information identification method and equipment, storage medium and terminal thereof | |
KR102256240B1 (en) | Non-factoid question-and-answer system and method | |
CN108681574B (en) | Text abstract-based non-fact question-answer selection method and system | |
CN106776564B (en) | Semantic recognition method and system based on knowledge graph | |
CN103970798B (en) | The search and matching of data | |
CN108664599B (en) | Intelligent question-answering method and device, intelligent question-answering server and storage medium | |
CN110276080B (en) | Semantic processing method and system | |
CN111104803B (en) | Semantic understanding processing method, device, equipment and readable storage medium | |
CN107665188B (en) | Semantic understanding method and device | |
CN109508441B (en) | Method and device for realizing data statistical analysis through natural language and electronic equipment | |
CN105893351B (en) | Audio recognition method and device | |
CN108108344B (en) | Method and device for jointly recognizing and connecting named entities | |
CN111178076A (en) | Named entity identification and linking method, device, equipment and readable storage medium | |
CN113742446A (en) | Knowledge graph question-answering method and system based on path sorting | |
CN110825840B (en) | Word bank expansion method, device, equipment and storage medium | |
CN117539990A (en) | Problem processing method and device, electronic equipment and storage medium | |
JPH0922414A (en) | Document sorting supporting method and its device | |
CN106653006A (en) | Search method and device based on voice interaction | |
CN117725183A (en) | Reordering method and device for improving retrieval performance of AI large language model | |
CN111581329A (en) | Short text matching method and device based on inverted index | |
CN117828057A (en) | Knowledge question-answering method, device, equipment and storage medium | |
CN113190692A (en) | Self-adaptive retrieval method, system and device for knowledge graph | |
CN110750632B (en) | Improved Chinese ALICE intelligent question-answering method and system | |
CN117609460A (en) | Intelligent question-answering method and device based on keyword semantic decomposition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200825 |