CN111581329A - Short text matching method and device based on inverted index - Google Patents


Info

Publication number
CN111581329A
Authority
CN
China
Prior art keywords
matching
rule
template
inverted index
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010328205.8A
Other languages
Chinese (zh)
Inventor
陈恒生
叶浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Duiguan Information Technology Co ltd
Original Assignee
Shanghai Duiguan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Duiguan Information Technology Co ltd filed Critical Shanghai Duiguan Information Technology Co ltd
Priority to CN202010328205.8A priority Critical patent/CN111581329A/en
Publication of CN111581329A publication Critical patent/CN111581329A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of natural language processing and provides a short text matching method and device based on an inverted index. The method comprises: extracting features from the input text, matching the extracted features against the rule templates in a knowledge base one by one, and finding the most appropriate template. In particular, after the features are extracted, the invention builds an inverted index for the input text, which optimizes the computation performed during matching and greatly accelerates the one-by-one matching against the templates in the knowledge base. The device comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher. The invention can be applied to question matching in intelligent customer service and question-answering systems, or to user input matching in other information retrieval scenarios; it supports flexible and complex text matching rules while keeping the matching process efficient.

Description

Short text matching method and device based on inverted index
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a short text matching method and device based on an inverted index.
Background
Natural language processing is the technology that studies interaction between humans and computers through natural language, and text matching is an important task within it. In a question-answering system, a user's question can be answered by matching the text of the question against all questions in a pre-built knowledge base and returning the answer to the matched question. An inverted index is a technique for looking up records by the value of an attribute; it is commonly used in information retrieval to speed up full-text search in search engines.
At present, text matching generally takes two forms: matching between text and text, and matching between text and rule templates. Text-to-text matching is simple to use, but its semantic matching is often not accurate enough; deep-learning-based techniques have made some breakthroughs in accuracy, but they require large amounts of data and are not accurate enough when data is scarce. For rule templates, regular expressions require a certain amount of expertise, are unintuitive and error-prone, and, particularly when operators occur many times, their fuzzy matching mechanism degrades performance.
In addition, the method proposed in CN201811241976 is simpler to use than regular expressions and performs well, but its matching capability is not strong enough: it cannot support rules based on whether one phrase appears before or after another.
Disclosure of Invention
The invention provides a short text matching method and device based on an inverted index, intended mainly for intelligent customer service question answering, where it is used to match users' questions accurately so that they can be answered correctly.
The invention is realized as follows. The short text matching method based on an inverted index includes the following steps:
S1, feature extraction: extracting features from an input text, wherein the features consist of the phrases contained in the text and the positions of those phrases in the text;
S2, feature expansion: expanding the features extracted in step S1 by pairing the synonyms or category names of the extracted phrases with the positions of those phrases in the text to form new features;
S3, inverted index generation: building an inverted index over all the features;
S4, rule matching: matching the inverted index against the preset rule templates in turn, and outputting the matching results;
S5, result output: selecting the rule template with the highest priority as the output, according to the matching results and the preset priority relations among the rule templates.
Preferably, the feature extraction specifically comprises:
presetting a phrase dictionary, performing phrase matching on the input text by using a trie, and extracting the phrases that occur in both the phrase dictionary and the input text;
if two phrases overlap, selecting the longer phrase and discarding the shorter one; if their lengths are the same, selecting the phrase that appears first.
Preferably, the feature expansion specifically comprises:
presetting a phrase mapping table, which is used for mapping the phrases in the extracted features, the mapped values being added to the feature table as new features.
Preferably, the rule matching specifically comprises:
presetting a rule template knowledge base containing a plurality of rule templates, and then matching the inverted index against each rule template in the rule template knowledge base, wherein each match either succeeds or fails.
Preferably, the result output specifically comprises:
for all the successfully matched rule templates, if there is more than one, determining that the matching results conflict;
when a conflict exists, discarding the successfully matched templates with lower priority according to the relative priorities among templates preset in the rule template knowledge base;
if no conflict remains, outputting the number of the successfully matched rule template as the result; and if the conflict persists, outputting a result indicating that all matching is judged to have failed.
The invention also provides a short text matching device based on an inverted index, characterized in that it comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher;
the rule template knowledge base contains a plurality of predefined rule templates and information on the relative priorities among the rule templates;
the feature extractor contains a preset phrase dictionary and, at run time, extracts the phrases that occur in both the phrase dictionary and the input text;
the feature expander contains a predefined phrase mapping table and, at run time, expands the features extracted by the feature extractor;
the inverted index generator is used for generating an inverted index for the features expanded by the feature expander;
the template compiler is used for compiling the predefined rule templates in the rule template knowledge base;
and the template matcher is used for matching the generated inverted index against the objects compiled from the rule templates in the knowledge base one by one and, if several rule templates match successfully, screening them according to the priority rules in the rule template knowledge base and outputting the final matching result.
Preferably, the device further comprises a template matching cache, which provides a caching service during template matching and improves overall matching efficiency.
Compared with the prior art, the invention has the following beneficial effects: the short text matching method and device based on an inverted index extract features from the input text, match the extracted features against the rule templates in the knowledge base one by one, and find the most appropriate template. After the features are extracted, an inverted index is built for the input text, which optimizes the computation performed during matching and greatly accelerates the one-by-one matching against the templates in the knowledge base. Flexible and complex text matching rules can therefore be set while the matching process remains efficient.
Drawings
Fig. 1 is a flowchart illustrating a short text matching method based on inverted indexes according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
Referring to Fig. 1, the present embodiment provides a technical solution: a short text matching method based on an inverted index, comprising the following steps:
S1, feature extraction: features are extracted from the input text, wherein the features consist of the phrases contained in the text and the positions of those phrases in the text.
A phrase dictionary is preset, phrase matching is performed on the input text by using a trie, and the phrases that occur in both the phrase dictionary and the input text are extracted. The predefined dictionary may be a multi-line text with one phrase per line.
If two phrases overlap, the longer phrase is selected and the shorter one is discarded. If their lengths are the same, the phrase that appears first is selected.
A feature Fx consists of an extracted phrase Fxs and the position Fxp of that phrase:
Fx = (Fxs, Fxp).
All the extracted features {F1, F2, F3, …, Fn} constitute the feature table extracted from the input text.
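The extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `build_trie` and `extract_features`, the use of 1-based character positions, and the greedy longest-first way of resolving overlaps are all choices made for the example.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int]  # (phrase Fxs, position Fxp); 1-based start is an assumption


def build_trie(phrases: List[str]) -> Dict:
    """Build a nested-dict trie; the key None marks the end of a dictionary phrase."""
    root: Dict = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def extract_features(text: str, trie: Dict) -> List[Feature]:
    """Return dictionary phrases found in text, with overlaps resolved."""
    # Collect every dictionary phrase starting at every offset (0-based, half-open spans).
    candidates: List[Tuple[int, int]] = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if None in node:
                candidates.append((start, end + 1))

    # Prefer longer phrases; among equal lengths prefer the earlier one.
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    kept: List[Tuple[int, int]] = []
    for start, end in candidates:
        # Keep a span only if it does not overlap any span already kept.
        if all(end <= s or start >= e for s, e in kept):
            kept.append((start, end))

    kept.sort()
    return [(text[s:e], s + 1) for s, e in kept]
```

For instance, with the hypothetical dictionary `["ab", "abc", "cd", "d"]` and the text `"xabcd"`, the phrase `"abc"` wins over the overlapping `"ab"` and `"cd"`, and `"d"` survives because it no longer overlaps anything that was kept.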
S2, feature expansion: the features extracted in step S1 are expanded by pairing the synonyms or category names of the extracted phrases with the positions of those phrases in the text to form new features. A phrase mapping table is preset; it is used for mapping the phrases in the extracted features, and the mapped values are added to the feature table as new features. A mapped value may be a synonym of the phrase or a category name of the phrase, and the same phrase may be mapped to several different values. The predefined phrase mapping table may be a multi-line text in which each line is split by a tab into two segments, the first being the phrase and the second being the mapped value.
That is, for any extracted feature Fx = (Fxs, Fxp), if the phrase Fxs maps to a series of values Fxsm1, Fxsm2, Fxsm3, …, Fxsmn, then (Fxsm1, Fxp), (Fxsm2, Fxp), (Fxsm3, Fxp), …, (Fxsmn, Fxp) are each added to the feature table as new features.
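A small sketch of the expansion step, assuming the tab-separated mapping format described above. The function names `load_mapping` and `expand_features` and the sample phrases are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int]  # (phrase or mapped value, position)


def load_mapping(lines: List[str]) -> Dict[str, List[str]]:
    """Parse 'phrase<TAB>mapped value' lines; one phrase may map to several values."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        phrase, value = line.split("\t", 1)
        mapping.setdefault(phrase, []).append(value)
    return mapping


def expand_features(features: List[Feature],
                    mapping: Dict[str, List[str]]) -> List[Feature]:
    """Append (mapped value, same position) for every mapped value of every phrase."""
    expanded = list(features)
    for phrase, position in features:
        for value in mapping.get(phrase, []):
            expanded.append((value, position))
    return expanded
```

The original features are kept and the new ones are appended, so a rule template may refer either to the literal phrase or to any of its mapped values.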
S3, inverted index generation: an inverted index is built over all the features. An inverted index table is generated from the expanded feature table: for every feature Fx = (Fxs, Fxp) in the expanded feature table, an index entry from Fxs to Fxp is created, so that the position of a phrase in the text can be looked up by the phrase itself or by a mapped value of the phrase.
The same phrase, or the same mapped value, may occur more than once, so a lookup may return several position values.
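The index construction can be illustrated in a few lines. This is a sketch assuming features are `(value, position)` pairs; the name `build_inverted_index` and the choice to store sorted, de-duplicated position lists are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Feature = Tuple[str, int]  # (phrase or mapped value, position)


def build_inverted_index(features: List[Feature]) -> Dict[str, List[int]]:
    """Map each phrase or mapped value to the sorted list of positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for value, position in features:
        index[value].append(position)
    # Sorting lets later matching steps use binary search over the positions.
    return {value: sorted(set(positions)) for value, positions in index.items()}
```

A value that occurs twice in the text, such as a hypothetical `"A"` at positions 3 and 8, ends up with the entry `[3, 8]`.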
S4, rule matching: the inverted index is matched against the preset rule templates in turn, and the matching results are output. A rule template knowledge base containing a plurality of rule templates is preset, and the inverted index is matched against each rule template in it; each match either succeeds or fails. A rule template is a line of text consisting of a series of values together with the order among those values and AND/OR/NOT logical relations. At run time, all rule templates defined in text form can be compiled into in-memory objects to improve efficiency.
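The text does not give the template syntax, so the following sketch skips parsing and shows only what compiled template objects might look like and how they could be evaluated against the inverted index. The classes `Seq`, `And`, `Or` and `Not` are hypothetical, and the ordering test (each value must occur strictly after the previous one) is an assumed reading of "the order among those values".

```python
import bisect
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Index = Dict[str, List[int]]  # value -> sorted positions (assumed sorted)


@dataclass(frozen=True)
class Seq:
    """Values that must occur in the given order (a single value is a length-1 Seq)."""
    values: Tuple[str, ...]

    def matches(self, index: Index) -> bool:
        prev = 0  # positions are assumed 1-based, so 0 precedes all of them
        for value in self.values:
            positions = index.get(value, [])
            i = bisect.bisect_right(positions, prev)  # first position > prev
            if i == len(positions):
                return False
            prev = positions[i]
        return True


@dataclass(frozen=True)
class And:
    parts: Tuple["Rule", ...]

    def matches(self, index: Index) -> bool:
        return all(part.matches(index) for part in self.parts)


@dataclass(frozen=True)
class Or:
    parts: Tuple["Rule", ...]

    def matches(self, index: Index) -> bool:
        return any(part.matches(index) for part in self.parts)


@dataclass(frozen=True)
class Not:
    part: "Rule"

    def matches(self, index: Index) -> bool:
        return not self.part.matches(index)


Rule = Union[Seq, And, Or, Not]
```

Because each lookup is a dictionary access followed by a binary search, evaluating a template never rescans the input text, which is the efficiency benefit the inverted index is meant to provide.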
S5, result output: the rule template with the highest priority is selected as the output, according to the matching results and the preset priority relations among the rule templates.
For all the successfully matched rule templates, if there is more than one, the matching results are determined to conflict.
When a conflict exists, the successfully matched templates with lower priority are discarded according to the relative priorities among templates preset in the rule template knowledge base.
If no conflict remains, the output is the number of the successfully matched rule template. If the conflict persists, the output is null, indicating that all matching is judged to have failed, and a rule-conflict warning is recorded in the log.
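The conflict handling above can be sketched as follows. The representation of relative priority as a set of `(higher, lower)` pairs, the function name `resolve`, and the logger name are assumptions for illustration; only directly declared pairs are consulted, so no transitivity is inferred.

```python
import logging
from typing import Iterable, Optional, Set, Tuple

logger = logging.getLogger("rule_matcher")  # hypothetical logger name


def resolve(matched: Iterable[int],
            higher_than: Set[Tuple[int, int]]) -> Optional[int]:
    """Return the number of the winning rule template, or None if there is no winner.

    higher_than holds (higher, lower) pairs of rule numbers.
    """
    matched_set = set(matched)
    # Discard every matched template that is outranked by another matched template.
    survivors = {
        rule for rule in matched_set
        if not any((other, rule) in higher_than for other in matched_set)
    }
    if len(survivors) == 1:
        return next(iter(survivors))
    if len(survivors) > 1:
        # The conflict persists: treat the whole match as failed and log a warning.
        logger.warning("rule conflict among templates %s", sorted(survivors))
    return None
```

With rules 1 and 2 both matched and rule 2 declared higher than rule 1, `resolve([1, 2], {(2, 1)})` returns 2; with two matched rules that have no declared relation, it returns `None` and logs a warning.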
This embodiment also provides a short text matching device based on an inverted index, which applies the short text matching method described above and comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler, a template matcher and a template matching cache.
The rule template knowledge base contains a plurality of predefined rule templates and information on the relative priorities among the rule templates.
The feature extractor contains a preset phrase dictionary and, at run time, extracts the phrases that occur in both the phrase dictionary and the input text.
The feature expander contains a predefined phrase mapping table and, at run time, expands the features extracted by the feature extractor.
The inverted index generator is used for generating an inverted index for the features expanded by the feature expander.
The template compiler is used for compiling the predefined rule templates in the rule template knowledge base.
The template matcher is used for matching the generated inverted index against the objects compiled from the rule templates in the knowledge base one by one and, if several rule templates match successfully, screening them according to the priority rules in the rule template knowledge base and outputting the final matching result.
The template matching cache provides a caching service during template matching and improves overall matching efficiency.
In summary, the short text matching method and device based on an inverted index of the present invention offer part of the advanced expressive power of regular expressions while using the inverted index technique to speed up text matching; they can be applied to intelligent customer service and question-answering systems to match user questions efficiently and output the result. Compared with a deep learning model, the method places no requirement on data volume, but requires practitioners to manually maintain some rule tables and word lists. With the method and device, flexible and complex text matching rules can be set while the matching process remains efficient.
Example two
This embodiment carries out one short text match using the short text matching method and device based on an inverted index provided in Example one, wherein:
The predefined rule templates are as follows:
1. What is Huabei
2. How to use Huabei
3. Hello
Rule 2 is defined to have a higher priority than rule 1.
The predefined phrase dictionary is as follows:
1. Huabei
2. is
3. what (variant)
4. how to use
The predefined phrase mapping table is as follows:
1. what (variant), mapped to: what
The input text is as follows:
1. May I ask what Huabei is? How should Huabei be used?
Then the following steps are carried out:
1) The feature extractor extracts from the input text the phrases found in the dictionary together with their positions (the positions refer to the original Chinese input text). In this example, the extracted features are:
1. (Huabei, 3)
2. (is, 5)
3. (what (variant), 6)
4. (Huabei, 8)
5. (how to use, 13)
2) The extracted features are expanded by the feature expander. In this example, "what (variant)" is mapped to "what", so the expanded features are:
1. (Huabei, 3)
2. (is, 5)
3. (what (variant), 6)
4. (Huabei, 8)
5. (how to use, 13)
6. (what, 6)
3) The inverted index generator builds an inverted index from the expanded features. In this example, the generated inverted index is:
1. Huabei [3, 8]
2. is [5]
3. what (variant) [6]
4. how to use [13]
5. what [6]
4) The generated inverted index is matched against all rule templates one by one. In this example, the successfully matched rule templates are:
1. What is Huabei
2. How to use Huabei
5) The matching results are processed according to priority. In this embodiment, rule 2 has a higher priority than rule 1, so the final matching result output is:
2. How to use Huabei
In summary, with the short text matching method and device based on an inverted index provided in this embodiment, short text matching is achieved for the input "May I ask what Huabei is? How should Huabei be used?"; the output is accurate and the matching is efficient.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A short text matching method based on an inverted index, characterized in that the method comprises the following steps:
S1, feature extraction: extracting features from an input text, wherein the features consist of the phrases contained in the text and the positions of those phrases in the text;
S2, feature expansion: expanding the features extracted in step S1 by pairing the synonyms or category names of the extracted phrases with the positions of those phrases in the text to form new features;
S3, inverted index generation: building an inverted index over all the features;
S4, rule matching: matching the inverted index against the preset rule templates in turn, and outputting the matching results;
S5, result output: selecting the rule template with the highest priority as the output, according to the matching results and the preset priority relations among the rule templates.
2. The inverted index-based short text matching method according to claim 1, characterized in that the feature extraction specifically comprises:
presetting a phrase dictionary, performing phrase matching on the input text by using a trie, and extracting the phrases that occur in both the phrase dictionary and the input text;
if two phrases overlap, selecting the longer phrase and discarding the shorter one; if their lengths are the same, selecting the phrase that appears first.
3. The inverted index-based short text matching method according to claim 2, characterized in that the feature expansion specifically comprises:
presetting a phrase mapping table, which is used for mapping the phrases in the extracted features, the mapped values being added to the feature table as new features.
4. The inverted index-based short text matching method according to claim 3, characterized in that the rule matching specifically comprises:
presetting a rule template knowledge base containing a plurality of rule templates, and then matching the inverted index against each rule template in the rule template knowledge base, wherein each match either succeeds or fails.
5. The inverted index-based short text matching method according to claim 4, characterized in that the result output specifically comprises:
for all the successfully matched rule templates, if there is more than one, determining that the matching results conflict;
when a conflict exists, discarding the successfully matched templates with lower priority according to the relative priorities among templates preset in the rule template knowledge base;
if no conflict remains, outputting the number of the successfully matched rule template as the result; and if the conflict persists, outputting a result indicating that all matching is judged to have failed.
6. A short text matching device based on an inverted index, characterized in that it comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher;
the rule template knowledge base contains a plurality of predefined rule templates and information on the relative priorities among the rule templates;
the feature extractor contains a preset phrase dictionary and, at run time, extracts the phrases that occur in both the phrase dictionary and the input text;
the feature expander contains a predefined phrase mapping table and, at run time, expands the features extracted by the feature extractor;
the inverted index generator is used for generating an inverted index for the features expanded by the feature expander;
the template compiler is used for compiling the predefined rule templates in the rule template knowledge base;
and the template matcher is used for matching the generated inverted index against the objects compiled from the rule templates in the knowledge base one by one and, if several rule templates match successfully, screening them according to the priority rules in the rule template knowledge base and outputting the final matching result.
7. The inverted index-based short text matching device according to claim 6, characterized in that it further comprises a template matching cache, which provides a caching service during template matching and improves overall matching efficiency.
CN202010328205.8A 2020-04-23 2020-04-23 Short text matching method and device based on inverted index Pending CN111581329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010328205.8A CN111581329A (en) 2020-04-23 2020-04-23 Short text matching method and device based on inverted index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010328205.8A CN111581329A (en) 2020-04-23 2020-04-23 Short text matching method and device based on inverted index

Publications (1)

Publication Number Publication Date
CN111581329A true CN111581329A (en) 2020-08-25

Family

ID=72114965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010328205.8A Pending CN111581329A (en) 2020-04-23 2020-04-23 Short text matching method and device based on inverted index

Country Status (1)

Country Link
CN (1) CN111581329A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040078190A1 (en) * 2000-09-29 2004-04-22 Fass Daniel C Method and system for describing and identifying concepts in natural language text for information retrieval and processing
CN103902652A (en) * 2014-02-27 2014-07-02 深圳市智搜信息技术有限公司 Automatic question-answering system
CN105868313A (en) * 2016-03-25 2016-08-17 浙江大学 Mapping knowledge domain questioning and answering system and method based on template matching technique

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
江有福等: "自然语言网络答疑系统中倒排索引技术的研究与实现" *
齐翌辰;王森淼;赵亚慧;: "基于倒排索引的问答系统的设计与实现" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395885A (en) * 2020-11-27 2021-02-23 安徽迪科数金科技有限公司 Short text semantic understanding template generation method, semantic understanding processing method and device
CN112395885B (en) * 2020-11-27 2024-01-26 安徽迪科数金科技有限公司 Short text semantic understanding template generation method, semantic understanding processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200825

RJ01 Rejection of invention patent application after publication