CN111581329A

CN111581329A - Short text matching method and device based on inverted index

Info

Publication number: CN111581329A
Application number: CN202010328205.8A
Authority: CN
Inventors: 陈恒生; 叶浩
Original assignee: Shanghai Duiguan Information Technology Co ltd
Current assignee: Shanghai Duiguan Information Technology Co ltd
Priority date: 2020-04-23
Filing date: 2020-04-23
Publication date: 2020-08-25

Abstract

The invention belongs to the technical field of natural language processing and provides a short text matching method and device based on an inverted index. The method extracts features from the input text, matches the extracted features against the rule templates in a knowledge base one by one, and finds the most appropriate template. In particular, after the features are extracted, the invention builds an inverted index over the input text, which improves computational efficiency during matching and greatly accelerates the one-by-one matching against the templates in the knowledge base. The device comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher. The invention can be applied to question matching in intelligent customer service and question-answering systems, or to matching user input in other information retrieval scenarios; it supports flexible and complex text matching rules while keeping the matching process efficient.

Description

Short text matching method and device based on inverted index

Technical Field

The invention belongs to the field of natural language processing, and particularly relates to a short text matching method and device based on an inverted index.

Background

Natural language processing studies interaction between humans and computers through natural language, and text matching is one of its important tasks. In a question-answering system, a user's question can be answered by matching the text of the question against all questions in a pre-built knowledge base and returning the answer to the matched question. Text matching generally includes matching between text and text, and matching between text and rule templates. An inverted index is a technique for looking up records by the value of an attribute; it is widely used in information retrieval to speed up full-text search in search engines.

At present, text matching is generally done in two ways: matching between texts, and matching between text and rule templates. Text-to-text matching is simple to use, but its semantic matching is often not accurate enough; deep learning techniques have achieved certain breakthroughs in accuracy, but they require large amounts of data and are not accurate enough when data is scarce. Regular expressions, a common form of rule template, require a certain amount of specialist knowledge, are unintuitive and error-prone, and, particularly when operators occur many times, their fuzzy matching mechanism can degrade performance exponentially.

In addition, the method proposed in CN201811241976 is simpler to use than regular expressions and runs fast, but its matching capability is not strong enough: it cannot support rules based on the relative order of phrases.

Disclosure of Invention

The invention provides a short text matching method and device based on an inverted index, mainly intended for intelligent customer service question answering, so that users' questions are matched accurately and answered correctly.

The invention is realized as follows. The short text matching method based on the inverted index includes the following steps:

S1, feature extraction: extracting features from an input text, wherein the features consist of a plurality of phrases contained in the text and the positions of the phrases in the text;

S2, feature expansion: expanding the features extracted in step S1, wherein the synonyms or category names of the extracted phrases, paired with the positions of those phrases in the text, serve as new features;

S3, generating an inverted index: building an inverted index over all the features;

S4, rule matching: matching the inverted index against the preset rule templates in turn, and outputting the matching results;

S5, outputting a result: selecting the rule template with the highest priority as the output, according to the matching results and the preset priority relations among the rule templates.

Preferably, the feature extraction specifically comprises:

presetting a phrase dictionary, performing phrase matching on the input text using a trie, and extracting the phrases that exist in both the phrase dictionary and the input text;

if two phrases overlap each other, selecting the longer phrase and discarding the shorter one; if their lengths are the same, selecting the phrase that appears earlier.

Preferably, the feature expansion specifically comprises:

presetting a phrase mapping table, wherein the phrase mapping table is used to map the phrases in the extracted features, and the mapped values are added to the feature table as new features.

Preferably, the rule matching specifically includes:

presetting a rule template knowledge base, wherein the rule template knowledge base comprises a plurality of rule templates, and then matching the inverted index with each rule template in the rule template knowledge base, wherein each matching result is success or failure.

Preferably, outputting the result specifically comprises:

if more than one rule template is successfully matched, determining that the matching results are in conflict;

when a conflict exists, discarding the successfully matched templates with lower priority according to the relative priorities among templates preset in the rule template knowledge base;

if no conflict remains, outputting the number of the successfully matched rule template as the result; and if the conflict persists, outputting the result that the match as a whole is judged to have failed.

The invention also provides a short text matching device based on the inverted index, characterized in that the device comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher;

the rule template knowledge base comprises a plurality of predefined rule templates and information of relative priorities among the rule templates;

the feature extractor comprises a preset phrase dictionary, and is used for extracting phrases existing in the phrase dictionary and the input text at the same time during operation;

the feature extender comprises a predefined phrase mapping table and is used for extending the features extracted by the feature extractor during operation;

the inverted index generator is used for generating an inverted index over the features expanded by the feature expander;

the template compiler is used for compiling the predefined rule template in the rule template knowledge base;

and the template matcher is used for matching the generated inverted indexes with the objects compiled by the rule templates in the knowledge base one by one, and screening and outputting a final matching result according to the priority rule in the rule template knowledge base if a plurality of successfully matched rule templates exist.

Preferably, the device further comprises a template matching cache, which provides a caching service during template matching and improves overall matching efficiency.

Compared with the prior art, the invention has the following beneficial effects. The short text matching method and device based on the inverted index extract features from the input text, match the extracted features against the rule templates in the knowledge base one by one, and find the most appropriate template. After the features are extracted, an inverted index is built over the input text, which improves computational efficiency during matching and greatly accelerates the one-by-one matching against the templates in the knowledge base. Flexible and complex text matching rules can therefore be defined while the matching process remains efficient.

Drawings

Fig. 1 is a flowchart illustrating a short text matching method based on inverted indexes according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example one

Referring to Fig. 1, this embodiment provides a short text matching method based on an inverted index, comprising the following steps:

S1, feature extraction: features are extracted from the input text, where the features consist of a plurality of phrases contained in the text and the positions of those phrases in the text.

A phrase dictionary is preset, phrase matching is performed on the input text using a trie, and the phrases that exist in both the phrase dictionary and the input text are extracted. The predefined dictionary may be multiple lines of text, with one phrase per line.

If two phrases overlap each other, the longer phrase is selected and the shorter one is discarded. If their lengths are the same, the phrase that appears earlier is selected.

A feature Fx consists of an extracted phrase Fxs and the phrase's position Fxp:

Fx = (Fxs, Fxp).

All the extracted features {F1, F2, F3, …, Fn} constitute the feature table extracted from the input text.
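Step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `build_trie`, `find_all` and `extract_features` are hypothetical, positions are assumed to be 1-based character offsets, and the overlap rule is read as a global "longest first, then earliest" selection, which is only one possible interpretation of the text above.

```python
from typing import List, Tuple

Feature = Tuple[str, int]  # (phrase Fxs, position Fxp); 1-based offset is an assumption

def build_trie(phrases: List[str]) -> dict:
    """Build a nested-dict trie from the phrase dictionary."""
    root: dict = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["$end"] = True  # multi-character key, cannot collide with a single character
    return root

def find_all(text: str, trie: dict) -> List[Feature]:
    """Return every dictionary phrase occurring in text, with its start position."""
    found = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if "$end" in node:
                found.append((text[start:end + 1], start + 1))
    return found

def extract_features(text: str, trie: dict) -> List[Feature]:
    """Resolve overlaps: the longer phrase wins; on equal length the earlier one wins."""
    candidates = sorted(find_all(text, trie), key=lambda f: (-len(f[0]), f[1]))
    taken = [False] * len(text)  # characters already covered by a kept phrase
    kept = []
    for phrase, pos in candidates:
        span = range(pos - 1, pos - 1 + len(phrase))
        if not any(taken[i] for i in span):
            for i in span:
                taken[i] = True
            kept.append((phrase, pos))
    return sorted(kept, key=lambda f: f[1])
```

With the dictionary `["ab", "bcd", "de"]` and the text `"abcde"`, only `("bcd", 2)` is kept, because both shorter phrases overlap it.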

S2, feature expansion: the features extracted in step S1 are expanded, and the synonyms or category names of the extracted phrases, paired with the positions of those phrases in the text, are used as new features. A phrase mapping table is preset; it is used to map the phrases in the extracted features, and the mapped values are added to the feature table as new features. A mapped value may be a synonym of the phrase or a category name of the phrase, and the same phrase may be mapped to several different values.

That is, for any extracted feature Fx = (Fxs, Fxp), the phrase Fxs maps to a series of values Fxsm1, Fxsm2, Fxsm3, …, Fxsmn, and (Fxsm1, Fxp), (Fxsm2, Fxp), (Fxsm3, Fxp), …, (Fxsmn, Fxp) are each added to the feature table as new features.

The predefined phrase mapping table may be multiple lines of text, where each line is split into two segments by a tab character: the first segment is a phrase and the second segment is the value the phrase maps to.
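The expansion step can be sketched as below, assuming the tab-separated mapping format just described. The names `load_mapping` and `expand_features` and the sample entries are illustrative assumptions, not taken from the patent.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Feature = Tuple[str, int]  # (phrase or mapped value, position)

def load_mapping(lines: List[str]) -> Dict[str, List[str]]:
    """Parse 'phrase<TAB>value' lines; one phrase may map to several values."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        phrase, value = line.split("\t", 1)
        mapping[phrase].append(value)
    return dict(mapping)

def expand_features(features: List[Feature],
                    mapping: Dict[str, List[str]]) -> List[Feature]:
    """Add (mapped value, same position) for every mapped value of each phrase."""
    expanded = list(features)
    for phrase, pos in features:
        for value in mapping.get(phrase, []):
            expanded.append((value, pos))
    return expanded
```

For instance, with the lines `"car\tvehicle"` and `"car\tautomobile"`, the feature `("car", 4)` gains the new features `("vehicle", 4)` and `("automobile", 4)`, both at the position of the original phrase.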

S3, generating an inverted index: an inverted index is built over all features. An inverted index table is generated from the expanded feature table: for every feature Fx = (Fxs, Fxp) in the expanded feature table, an index entry from Fxs to Fxp is created, so that the position of a phrase in the text can be looked up by the phrase itself or by its mapped value.

The same phrase, or the same mapped value, may occur multiple times, so a lookup may return multiple positions in the text.
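A minimal sketch of the index construction, assuming features are `(value, position)` pairs; the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def build_inverted_index(features: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Map each phrase or mapped value to the sorted list of its positions."""
    index = defaultdict(list)
    for value, pos in features:
        if pos not in index[value]:  # ignore exact duplicate features
            index[value].append(pos)
    return {value: sorted(positions) for value, positions in index.items()}
```

A value that occurs twice in the text, for example at positions 3 and 8, ends up with the posting list `[3, 8]`, so a rule can test both presence and order with dictionary lookups instead of rescanning the text.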

S4, rule matching: the inverted index is matched against the preset rule templates in turn, and the matching results are output. A rule template knowledge base is preset that contains multiple rule templates; the inverted index is matched against each rule template in the knowledge base, and each match either succeeds or fails. A rule template is a line of text consisting of a series of values together with the order relations and the AND/OR/NOT logical relations between those values. At run time, all rule templates defined in text form can be compiled into in-memory objects to improve efficiency.
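The patent does not specify the textual syntax of a rule template, so the sketch below invents a small one purely for illustration: tokens are separated by whitespace and must occur in the given order, `a|b` means either value may occur, and a leading `!` means the value must not occur. The names `Term`, `compile_template` and `match` are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Index = Dict[str, List[int]]  # value -> positions, as produced in step S3

@dataclass(frozen=True)
class Term:
    alternatives: Tuple[str, ...]  # OR: any one of these values may occur
    negated: bool = False          # NOT: none of these values may occur

def compile_template(line: str) -> List[Term]:
    """Compile one template line (hypothetical syntax) into in-memory terms."""
    terms = []
    for token in line.split():
        negated = token.startswith("!")
        alternatives = tuple(a for a in token.lstrip("!").split("|") if a)
        terms.append(Term(alternatives, negated))
    return terms

def match(terms: List[Term], index: Index) -> bool:
    """All terms must hold (AND); positive terms must occur in increasing order."""
    last_pos = 0  # smaller than any 1-based position
    for term in terms:
        positions = sorted(p for v in term.alternatives for p in index.get(v, []))
        if term.negated:
            if positions:
                return False
            continue
        later = [p for p in positions if p > last_pos]
        if not later:
            return False
        last_pos = later[0]  # taking the earliest position keeps later terms matchable
    return True
```

Because each term is resolved through posting-list lookups, the cost of matching one template depends on the number of terms and their occurrences rather than on the length of the input text.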

If more than one rule template is successfully matched, the matching results are considered to be in conflict.

When a conflict exists, the successfully matched templates with lower priority are discarded according to the relative priorities among templates preset in the rule template knowledge base.

If no conflict remains, the output is the number of the single successfully matched rule template. If the conflict persists, the output is null, indicating that the match as a whole is judged to have failed, and a rule-conflict warning is recorded in the log.
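The conflict handling can be sketched as follows, assuming relative priorities are stored as `(higher, lower)` pairs of template numbers. The function name `resolve`, the logger name and the pair representation are assumptions, and priorities are not treated as transitive here.

```python
import logging
from typing import Iterable, Optional, Set, Tuple

logger = logging.getLogger("rule_matcher")  # hypothetical logger name

def resolve(matched: Iterable[int],
            higher_than: Set[Tuple[int, int]]) -> Optional[int]:
    """Return the single surviving template number, or None on failure/conflict."""
    matched = set(matched)
    # Discard every matched template that is outranked by another matched template.
    survivors = {
        t for t in matched
        if not any((other, t) in higher_than for other in matched)
    }
    if len(survivors) == 1:
        return next(iter(survivors))
    if len(survivors) > 1:
        logger.warning("rule conflict among templates %s", sorted(survivors))
    return None
```

With templates 1 and 2 both matched and the pair `(2, 1)` declared, template 1 is discarded and 2 is returned; if templates 1 and 3 both match and no priority relates them, the result is `None` and a warning is logged.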

This embodiment also provides a short text matching device based on the inverted index, which applies the short text matching method described above and comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler, a template matcher and a template matching cache.

The rule template knowledge base contains a predefined plurality of rule templates and information of relative priorities between the plurality of rule templates.

The feature extractor includes a pre-set phrase dictionary, and the feature extractor is operable to extract phrases that are present in both the phrase dictionary and the input text.

The feature extender comprises a predefined phrase mapping table and extends the features extracted by the feature extractor during operation.

The inverted index generator is used for generating an inverted index over the features expanded by the feature expander.

The template compiler is used for compiling the predefined rule templates in the rule template knowledge base.

The template matcher is used for matching the generated inverted indexes with the objects compiled by the rule templates in the knowledge base one by one, and if a plurality of successfully matched rule templates exist, screening according to the priority rules in the rule template knowledge base and outputting a final matching result.

The template matching cache provides a caching service during template matching and improves overall matching efficiency.
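The text does not say what the cache stores. One plausible reading, sketched here as an assumption, is that the final result is memoized per input text, which helps when the same question is asked repeatedly. `make_cached_matcher` and `match_text` are hypothetical names; `functools.lru_cache` is the standard-library memoization decorator.

```python
from functools import lru_cache
from typing import Callable, Optional

def make_cached_matcher(match_text: Callable[[str], Optional[int]],
                        max_entries: int = 10000):
    """Wrap a full-pipeline matcher so repeated inputs skip re-matching."""
    @lru_cache(maxsize=max_entries)
    def cached(text: str) -> Optional[int]:
        return match_text(text)
    return cached
```

Under this design the cache has to be emptied with `cached.cache_clear()` whenever the rule templates, dictionary or mapping table change, since stale results would otherwise be returned.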

In summary, the short text matching method and device based on an inverted index provide part of the advanced expressive power of regular expressions while using the inverted index to speed up text matching. They can be applied to intelligent customer service and question-answering systems to efficiently match questions entered by users and output the result. Compared with deep learning models, the method has no requirement on data volume, but it requires practitioners to manually maintain some rule tables and word lists. With this method and device, flexible and complex text matching rules can be defined while the matching process remains efficient.

Example two

This embodiment performs one short text match using the inverted-index-based short text matching method and device provided in Example one, wherein:

The predefined rule templates are as follows:

1. What is Huabei

2. How to use Huabei

3. Hello

Rule 2 is defined to have higher priority than rule 1.

The predefined phrase dictionary is as follows:

1. Huabei

2. is

3. what

4. how to use

The predefined phrase mapping table is as follows:

1. what → what (synonym)

The input text is as follows:

1. Excuse me, what is Huabei? How should Huabei be used?

The following steps are then carried out:

1) The feature extractor extracts from the input text the phrases that are in the dictionary, together with their positions (character offsets in the original Chinese input text). According to the example, the extracted features are:

1. (Huabei, 3)

2. (is, 5)

3. (what, 6)

4. (Huabei, 8)

5. (how to use, 13)

2) The extracted features are expanded by the feature expander. According to the example, the dictionary phrase "what" is mapped to its synonym (both words translate to "what" in English), so the expanded features are:

1. (Huabei, 3)

2. (is, 5)

3. (what, 6)

4. (Huabei, 8)

5. (how to use, 13)

6. (what (synonym), 6)

3) The inverted index generator builds an inverted index from the expanded features. According to the example, the generated inverted index is:

1. Huabei [3, 8]

2. is [5]

3. what [6]

4. how to use [13]

5. what (synonym) [6]

4) The generated inverted index is matched against all rule templates one by one. According to the example, the successfully matched rule templates are:

1. What is Huabei

2. How to use Huabei

5) The matching results are processed according to priority. In this embodiment rule 2 has higher priority than rule 1, so the final output is:

2. How to use Huabei

In summary, using the short text matching method and device based on the inverted index provided in this embodiment, short text matching is performed on the input "Excuse me, what is Huabei? How should Huabei be used?", with an accurate output result and high matching efficiency.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. The short text matching method based on the inverted index is characterized in that: the method comprises the following steps:

2. The inverted index-based short text matching method as claimed in claim 1, wherein: the feature extraction specifically comprises the following steps:

3. The inverted index-based short text matching method according to claim 2, characterized in that: the feature expansion specifically is:

4. The inverted index-based short text matching method according to claim 3, characterized in that: the rule matching specifically comprises:

5. The inverted index-based short text matching method as claimed in claim 4, wherein: the output result is specifically as follows:

6. A short text matching device based on an inverted index, characterized in that: the device comprises a rule template knowledge base, a feature extractor, a feature expander, an inverted index generator, a template compiler and a template matcher;

7. The inverted index-based short text matching device as claimed in claim 6, characterized by further comprising a template matching cache, which provides a caching service during template matching and improves overall matching efficiency.