CN115470332A - Intelligent question-answering system for content matching based on matching degree - Google Patents


Info

Publication number: CN115470332A
Application number: CN202211074234.1A
Authority: CN (China)
Prior art keywords: answer; matching degree; candidate; query content; determining
Legal status: Granted (an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN115470332B (en)
Inventors: 周欣, 司惠菊, 魏娟, 谢仁强, 石丽, 郭雪飞, 董江, 席楠, 翟畅, 徐静, 周露
Current Assignee: Beijing Hezhong Dingcheng Technology Co ltd; Service Center Of China Meteorological Administration
Original Assignee: Beijing Hezhong Dingcheng Technology Co ltd; Service Center Of China Meteorological Administration
Application filed by Beijing Hezhong Dingcheng Technology Co ltd and Service Center Of China Meteorological Administration
Priority: CN202211074234.1A
Publication of application: CN115470332A
Application granted; publication of grant: CN115470332B
Legal status: Active

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems (G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F16/00 Information retrieval, database and file system structures; G06F16/30 unstructured textual data; G06F16/33 Querying; G06F16/332 Query formulation)
    • G06F16/3344: Query execution using natural language analysis (G06F16/3331 Query processing; G06F16/334 Query execution)
    • G06F16/367: Ontology (G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
    • G06F40/30: Semantic analysis (G06F40/00 Handling natural language data)


Abstract

The invention discloses an intelligent question-answering system that matches content based on matching degree, together with a method and a device for such matching. The method comprises the following steps: acquiring format-processed query content; determining a candidate-paragraph matching degree between the format-processed query content and each text paragraph, and determining text paragraphs whose matching degree is greater than a first matching-degree threshold as candidate paragraphs; selecting, in each candidate paragraph, an answer segment associated with the format-processed query content, and determining an answer-segment matching degree for each answer segment; determining, based on the candidate-paragraph matching degree and the answer-segment matching degree, the matching degree between the format-processed query content and each answer segment; and selecting, based on that matching degree, at least one target sub-paragraph associated with the format-processed query content from the plurality of answer segments.

Description

Intelligent question-answering system for content matching based on matching degree
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to an intelligent question-answering system that matches content based on matching degree, and to a method and a device for such matching.
Background
A question-answering system built on knowledge-graph technology requires that the specialist knowledge of the target field be expressed as a knowledge graph, and that the user's unstructured question be converted into a structured graph-query statement. Two techniques are common: semantic parsing and path retrieval. The former parses the user's question semantically and converts it directly into a graph query, from which the answer is retrieved; the latter is better suited to complex questions, can provide a multi-hop search path for the question, and is highly interpretable. However, constructing a knowledge graph of the expert knowledge of a particular target field is itself no simple matter, so the prerequisites of such prior-art solutions are demanding and hard to meet.
Question-answer-pair detection technology first requires that all specialist knowledge of the target field be organized into question-answer pairs and stored in advance as a QA-pair library. The user's question is then answered by matching it against the questions in the library and returning the answer of the best-matching pair. The approach is simple and direct, but the quality of the answers depends entirely on the pre-stored pairs, and building the QA-pair library up front can be a very expensive undertaking.
Accordingly, there is a need in the art for an improved intelligent question-answering system.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an intelligent question-answering system based on a re-ranking reading-comprehension algorithm, which can intelligently process documents of various types in a target system.
The invention relates to a dialogue system for specialist knowledge such as rules and regulations, whose answer space is relatively closed; it thus differs from chit-chat and command-style dialogue systems. Question-answering systems with this knowledge-query character mainly rely on knowledge-graph technology, question-answer-pair detection technology, document question-answering technology, and the like.
The technical scheme provided by the invention differs from the main prior-art techniques; it chiefly involves natural-language understanding of the question and knowledge matching. The system first trains a multi-document re-ranking system. In the first step, the documents are split into paragraphs; a pre-trained BERT network encodes the paragraphs and typical answers; the BERT network is trained with a specific loss function; the document paragraphs are text-matched against typical questions; a threshold is set, and paragraph-question pairs with low matching degree are filtered out, forming candidate paragraph-question matching pairs. In the second step, another pre-trained BERT network is designed to encode the candidate paragraph-question matching pairs, and this network is trained with another, cross-entropy-based loss function to predict the start and end positions of the exact answer segment contained in each paragraph, that is, to predict from the characters of the matched paragraph the answer that exactly matches the question. This training process is completed offline in advance.
Online, the trained system ranks the candidate answers to a user question. The ranking criterion combines the results of the two steps, namely the matching degree between the user question and each candidate paragraph and the matching degree between the user question and each candidate answer; the latter is log-smoothed and then multiplied by the former, all candidate answers are ranked by the result, and the first N answers in the ranking are returned.
According to an aspect of the present invention, there is provided a method for content matching based on matching degree, the method including:
acquiring original query content input by a user, and performing format processing on the original query content to acquire the query content subjected to format processing;
determining a candidate-paragraph matching degree between the format-processed query content and each text paragraph in a plurality of text paragraphs in a text content library, and determining text paragraphs whose candidate-paragraph matching degree is greater than a first matching-degree threshold as candidate paragraphs;
selecting an answer segment associated with the format-processed query content in each candidate paragraph, and determining the matching degree of the format-processed query content and the answer segment of each answer segment;
determining the matching degree of the query content subjected to format processing and the answer segment based on the matching degree of the candidate paragraphs and the matching degree of the answer segment; and
selecting at least one target sub-paragraph associated with the formatted query content from a plurality of answer segments based on a degree of matching of the formatted query content with an answer segment.
Preferably, the formatting the original query content to obtain formatted query content includes:
acquiring a content processing rule for performing format processing on original query content;
and performing format processing on the original query content based on a content processing rule to obtain the query content subjected to format processing.
Preferably, before obtaining the original query content input by the user,
segmenting each document in the plurality of documents in the text content library according to a natural segment to obtain a plurality of natural segments;
a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
Preferably, the method also comprises the following steps of,
determining the number of characters in each text paragraph;
determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed;
and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is smaller than or equal to a character number threshold value.
Preferably, determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph in the plurality of text paragraphs in the text content library includes:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

calculating the candidate-paragraph matching degree s_j^p of the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library as the similarity of the two encodings:

s_j^p = sim(u_q, u_{p_j})

wherein 0 < j ≤ na, j is a natural number, na is the number of text paragraphs in the text content library, and sim(·,·) denotes a vector similarity (e.g. the cosine similarity) between the two semantic feature encodings.
Preferably, when determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph in the text content library, and determining text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold as candidate paragraphs, a loss function of the following margin-ranking form is involved:

L_match = Σ_{p+ ∈ Ω+} Σ_{p− ∈ Ω−} max(0, λ − s^p(query, p+) + s^p(query, p−))

wherein λ is a hyper-parameter (the margin), Ω− is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query; the loss encourages every relevant paragraph to score at least λ higher than every irrelevant one.
Preferably, after determining the text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold as candidate paragraphs, the candidate paragraphs are formed into a candidate paragraph set:

P = { p_j | s_j^p > t_1 }

wherein t_1 is the first matching-degree threshold.
preferably, the selecting an answer segment associated with the formatted query content in each candidate passage comprises:
language characterization model Bert pre-trained using Bert 2 Determining semantic feature encodings u for answer fragments associated with the format-processed query content qj
u qj =Bert 2 (concat(query,p j ))
Determining the starting position I of the answer segment in the candidate paragraph start And an end position I end
Figure BDA0003830862700000046
Figure BDA0003830862700000047
Figure BDA0003830862700000051
Figure BDA0003830862700000052
Wherein,
Figure BDA0003830862700000053
is a weight matrix of the starting position,
Figure BDA0003830862700000054
to weight matrix with end position, softmax is the activation function, P start As starting position probability, P end To end position probability, len (p) j ) Is p j The character length of (2);
based on the starting position I start And an end position I end At each candidate paragraph p j To select an answer segment associated with the formatted query content.
Preferably, in selecting the answer segment associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

wherein CE denotes the cross-entropy loss function, Label_start is the start position of the standard answer label, Label_end is the end position of the standard answer label, Label_span is the answer segment of the standard answer label from start position to end position, and α, β, γ are hyper-parameters.
Preferably, determining the answer-segment matching degree of the format-processed query content and each answer segment comprises:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{a_j} of the answer segment of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

determining the answer-segment matching degree s_j^a of the format-processed query content encoding u_q and the j-th answer segment:

s_j^a = sim(u_q, u_{a_j})

wherein a_j is the answer segment of the j-th candidate paragraph and sim(·,·) is the same vector similarity used for paragraph matching.
Preferably, the determining the matching degree of the format-processed query content and the answer segment based on the candidate-paragraph matching degree and the answer-segment matching degree includes:

performing logarithmic smoothing on the answer-segment matching degree s_j^a to obtain the smoothed matching degree f(s_j^a); and

determining, based on the candidate-paragraph matching degree s_j^p and the smoothed matching degree f(s_j^a), the matching degree s of the format-processed query content and the answer segment:

s = s_j^p · f(s_j^a)

wherein f is a logarithmic smoothing function.
Preferably, selecting at least one target sub-paragraph associated with the formatted query content from a plurality of answer segments based on a degree of matching of the formatted query content with an answer segment comprises:
sorting the answer fragments according to the descending order of the matching degree of the query contents and the answer fragments after format processing so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N answer fragments with the maximum matching degree from the sorted list;
and determining, among the N answer segments with the greatest matching degree, at least one answer segment whose matching degree is greater than a second matching-degree threshold as a target sub-paragraph.
According to another aspect of the present invention, there is provided an apparatus for content matching based on a degree of matching, the apparatus including:
the processing unit is used for acquiring original query content input by a user and carrying out format processing on the original query content to acquire the query content subjected to format processing;
a first determining unit, configured to determine a matching degree between the query content subjected to format processing and a candidate paragraph of each text paragraph in a plurality of text paragraphs in a text content library, and determine a text paragraph with the matching degree greater than a first matching degree threshold as a candidate paragraph;
a second determining unit, configured to select an answer segment associated with the query content subjected to format processing in each candidate paragraph, and determine a matching degree of the query content subjected to format processing and the answer segment of each answer segment;
a third determining unit, configured to determine, based on the candidate paragraph matching degree and the answer segment matching degree, a matching degree between the query content subjected to format processing and the answer segment; and
and the selecting unit is used for selecting at least one target sub-paragraph associated with the query content subjected to format processing from a plurality of answer segments based on the matching degree of the query content subjected to format processing and the answer segments.
According to another aspect of the present invention, there is provided a computer-readable storage medium, wherein the storage medium stores a computer program for executing the method according to any of the above embodiments.
According to another aspect of the present invention, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instructions from the memory and executing the instructions to realize the method of any one of the above embodiments.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computer program product including computer readable code, when the computer readable code runs on a device, a processor in the device executes a method for implementing any of the embodiments.
The innovation of the invention lies mainly in two points. First, answers to a given user question are screened by combining two similarity computations: the first step mainly performs text matching, and the second step extracts answer segments from the candidate paragraphs with a reading-comprehension algorithm, so the re-ranking considers text matching and reading comprehension simultaneously. Second, a distinctive loss function is adopted to train the matching degree between the question and the candidate paragraphs.
The main advantages of the invention follow from these two innovations. The re-ranking method jointly considers the text matching degree between the user question and each candidate paragraph and the similarity between the user question and the exact answer within the candidate paragraph, which improves the accuracy and stability of answer screening; and the loss function used to train the first-step question-paragraph matching network ensures that paragraphs relevant to the question are selected accurately.
Drawings
A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:
FIG. 1 is a flow chart of a method for matching content based on matching degree according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method for understanding an algorithm based on multiple document re-ranking reads, according to an embodiment of the invention;
FIG. 3 is a model diagram of a multiple document re-ranking based reading understanding algorithm according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus for matching content based on matching degree according to an embodiment of the present invention.
Detailed Description
Fig. 1 is a flowchart of a method for matching content based on matching degree according to an embodiment of the present invention. As shown in fig. 1, the method 100 includes:
step 101, obtaining original query content input by a user, and performing format processing on the original query content to obtain the query content subjected to format processing.
In one embodiment, formatting original query content to obtain formatted query content includes: acquiring a content processing rule for format processing of original query content; and performing format processing on the original query content based on the content processing rule to obtain the query content subjected to format processing.
In one embodiment, before obtaining the original query content input by the user, the method further includes segmenting each document of the plurality of documents in the text content library according to the natural segment to obtain a plurality of natural segments; a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
In one embodiment, further comprising, determining a number of characters in each text passage; determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed; and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is less than or equal to the character number threshold value.
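The character-threshold segmentation described in this embodiment can be sketched as follows. The function names and the midpoint-split strategy are illustrative assumptions; the embodiment only requires that every resulting paragraph fall at or below the character-count threshold.

```python
def split_paragraph(text, max_chars):
    """Recursively split one paragraph until every piece is <= max_chars.

    Splitting at the midpoint is an illustrative choice, not stated in the
    patent text, which only fixes the character-count threshold.
    """
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    return split_paragraph(text[:mid], max_chars) + split_paragraph(text[mid:], max_chars)


def segment_library(paragraphs, max_chars=512):
    """Apply the threshold check to every text paragraph in the library."""
    result = []
    for p in paragraphs:
        result.extend(split_paragraph(p, max_chars))
    return result
```

In practice one would split on sentence or clause boundaries rather than at the raw midpoint, so that no answer segment is cut in half.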
Step 102, determining the matching degree of the query content subjected to format processing and the candidate paragraphs of each of the plurality of text paragraphs in the text content library, and determining the text paragraphs with the matching degree greater than a first matching degree threshold value as the candidate paragraphs.
In one embodiment, determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph in the plurality of text paragraphs within the text content library includes:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

calculating the candidate-paragraph matching degree s_j^p of the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library as the similarity of the two encodings:

s_j^p = sim(u_q, u_{p_j})

wherein 0 < j ≤ na, j is a natural number, na is the number of text paragraphs in the text content library, and sim(·,·) denotes a vector similarity (e.g. the cosine similarity) between the two semantic feature encodings.
In one embodiment, when determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph within the text content library, and determining as candidate paragraphs the text paragraphs whose matching degree is greater than the first matching-degree threshold, a loss function of the following margin-ranking form is involved:

L_match = Σ_{p+ ∈ Ω+} Σ_{p− ∈ Ω−} max(0, λ − s^p(query, p+) + s^p(query, p−))

wherein λ is a hyper-parameter (the margin), Ω− is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query.
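Reading this loss as a margin-ranking objective over the relevant set Ω+ and the irrelevant set Ω− is an assumed interpretation (the patent's exact formula survives only as an image); under that assumption, a minimal sketch:

```python
def margin_ranking_loss(pos_scores, neg_scores, margin):
    """Assumed margin-ranking reading of the paragraph-matching loss:
    every relevant paragraph score should exceed every irrelevant one by at
    least `margin` (the hyper-parameter lambda); shortfalls are penalized.
    """
    loss = 0.0
    for sp in pos_scores:       # s^p over documents in Omega+
        for sn in neg_scores:   # s^p over documents in Omega-
            loss += max(0.0, margin - sp + sn)
    return loss
```

The loss is zero exactly when the margin is respected for every relevant/irrelevant pair, which matches the stated goal of accurately selecting paragraphs relevant to the question.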
In one embodiment, after determining as candidate paragraphs the text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold, the candidate paragraphs are formed into a candidate paragraph set:

P = { p_j | s_j^p > t_1 }

wherein t_1 is the first matching-degree threshold.
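The paragraph-matching and threshold-filtering steps can be sketched as follows. The cosine similarity and the plain-Python vectors are stand-ins: in the patent the encodings u_q and u_{p_j} come from the pre-trained Bert_1 model, and the exact similarity function is not spelled out in this text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def candidate_paragraphs(u_q, paragraph_encodings, first_threshold):
    """Keep (index, s_j^p) for paragraphs whose candidate-paragraph matching
    degree s_j^p = sim(u_q, u_{p_j}) exceeds the first matching-degree
    threshold, forming the candidate paragraph set."""
    scored = [(j, cosine(u_q, u_p)) for j, u_p in enumerate(paragraph_encodings)]
    return [(j, s) for j, s in scored if s > first_threshold]
```

The retained indices are exactly the candidate paragraph set P described above; the scores are reused later when the final ranking is computed.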
step 103, selecting an answer segment associated with the query content subjected to format processing in each candidate paragraph, and determining the matching degree of the query content subjected to format processing and the answer segment of each answer segment.
In one embodiment, selecting an answer segment associated with the format-processed query content in each candidate paragraph comprises:

using the pre-trained BERT language-representation model Bert_2 to determine the semantic feature encoding u_qj of the answer segment associated with the format-processed query content:

u_qj = Bert_2(concat(query, p_j))

determining the start position I_start and the end position I_end of the answer segment in the candidate paragraph:

P_start = softmax(W_start · u_qj)
P_end = softmax(W_end · u_qj)
I_start = argmax_{1 ≤ i ≤ len(p_j)} P_start(i)
I_end = argmax_{1 ≤ i ≤ len(p_j)} P_end(i)

wherein W_start is the weight matrix of the start position, W_end is the weight matrix of the end position, softmax is the activation function, P_start is the start-position probability, P_end is the end-position probability, and len(p_j) is the character length of p_j;

selecting, based on the start position I_start and the end position I_end, the answer segment associated with the format-processed query content in each candidate paragraph p_j.
In one embodiment, in selecting the answer segment associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

wherein CE denotes the cross-entropy loss function, Label_start is the start position of the standard answer label, Label_end is the end position of the standard answer label, Label_span is the answer segment of the standard answer label from start position to end position, and α, β, γ are hyper-parameters.
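The start/end prediction can be sketched as below, assuming the products W_start·u_qj and W_end·u_qj have already been reduced to one logit per character of the paragraph. Constraining the end index to lie at or after the start index is an added safeguard for well-formed spans, not something the text above states.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def extract_span(start_logits, end_logits):
    """I_start = argmax of P_start; I_end = argmax of P_end, restricted here
    to positions at or after I_start (an assumption for span validity)."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    i_start = max(range(len(p_start)), key=p_start.__getitem__)
    i_end = max(range(i_start, len(p_end)), key=p_end.__getitem__)
    return i_start, i_end
```

The answer segment a_j is then the substring of the candidate paragraph p_j from I_start to I_end inclusive.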
And step 104, determining the matching degree of the query content subjected to format processing and the answer segment based on the matching degree of the candidate paragraphs and the matching degree of the answer segment.
In one embodiment, determining the answer-segment matching degree of the format-processed query content and each answer segment comprises:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{a_j} of the answer segment of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

determining the answer-segment matching degree s_j^a of the format-processed query content encoding u_q and the j-th answer segment:

s_j^a = sim(u_q, u_{a_j})

wherein a_j is the answer segment of the j-th candidate paragraph and sim(·,·) is the same vector similarity used for paragraph matching.
In one embodiment, determining the matching degree of the format-processed query content and the answer segment based on the candidate-paragraph matching degree and the answer-segment matching degree comprises:

performing logarithmic smoothing on the answer-segment matching degree s_j^a to obtain the smoothed matching degree f(s_j^a); and

determining, based on the candidate-paragraph matching degree s_j^p and the smoothed matching degree f(s_j^a), the matching degree s of the format-processed query content and the answer segment:

s = s_j^p · f(s_j^a)

where f is a log smoothing function.
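A sketch of the combination step; log(1 + x) is used here as one plausible logarithmic smoothing function f, since the text only states that f is a log smoothing function.

```python
import math

def overall_matching_degree(s_para, s_answer, f=lambda x: math.log1p(x)):
    """Overall matching degree s = s_j^p * f(s_j^a). The default
    f(x) = log(1 + x) is an assumed instance of the log smoothing function."""
    return s_para * f(s_answer)
```

Because log(1 + x) grows slowly, the smoothing damps the influence of the answer-segment score relative to the paragraph score, so a paragraph-level mismatch cannot be fully compensated by a high answer-segment score.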
And 105, selecting at least one target sub-paragraph associated with the format-processed query content from the plurality of answer segments based on the matching degree of the format-processed query content and the answer segments.
In one embodiment, selecting at least one target sub-paragraph from the plurality of answer segments that is associated with the formatted query content based on a degree of matching of the formatted query content to the answer segments comprises:
sorting the answer fragments according to the descending order of the matching degree of the query contents and the answer fragments after format processing so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N answer fragments with the maximum matching degree from the sorted list;
and determining, among the N answer segments with the greatest matching degree, at least one answer segment whose matching degree is greater than a second matching-degree threshold as a target sub-paragraph.
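The final selection step can be sketched as follows; representing the answer segments and their overall matching degrees as a dict is an illustrative choice.

```python
def select_targets(answer_scores, n, second_threshold):
    """Sort answer segments by matching degree in descending order, keep the
    N best, then keep only those above the second matching-degree threshold.
    `answer_scores` maps each answer segment to its matching degree s."""
    ranked = sorted(answer_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_n = ranked[:n]
    return [seg for seg, score in top_n if score > second_threshold]
```

The second threshold prevents low-quality answers from being returned merely because fewer than N good answers exist.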
According to an alternative, the method comprises: step 1011, obtaining the original query content input by the user, and performing format processing on the original query content to obtain the query content subjected to format processing.
At step 1012, the matching degree of the formatted query content and the text of each text paragraph in the plurality of text paragraphs in the text content library is determined, and the text paragraphs with the matching degree greater than the first threshold matching degree are determined as candidate paragraphs.
Step 1013, selecting a result sub-paragraph associated with the format-processed query content in each candidate paragraph, and determining a result matching degree of the format-processed query content and each result sub-paragraph.
And 1014, determining the matching degree of the query content subjected to format processing and the result subsection based on the text matching degree and the result matching degree.
Step 1015, selecting at least one target sub-paragraph from the plurality of result sub-paragraphs that is associated with the formatted query content based on the matching degree between the formatted query content and the result sub-paragraphs.
The format processing of the original query content to obtain the format-processed query content includes:
acquiring a content processing rule for format processing of original query content;
and performing format processing on the original query content based on the content processing rule to obtain the query content subjected to format processing.
The method further comprises, before the original query content input by the user is obtained,
segmenting each document in a plurality of documents in a text content library according to a natural segment to obtain a plurality of natural segments;
a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
Further comprising, determining the number of characters in each text paragraph;
determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed;
and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is less than or equal to the character number threshold value.
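The splitting rule above can be sketched as follows (a hedged illustration: the source does not say how an over-long paragraph is divided, so this sketch simply bisects recursively until every piece fits under the character-number threshold):

```python
def split_paragraph(text, max_chars):
    """Recursively bisect a paragraph until every piece is at most max_chars
    characters long (the bisection rule is an assumption; the source only
    requires that every resulting paragraph fits under the threshold)."""
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    return split_paragraph(text[:mid], max_chars) + split_paragraph(text[mid:], max_chars)

pieces = split_paragraph("x" * 10, max_chars=4)
# every piece has at most 4 characters and the pieces concatenate back to the input
```

A production implementation would prefer splitting at sentence or punctuation boundaries rather than at the midpoint.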
Determining the text matching degree between the format-processed query content and each of the plurality of text paragraphs in the text content library comprises the following steps:

using a Bert pre-trained language characterization model Bert_1, determining the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

using the language characterization model Bert_1, determining the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

and calculating the text matching degree between the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library:

[matching-degree formula provided as an image in the original]

where 0 < j ≤ na, j is a natural number, and na is the number of text paragraphs in the text content library.
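Since the matching-degree formula itself is reproduced only as an image in the original, the following sketch substitutes cosine similarity between the two encodings, a common choice for BERT-based retrieval but an assumption here; the short vectors are hand-written stand-ins for Bert_1 outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two encoding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def select_candidate_paragraphs(u_q, paragraph_encodings, first_threshold):
    """Score every paragraph encoding against the query encoding and keep the
    indices (with scores) whose matching degree exceeds the first
    matching-degree threshold, i.e. the candidate paragraphs."""
    scored = [(j, cosine(u_q, u_p)) for j, u_p in enumerate(paragraph_encodings)]
    return [(j, s) for j, s in scored if s > first_threshold]

# Hand-written stand-ins for Bert_1 encodings of the query and two paragraphs.
u_q = [1.0, 0.0]
paragraphs = [[1.0, 0.1], [0.0, 1.0]]
candidates = select_candidate_paragraphs(u_q, paragraphs, first_threshold=0.5)
# only paragraph 0 survives the threshold
```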
When determining the text matching degree between the format-processed query content and each of the plurality of text paragraphs in the text content library, and determining text paragraphs whose text matching degree is greater than the first matching degree threshold as candidate paragraphs, the following loss function is involved:

[loss-function formula provided as an image in the original]

where λ is a hyper-parameter, Ω- is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query.

After the text paragraphs whose text matching degree is greater than the first matching degree threshold are determined as candidate paragraphs, the candidate paragraphs form a candidate paragraph set:

[candidate-paragraph-set expression provided as an image in the original]
Selecting, in each candidate paragraph, a result sub-paragraph associated with the format-processed query content comprises:

using a Bert pre-trained language characterization model Bert_2, determining the semantic feature encoding u_qj of the result sub-paragraph associated with the format-processed query content:

u_qj = Bert_2(concat(query, p_j))

determining the starting position I_start and the ending position I_end at which the result sub-paragraph falls within the candidate paragraph:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s is the weight matrix of the starting position, W_e is the weight matrix of the ending position, softmax is the activation function, P_start is the starting-position probability, P_end is the ending-position probability, and len(p_j) is the character length of p_j; and

based on the starting position I_start and the ending position I_end, selecting, in each candidate paragraph p_j, the result sub-paragraph associated with the format-processed query content.
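A minimal sketch of the start/end decoding step follows; the logits are hand-written stand-ins for W_s·u_qj and W_e·u_qj, and a real system would additionally constrain I_end ≥ I_start, which the source does not detail:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def extract_span(start_logits, end_logits, paragraph):
    """Turn per-character start/end logits into an answer span via
    softmax + argmax, the standard extractive-QA decoding step."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    i_start = p_start.index(max(p_start))
    i_end = p_end.index(max(p_end))
    return paragraph[i_start:i_end + 1], p_start[i_start], p_end[i_end]

paragraph = "rain tomorrow"
# Peaks at character 5 ("t") and the final character mark the intended span.
span, p_s, p_e = extract_span(
    [0.1, 0.2, 0.1, 0.1, 0.1, 3.0, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1] * 12 + [3.0],
    paragraph,
)
# span -> "tomorrow"
```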
In selecting the result sub-paragraph associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

where CE denotes the cross-entropy loss function, Label_start is the starting position of the standard answer label, Label_end is the ending position of the standard answer label, and Label_span denotes the answer segment of the standard answer label from the starting position to the ending position; α, β, and γ are hyper-parameters.
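The composite loss can be illustrated with a tiny numeric example (the probability distributions and gold labels are hypothetical; in training, P_start, P_end, and P_span would come from the model):

```python
import math

def cross_entropy(probs, gold_index):
    """Cross entropy for one example: negative log-probability of the gold index."""
    return -math.log(probs[gold_index])

def span_loss(p_start, p_end, p_span, label_start, label_end, label_span,
              alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the start-, end-, and span-level cross entropies,
    mirroring L = alpha*CE + beta*CE + gamma*CE."""
    return (alpha * cross_entropy(p_start, label_start)
            + beta * cross_entropy(p_end, label_end)
            + gamma * cross_entropy(p_span, label_span))

# Hypothetical predicted distributions and gold labels.
loss = span_loss([0.7, 0.3], [0.2, 0.8], [0.9, 0.1],
                 label_start=0, label_end=1, label_span=0)
# loss = -(ln 0.7 + ln 0.8 + ln 0.9) ≈ 0.685
```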
In one embodiment, the method further comprises filtering the candidate answers (e.g., the candidate answers or answer snippets in the candidate paragraphs). Specifically, the matching degree of each candidate answer (or its score, matching value, matching score, and the like) is determined according to the formulas above for the starting position I_start and the ending position I_end at which the result sub-paragraph falls within the candidate paragraph; the matching degree takes the average of the starting-position and ending-position probabilities, and the filtered set of relevant candidate answers meeting a specified threshold t2 is retained.
In one embodiment, the method further comprises predicting the association probability between the query content query and the candidate answers: after the answer segments of the candidate paragraphs are predicted, the association probability between the query content query and each candidate answer segment is predicted.
Semantic feature encodings are determined (e.g., using a Bert pre-trained language characterization model Bert_3, the global semantic feature encoding and the character-level semantic feature encoding are determined):

[H_cls, H_tokens] = Bert_3([query, answer])

where H_cls is the global semantic feature encoding (Global Embedding); H_tokens is the character-level semantic feature encoding (token-level Embedding); answer is a candidate answer taken from the candidate answer set; and query is the query content.
In one embodiment, the method further comprises performing a feature-hierarchy decomposition (the feature hierarchy of the model for predicting the association probability between the query content query and the candidate answers): the features are hierarchically decomposed into an intent layer, a core entity layer, and a relation layer. The core entity layer masks the character encodings of the non-core entities in the query content query and the candidate answer; the relation layer masks the character encodings of the core entities in the query content query and the candidate answer; the intent layer retains the full character encoding. On this basis, the three layers of character-string encodings are subjected to matrix transformation and, after an average pooling layer, are respectively expressed as h_i for the intent layer, h_ce for the core entity layer, and h_r for the relation layer:

[transformation formulas provided as an image in the original]
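The three-layer masking can be sketched as follows (a hedged illustration: the source does not say how the mask is realized, so zero vectors are used here, and the per-token `core_entity_flags` annotation is hypothetical):

```python
def mask_encodings(token_encodings, core_entity_flags):
    """Build the three feature layers by masking character-level encodings:
    the core entity layer masks non-core-entity tokens, the relation layer
    masks core-entity tokens, and the intent layer keeps every token.
    Zero vectors serve as the mask here (an assumption)."""
    dim = len(token_encodings[0])
    zero = [0.0] * dim
    intent = [list(enc) for enc in token_encodings]
    core = [list(enc) if flag else list(zero)
            for enc, flag in zip(token_encodings, core_entity_flags)]
    relation = [list(zero) if flag else list(enc)
                for enc, flag in zip(token_encodings, core_entity_flags)]
    return intent, core, relation

# Three toy token encodings; only the middle token is a core entity.
encodings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
flags = [False, True, False]
intent, core, relation = mask_encodings(encodings, flags)
```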
determining the probability distribution of the hierarchical features:

[probability-distribution formula provided as an image in the original]

where σ is the activation function, W_1 is a trainable parameter, and y_u is the label of the probability distribution;
determining the global feature probability distribution:

[global-feature probability-distribution formula provided as an image in the original]

where W_2 is a trainable parameter; y_g is the label of the global probability distribution; query is the query content, a query sentence, or query keywords; and answer is a candidate answer;
determining the loss function (the loss function used in predicting the association probability between the query content and the candidate answers), which includes:

determining a global loss function:

L_G = -log P(y_g | query, answer)

determining a distribution-difference loss function:

L_D = F(P(y_u | query, answer), P(y_g | query, answer)) + F(P(y_g | query, answer), P(y_u | query, answer))

where

[the definition of F is provided as an image in the original]

and determining the overall loss function:

L = L_G + λ·L_D

where λ is a hyper-parameter.
Prediction of the association probability:

[association-probability formula provided as an image in the original]

Re-ranking to obtain the answers (the answers matching or corresponding to the query content query; for example, selecting the topN answers from the candidate answers): all answer candidate sets are re-ranked according to the matching degree between the query content query and the candidate answers (for example, the association probability predicted above), and the topN answers are selected as the final result, where N is a natural number (i.e., the N answers with the largest matching degree are selected).
In one embodiment, determining the result matching degree between the format-processed query content and each result sub-paragraph comprises:

using the Bert pre-trained language characterization model Bert_1, determining the semantic feature encoding u_{a_j} of the result sub-paragraph of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

and determining the result matching degree between the format-processed query content u_q and the j-th result sub-paragraph:

[result-matching-degree formula provided as an image in the original]

where a_j is the result sub-paragraph of the j-th candidate paragraph.
Determining the matching degree between the format-processed query content and the result sub-paragraphs based on the text matching degree and the result matching degree comprises:

performing logarithmic smoothing on the result matching degree to obtain a smoothed matching degree; and

determining, based on the text matching degree and the smoothed matching degree, the matching degree s between the format-processed query content and the result sub-paragraph:

[the smoothing and combination formulas are provided as images in the original]

where f is a log smoothing function.
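The combination step can be illustrated as below; since the actual formulas appear only as images in the original, both the smoothing function f(x) = log(1 + x) and the multiplicative combination are assumptions chosen to match the stated structure (a text matching degree combined with a log-smoothed result matching degree):

```python
import math

def combined_matching_degree(text_degree, result_degree):
    """Log-smooth the result matching degree with f(x) = log(1 + x) and
    weight it by the text matching degree. Both choices are assumptions;
    the source states only that a log smoothing function f is applied
    before the two degrees are combined into s."""
    return text_degree * math.log1p(result_degree)

# Rank two hypothetical result sub-paragraphs: (name, text degree, result degree).
candidates = [("answer A", 0.9, 0.7), ("answer B", 0.6, 0.95)]
ranked = sorted(candidates, key=lambda t: combined_matching_degree(t[1], t[2]),
                reverse=True)
```

Log smoothing dampens very large result matching degrees so that a strong paragraph-level match is not overwhelmed by a single confident span score.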
Selecting at least one target sub-paragraph from the plurality of result sub-paragraphs that is associated with the formatted query content based on a degree of matching of the formatted query content to the result sub-paragraph, including:
sorting the result sub-paragraphs in descending order of their matching degree with the format-processed query content, so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N result subsections with the maximum matching degree from the sorted list;
and determining at least one result sub-paragraph of the N result sub-paragraphs with the maximum matching degree, wherein the matching degree of the at least one result sub-paragraph is greater than a second matching degree threshold value, as the target sub-paragraph.
FIG. 2 is a flow diagram of a reading comprehension algorithm based on multi-document re-ranking according to an embodiment of the present invention.
Typically, document preprocessing is performed first: the natural segments of all candidate documents are subjected to a preliminary segmentation. Since the headings contain clear service information and are highly relevant to the question, the multi-level headings and each segment of content are concatenated with special symbols; if the resulting paragraph does not exceed the preset maximum length, it is taken as the preprocessing result; otherwise, it is split further. Finally, a paragraph candidate set over the multiple documents is obtained:

[paragraph-candidate-set expression provided as an image in the original]
Then, the relevancy matching between the query and the candidate paragraphs is determined. As shown in FIG. 2, the query and paragraph_1, paragraph_2, …, paragraph_N are encoded at the semantic coding layer. For example, the semantic feature encodings are determined: the query and all candidate paragraphs are semantically encoded with a Bert pre-trained language characterization model, and the encoding results are expressed as:

u_q = Bert_1(query)

u_{p_j} = Bert_1(p_j)
Determining the matching-degree prediction for the query and the candidate paragraphs:

[matching-degree formula provided as an image in the original]
Determining the loss function:

[loss-function formula provided as an image in the original]

where λ denotes a hyper-parameter; Ω- denotes the set of candidate documents irrelevant to the query; and Ω+ denotes the set of candidate documents relevant to the query.
Filtering the candidate paragraphs: the candidate paragraphs are scored with the matching-degree formula above (e.g., a matching value or matching score is determined for each candidate paragraph), and the filtered set of relevant candidate paragraphs satisfying threshold t1 is retained, expressed as:

[filtered candidate-paragraph-set expression provided as an image in the original]
Next, at the semantic matching and answer extraction layer, the answer segments of the candidate paragraphs are predicted. As shown in FIG. 2, matching pairs of the query with paragraph_1, paragraph_2, …, paragraph_N are constructed; then matching pairs of the answer with paragraph_1, paragraph_2, …, paragraph_N are constructed; and finally the matching degrees of the query with the paragraph-answer matching pairs are determined.
Semantic feature coding: semantic coding is carried out on the query and the related candidate paragraphs, a Bert pre-trained language representation model is still adopted, and the coding result is expressed as:
u qj =Bert 2 (concat(query,p j ))
Answer start and end position prediction:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s and W_e are the weight matrices of the start position and the end position, respectively.
Loss function:
L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)
where CE represents the cross entropy loss function, label is the standard answer Label, and span represents the segment from the start to the end position.
At the re-ranking layer, the optimal answer is obtained by re-ranking. Predicting the matching degree between the query and the answer:

u_{a_j} = Bert_1(a_j)

[matching-degree formula provided as an image in the original]

where a_j is the answer fragment of candidate paragraph j predicted in step 2.
Combining the matching degree between the query and the candidate paragraphs with the matching degree between the query and the answer, the final answer is predicted:

[combination formula provided as an image in the original]

where f is a log smoothing function.
At the answer prediction layer, the answers in the candidate paragraphs are sorted by s, and the topN answers whose s is greater than threshold t2 are returned.
Fig. 3 is a model diagram of the reading comprehension algorithm based on multi-document re-ranking according to an embodiment of the present invention. During operation, the model based on the multi-document re-ranking reading comprehension algorithm implements the following:
step one, preprocessing a document.
First, a preliminary segmentation is performed on the natural segments of all candidate documents. Since the headings contain clear service information and are highly relevant to the question, the multi-level headings and each segment of content are concatenated with special symbols; if the resulting paragraph does not exceed the preset maximum length, it is taken as the preprocessing result; otherwise, it is split further. Finally, a paragraph candidate set over the multiple documents is obtained:

[paragraph-candidate-set expression provided as an image in the original]
and (II) building a model and determining an answer corresponding to the query content by using the model.
1. Determining the association degree matching of the query and the candidate paragraphs:
(1.1) Performing semantic feature encoding: the query and all candidate paragraphs are semantically encoded with a Bert pre-trained language characterization model, and the encoding results are expressed as:

u_q = Bert_1(query)

u_{p_j} = Bert_1(p_j)
(1.2) Determining the matching-degree prediction for the query and the candidate paragraphs:

[matching-degree formula provided as an image in the original]
(1.3) Determining the loss function:

[loss-function formula provided as an image in the original]

where λ denotes a hyper-parameter; Ω- denotes the set of candidate documents irrelevant to the query; and Ω+ denotes the set of candidate documents relevant to the query.
(1.4) Filtering the candidate paragraphs: the matching degree of each candidate paragraph is determined according to the formula of step (1.2) (for example, by scoring the candidate paragraphs to identify the matching degree), and the filtered set of relevant candidate paragraphs satisfying threshold t1 is retained, denoted as:

[filtered candidate-paragraph-set expression provided as an image in the original]
2. Predicting answer segments of candidate paragraphs:
and (2.1) carrying out semantic feature coding: semantic coding is carried out on the query and the related candidate paragraphs, a Bert pre-trained language characterization model is still adopted, and the coding result is expressed as follows:
u qj =Bert 2 (concat(query,p j ))
(2.2) Predicting the start position and the end position of the answer:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s and W_e are the weight matrices of the start position and the end position, respectively.
(2.3) determining a loss function:
L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)
where CE represents the cross entropy loss function, label is the standard answer Label, and span represents the segment from the start to the end position.
(2.4) Filtering the candidate answers: the score of each candidate answer (e.g., a matching-degree score) is determined according to step (2.2), the score being the average of the starting-position probability and the ending-position probability, and the filtered set of relevant candidate answers satisfying threshold t2 is retained.
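Step (2.4) can be sketched directly (the candidate tuples and the threshold value are hypothetical; the score is the stated average of the start- and end-position probabilities):

```python
def filter_candidate_answers(candidates, t2):
    """Score each candidate answer as the average of its start-position and
    end-position probabilities, and keep those meeting threshold t2."""
    kept = []
    for answer, p_start, p_end in candidates:
        score = (p_start + p_end) / 2.0
        if score >= t2:
            kept.append((answer, score))
    return kept

# Hypothetical candidate answers with their start/end position probabilities.
candidates = [("heavy rain", 0.9, 0.7), ("light snow", 0.3, 0.2)]
kept = filter_candidate_answers(candidates, t2=0.5)
# only "heavy rain" (score ≈ 0.8) survives
```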
3. Predicting the association probability between the query and the candidate answers (e.g., after the answer segments of the candidate paragraphs are predicted, the association probability between the query and each candidate answer segment is predicted):
(3.1) semantic feature coding:
[H_cls, H_tokens] = Bert_3([query, answer])

where H_cls is the global semantic feature encoding (Global Embedding); H_tokens is the character-level semantic feature encoding (token-level Embedding); answer is a candidate answer taken from the candidate answer set; and query is the query content.
(3.2) Feature-hierarchy decomposition (e.g., the feature hierarchy of the model for predicting the association probability between the query and the candidate answers): the features are hierarchically decomposed into an intent layer, a core entity layer, and a relation layer. The core entity layer masks the character encodings of the non-core entities in the query and the answer; the relation layer masks the character encodings of the core entities in the query and the answer; the intent layer retains the full character encoding. On this basis, the three layers of character-string encodings are subjected to matrix transformation and, after an average pooling layer, are respectively expressed as h_i for the intent layer, h_ce for the core entity layer, and h_r for the relation layer:

[transformation formulas provided as an image in the original]
(3.3) Joint probability distribution of the hierarchical features:

[joint probability-distribution formula provided as an image in the original]

where σ is the activation function; W_1 is a trainable parameter; and y_u is the label of the joint probability distribution.
(3.4) Global feature probability distribution:

[global-feature probability-distribution formula provided as an image in the original]

where W_2 is a trainable parameter and y_g is the label of the global probability distribution.
(3.5) determining a loss function (the loss function in predicting the association probability of the query with the candidate answer):
(3.5.1) Global loss function:

L_G = -log P(y_g | query, answer)

(3.5.2) Distribution-difference loss function:

L_D = F(P(y_u | query, answer), P(y_g | query, answer)) + F(P(y_g | query, answer), P(y_u | query, answer))

where

[the definition of F is provided as an image in the original]
(3.6) Joint loss function:

L = L_G + λ·L_D

where λ is a hyper-parameter.
(3.7) Association probability prediction:

[association-probability formula provided as an image in the original]

Re-ranking to obtain the answers: all answer candidate sets are re-ranked according to the matching degree between the query and the answers determined in step (3.7), and the topN answers are selected as the final result, where N is a natural number, such as 5 or 10.
Fig. 4 is a schematic structural diagram of an apparatus for matching content based on matching degree according to an embodiment of the present invention. The apparatus 400 comprises: a processing unit 401, a first determining unit 402, a second determining unit 403, a third determining unit 404, and a selecting unit 405.
The processing unit 401 is configured to obtain an original query content input by a user, and perform format processing on the original query content to obtain a query content subjected to format processing. The processing unit 401 is specifically configured to obtain a content processing rule for performing format processing on original query content; and performing format processing on the original query content based on the content processing rule to obtain the query content subjected to format processing.
The apparatus also comprises a preprocessing unit, configured to segment each document in the plurality of documents in the text content library according to natural segments to obtain a plurality of natural segments; determine a plurality of levels of headings in each document; and form each level of heading and at least one natural segment associated with the heading into a text paragraph. The preprocessing unit is further configured to determine the number of characters in each text paragraph; determine text paragraphs whose number of characters is greater than the character-number threshold as text paragraphs to be processed; and split the text paragraphs to be processed until the number of characters of any text paragraph obtained by splitting is less than or equal to the character-number threshold.
A first determining unit 402, configured to determine the candidate-paragraph matching degree between the format-processed query content and each of a plurality of text paragraphs in the text content library, and determine a text paragraph whose matching degree is greater than a first matching degree threshold as a candidate paragraph.
The first determination unit 402 is specifically configured to: use a Bert pre-trained language characterization model Bert_1 to determine the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

use the language characterization model Bert_1 to determine the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

and calculate the candidate-paragraph matching degree between the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library:

[matching-degree formula provided as an image in the original]

where 0 < j ≤ na, j is a natural number, and na is the number of text paragraphs in the text content library.
The first determining unit 402 is specifically configured such that, when the candidate-paragraph matching degree between the format-processed query content and each of the plurality of text paragraphs in the text content library is determined and text paragraphs whose candidate-paragraph matching degree is greater than the first matching degree threshold are determined as candidate paragraphs, the following loss function is involved:

[loss-function formula provided as an image in the original]

where λ is a hyper-parameter, Ω- is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query.

The first determining unit 402 is further configured to, after determining text paragraphs whose candidate-paragraph matching degree is greater than the first matching degree threshold as candidate paragraphs, form the candidate paragraphs into a candidate paragraph set:

[candidate-paragraph-set expression provided as an image in the original]
A second determining unit 403, configured to select, in each candidate paragraph, an answer segment associated with the format-processed query content, and determine the answer-segment matching degree between the format-processed query content and each answer segment.
The second determination unit 403 is specifically configured to: use a Bert pre-trained language characterization model Bert_2 to determine the semantic feature encoding u_qj of the answer segment associated with the format-processed query content:

u_qj = Bert_2(concat(query, p_j))

determine the starting position I_start and the ending position I_end of the answer segment within the candidate paragraph:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s is the weight matrix of the starting position, W_e is the weight matrix of the ending position, softmax is the activation function, P_start is the starting-position probability, P_end is the ending-position probability, and len(p_j) is the character length of p_j; and

based on the starting position I_start and the ending position I_end, select, in each candidate paragraph p_j, the answer segment associated with the format-processed query content.
In selecting the answer segment associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

where CE denotes the cross-entropy loss function, Label_start is the starting position of the standard answer label, Label_end is the ending position of the standard answer label, and Label_span denotes the answer segment of the standard answer label from the starting position to the ending position; α, β, and γ are hyper-parameters.
A third determining unit 404, configured to determine the matching degree between the format-processed query content and the answer segments based on the candidate-paragraph matching degree and the answer-segment matching degree.
The third determining unit 404 is specifically configured to:
use the Bert pre-trained language characterization model Bert_1 to determine the semantic feature encoding u_{a_j} of the answer segment of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

and determine the answer-segment matching degree between the format-processed query content u_q and the j-th answer segment:

[answer-segment matching-degree formula provided as an image in the original]

where a_j is the answer segment of the j-th candidate paragraph.
The third determining unit 404 is specifically configured to: perform logarithmic smoothing on the answer-segment matching degree to obtain a smoothed matching degree; and determine, based on the candidate-paragraph matching degree and the smoothed matching degree, the matching degree s between the format-processed query content and the answer fragment:

[the smoothing and combination formulas are provided as images in the original]

where f is a log smoothing function.
A selecting unit 405, configured to select at least one target sub-paragraph associated with the format-processed query content from the multiple answer segments based on a matching degree between the format-processed query content and the answer segments.
The selecting unit 405 is specifically configured to sort the answer fragments in descending order of their matching degree with the format-processed query content, so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N answer segments with the maximum matching degree from the sorted list;
and determining at least one answer segment with the matching degree larger than a second matching degree threshold value in the N answer segments with the maximum matching degree as a target subsection.

Claims (15)

1. A method for content matching based on a degree of matching, the method comprising:
acquiring original query content input by a user, and performing format processing on the original query content to acquire the query content subjected to format processing;
determining the candidate-paragraph matching degree between the format-processed query content and each of a plurality of text paragraphs in a text content library, and determining a text paragraph whose candidate-paragraph matching degree is greater than a first matching degree threshold as a candidate paragraph;
selecting an answer segment associated with the format-processed query content in each candidate paragraph, and determining the matching degree of the format-processed query content and the answer segment of each answer segment;
determining the matching degree of the query content subjected to format processing and the answer segment based on the matching degree of the candidate paragraphs and the matching degree of the answer segment; and
selecting at least one target sub-paragraph associated with the formatted query content from a plurality of answer segments based on a degree of matching of the formatted query content to an answer segment.
2. The method of claim 1, the formatting the original query content to obtain formatted query content, comprising:
acquiring a content processing rule for performing format processing on original query content;
and performing format processing on the original query content based on a content processing rule to obtain the query content subjected to format processing.
3. The method of claim 1, further comprising, prior to obtaining original query content entered by a user,
segmenting each document in the plurality of documents in the text content library according to a natural segment to obtain a plurality of natural segments;
a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
4. The method of claim 3, further comprising,
determining the number of characters in each text paragraph;
determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed;
and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is smaller than or equal to a character number threshold value.
5. The method of claim 1, wherein determining the matching degree between the formatted query content and a candidate paragraph for each of a plurality of text paragraphs in the text content library comprises:
determining a semantic feature encoding u_q of the formatted query content query using the pre-trained Bert language representation model Bert_1:
u_q = Bert_1(query)
determining a semantic feature encoding u_pj of each text paragraph p_j using the pre-trained language representation model Bert_1:
u_pj = Bert_1(p_j)
and calculating the candidate-paragraph matching degree s_pj between the formatted query content and the j-th text paragraph of the plurality of text paragraphs in the text content library as the similarity of the two semantic feature encodings:
s_pj = sim(u_q, u_pj)
where 0 < j ≤ na, j is a natural number, and na is the number of text paragraphs in the text content library.
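A minimal sketch of the retrieval step in claim 5, with a toy bag-of-characters encoder standing in for Bert_1 and cosine similarity as an assumed form of the matching degree (the patent's exact similarity formula is not given here):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Similarity between two semantic feature encodings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def encode(text: str) -> list[float]:
    """Toy stand-in for Bert_1: bag-of-characters counts."""
    vocab = "abcdefghijklmnopqrstuvwxyz "
    return [float(text.lower().count(c)) for c in vocab]

u_q = encode("weather forecast for tomorrow")
paragraphs = ["tomorrow's weather forecast is sunny",
              "quarterly stock prices fell today"]
# one candidate-paragraph matching degree per text paragraph
scores = [cosine(u_q, encode(p)) for p in paragraphs]
```

In a real system, paragraphs whose score exceeds the first matching-degree threshold become the candidate paragraphs.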
6. The method of claim 5, wherein determining the text paragraphs whose candidate-paragraph matching degree is greater than a first matching-degree threshold as candidate paragraphs involves a training loss function, wherein λ is a hyper-parameter, Ω− is the set of documents irrelevant to the formatted query content query, and Ω+ is the set of documents relevant to the formatted query content query.
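The loss formula itself appears only as an image in the source; a common temperature-scaled contrastive loss over a relevant set Ω+ and an irrelevant set Ω− (an assumption on our part, not necessarily the patented form) looks like:

```python
import math

def contrastive_loss(sim_pos: float, sims_neg: list[float], lam: float = 1.0) -> float:
    """-log( e^{lam*s+} / (e^{lam*s+} + sum over negatives of e^{lam*s-}) )."""
    numerator = math.exp(lam * sim_pos)
    denominator = numerator + sum(math.exp(lam * s) for s in sims_neg)
    return -math.log(numerator / denominator)
```

Raising the similarity of the relevant paragraph lowers the loss, which is the behavior the claim's Ω+/Ω− split implies.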
7. The method of claim 5, further comprising, after determining the text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold as candidate paragraphs, forming the candidate paragraphs into a candidate paragraph set.
8. The method of claim 1, wherein selecting, in each candidate paragraph, an answer segment associated with the formatted query content comprises:
determining a semantic feature encoding u_qj of the answer segment associated with the formatted query content using the pre-trained Bert language representation model Bert_2:
u_qj = Bert_2(concat(query, p_j))
determining the starting position I_start and the ending position I_end of the answer segment in the candidate paragraph:
P_start = softmax(u_qj · W_start)
P_end = softmax(u_qj · W_end)
I_start = argmax(P_start)
I_end = argmax(P_end)
where W_start is the weight matrix of the starting position, W_end is the weight matrix of the ending position, softmax is the activation function, P_start is the starting-position probability, P_end is the ending-position probability, and len(p_j) is the character length of p_j; and
selecting, based on the starting position I_start and the ending position I_end, the answer segment associated with the formatted query content in each candidate paragraph p_j.
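The span selection in claim 8 can be sketched as below. Picking the jointly most probable (start, end) pair subject to start ≤ end is a common refinement of independent per-position argmax, and is an assumption here rather than the patented rule:

```python
def extract_span(p_start: list[float], p_end: list[float]) -> tuple[int, int]:
    """Return (I_start, I_end) maximizing P_start[i] * P_end[j] subject to i <= j."""
    best_score, best_span = -1.0, (0, 0)
    for i, ps in enumerate(p_start):
        for j in range(i, len(p_end)):
            score = ps * p_end[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span

# toy position probabilities over a 4-token candidate paragraph
p_start = [0.1, 0.7, 0.1, 0.1]
p_end = [0.05, 0.1, 0.8, 0.05]
```

The answer segment is then the slice of the candidate paragraph between the returned positions.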
9. The method of claim 1, wherein selecting, in each candidate paragraph, the answer segment associated with the formatted query content involves the following loss function:
L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)
where CE denotes the cross-entropy loss function, Label_start is the starting position of the standard answer label, Label_end is the ending position of the standard answer label, Label_span denotes the answer segment of the standard answer label from the starting position to the ending position, and α, β and γ are hyper-parameters.
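A sketch of the weighted cross-entropy objective in claim 9; the discrete probability lists and label indices below are illustrative:

```python
import math

def cross_entropy(probs: list[float], label: int) -> float:
    """CE between a predicted position distribution and a one-hot standard label."""
    return -math.log(probs[label])

def answer_loss(p_start, p_end, p_span, labels, alpha=1.0, beta=1.0, gamma=1.0):
    """L = alpha*CE(P_start, Label_start) + beta*CE(P_end, Label_end) + gamma*CE(P_span, Label_span)."""
    return (alpha * cross_entropy(p_start, labels["start"])
            + beta * cross_entropy(p_end, labels["end"])
            + gamma * cross_entropy(p_span, labels["span"]))
```

Predictions concentrated on the standard answer label yield a lower loss than predictions concentrated elsewhere.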
10. The method of claim 1, wherein determining the answer-segment matching degree between the formatted query content and each answer segment comprises:
determining a semantic feature encoding u_aj of the answer segment of the j-th candidate paragraph using the pre-trained Bert language representation model Bert_1:
u_aj = Bert_1(a_j)
and determining the answer-segment matching degree s_aj between the formatted query content encoding u_q and the j-th answer segment as the similarity of the two semantic feature encodings:
s_aj = sim(u_q, u_aj)
where a_j is the answer segment of the j-th candidate paragraph.
11. The method of claim 10, wherein determining the matching degree between the formatted query content and the answer segment based on the candidate-paragraph matching degree and the answer-segment matching degree comprises:
performing logarithmic smoothing on the answer-segment matching degree to obtain a smoothed matching degree; and
determining the matching degree s between the formatted query content and the answer segment based on the candidate-paragraph matching degree and the smoothed matching degree,
where the smoothing is performed by a logarithmic smoothing function f.
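The combining formula in claim 11 appears only as an image in the source. One plausible reading, with f(x) = log(1 + x) as the logarithmic smoothing and an additive combination (both assumptions), is:

```python
import math

def smooth(s_answer: float) -> float:
    """Assumed logarithmic smoothing f: f(x) = log(1 + x)."""
    return math.log1p(s_answer)

def combined_matching_degree(s_paragraph: float, s_answer: float) -> float:
    """Assumed additive combination of the candidate-paragraph matching degree
    and the smoothed answer-segment matching degree."""
    return s_paragraph + smooth(s_answer)
```

The smoothing compresses large answer-segment scores so that neither component dominates the final matching degree s.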
12. The method of claim 11, wherein selecting, from a plurality of answer segments, at least one target sub-paragraph associated with the formatted query content based on the matching degree between the formatted query content and the answer segments comprises:
sorting the answer segments in descending order of the matching degree between the formatted query content and the answer segments to generate a sorted list;
acquiring a preset extraction parameter N and selecting, from the sorted list, the N answer segments with the largest matching degrees; and
determining, among the N answer segments with the largest matching degrees, at least one answer segment whose matching degree is greater than a second matching-degree threshold as a target sub-paragraph.
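The selection in claim 12 reduces to a sort, a top-N cutoff, and a threshold filter; the segment names and scores below are illustrative:

```python
def select_target_subparagraphs(scored_segments, n, threshold):
    """Sort (segment, score) pairs by score descending, keep the top n,
    then keep only those above the second matching-degree threshold."""
    ranked = sorted(scored_segments, key=lambda pair: pair[1], reverse=True)
    return [segment for segment, score in ranked[:n] if score > threshold]

segments = [("a1", 0.42), ("a2", 0.91), ("a3", 0.67), ("a4", 0.15)]
targets = select_target_subparagraphs(segments, n=3, threshold=0.5)
```

Here the top three by score are a2, a3 and a1, and the threshold then drops a1, leaving a2 and a3 as target sub-paragraphs.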
13. An apparatus for content matching based on a matching degree, the apparatus comprising:
a processing unit configured to acquire original query content input by a user and perform format processing on the original query content to obtain formatted query content;
a first determining unit configured to determine a matching degree between the formatted query content and each of a plurality of text paragraphs in a text content library, and to determine text paragraphs whose matching degree is greater than a first matching-degree threshold as candidate paragraphs;
a second determining unit configured to select, in each candidate paragraph, an answer segment associated with the formatted query content, and to determine an answer-segment matching degree between the formatted query content and each answer segment;
a third determining unit configured to determine, based on the candidate-paragraph matching degree and the answer-segment matching degree, a matching degree between the formatted query content and the answer segment; and
a selecting unit configured to select, from a plurality of answer segments, at least one target sub-paragraph associated with the formatted query content based on the matching degree between the formatted query content and the answer segments.
14. A computer-readable storage medium, characterized in that the storage medium stores a computer program for performing the method of any of claims 1-12.
15. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-12.
CN202211074234.1A 2022-09-02 2022-09-02 Intelligent question-answering system for content matching based on matching degree Active CN115470332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211074234.1A CN115470332B (en) 2022-09-02 2022-09-02 Intelligent question-answering system for content matching based on matching degree

Publications (2)

Publication Number Publication Date
CN115470332A true CN115470332A (en) 2022-12-13
CN115470332B CN115470332B (en) 2023-03-31

Family

ID=84368655


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200210489A1 (en) * 2018-12-27 2020-07-02 International Business Machines Corporation Extended query performance prediction framework utilizing passage-level information
CN111460089A (en) * 2020-02-18 2020-07-28 北京邮电大学 Multi-paragraph reading understanding candidate answer sorting method and device
CN112163079A (en) * 2020-09-30 2021-01-01 民生科技有限责任公司 Intelligent conversation method and system based on reading understanding model
CN112417105A (en) * 2020-10-16 2021-02-26 泰康保险集团股份有限公司 Question and answer processing method and device, storage medium and electronic equipment
CN113449754A (en) * 2020-03-26 2021-09-28 百度在线网络技术(北京)有限公司 Method, device, equipment and medium for training and displaying matching model of label

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YOHAN KIM et al.: "Question answering method for infrastructure damage information retrieval from textual data using bidirectional encoder representations from transformers", Elsevier *
HUANG Yong: "Structural function recognition of academic texts: paragraph-based recognition", Journal of the China Society for Scientific and Technical Information *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant