CN115470332A - Intelligent question-answering system for content matching based on matching degree - Google Patents


Info

Publication number: CN115470332A
Application number: CN202211074234.1A
Authority: CN (China)
Prior art keywords: answer; matching degree; candidate; query content; determining
Legal status: Granted (an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN115470332B (en)
Inventors: 周欣, 司惠菊, 魏娟, 谢仁强, 石丽, 郭雪飞, 董江, 席楠, 翟畅, 徐静, 周露
Current Assignee: Beijing Hezhong Dingcheng Technology Co ltd; Service Center Of China Meteorological Administration
Original Assignee: Beijing Hezhong Dingcheng Technology Co ltd; Service Center Of China Meteorological Administration
Application filed by Beijing Hezhong Dingcheng Technology Co ltd and Service Center Of China Meteorological Administration
Priority: CN202211074234.1A
Publication of application: CN115470332A
Application granted; publication of grant: CN115470332B
Legal status: Active

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems (G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F16/00 Information retrieval, database and file system structures; G06F16/30 unstructured textual data; G06F16/33 Querying; G06F16/332 Query formulation)
    • G06F16/3344: Query execution using natural language analysis (G06F16/3331 Query processing; G06F16/334 Query execution)
    • G06F16/367: Ontology (G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
    • G06F40/30: Semantic analysis (G06F40/00 Handling natural language data)


Abstract

The invention discloses an intelligent question-answering system that matches content based on matching degree, together with a method and a device for such matching. The method comprises the following steps: acquiring format-processed query content; determining a candidate-paragraph matching degree between the format-processed query content and each text paragraph, and determining text paragraphs whose matching degree is greater than a first matching-degree threshold as candidate paragraphs; selecting, in each candidate paragraph, an answer segment associated with the format-processed query content, and determining an answer-segment matching degree for each answer segment; determining, based on the candidate-paragraph matching degree and the answer-segment matching degree, the matching degree between the format-processed query content and each answer segment; and selecting, based on that matching degree, at least one target sub-paragraph associated with the format-processed query content from the plurality of answer segments.

Description

Intelligent question-answering system for content matching based on matching degree
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to an intelligent question-answering system that matches content based on matching degree, and to a method and a device for such matching.
Background
A question-answering system built on knowledge-graph technology requires that the specialist knowledge of the target field be expressed as a knowledge graph, and that the user's unstructured question be converted into a structured graph-query statement. Two techniques are common: semantic parsing and path retrieval. The former parses the user's question semantically and converts it directly into a graph query, from which the answer is retrieved; the latter is better suited to complex questions, can provide a multi-hop search path for the question, and is highly interpretable. However, constructing a knowledge graph of the expert knowledge of a particular target field is itself no simple matter, so the prerequisites of such prior-art solutions are demanding and hard to meet.
Question-answer-pair detection technology first requires that all specialist knowledge of the target field be organized into question-answer pairs and stored in advance as a QA-pair library. The user's question is then answered by matching it against the questions in the library and returning the answer of the best-matching pair. The approach is simple and direct, but the quality of the answers depends entirely on the pre-stored pairs, and building the QA-pair library up front can be a very expensive undertaking.
Accordingly, there is a need in the art for an improved intelligent question-answering system.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an intelligent question-answering system based on a re-ranking reading-comprehension algorithm, which can intelligently process documents of various types in a target system.
The invention relates to a dialogue system for specialist knowledge such as rules and regulations, whose answer space is relatively closed; it thus differs from chit-chat and command-style dialogue systems. Question-answering systems with this knowledge-query character mainly rely on knowledge-graph technology, question-answer-pair detection technology, document question-answering technology, and the like.
The technical scheme provided by the invention differs from the main prior-art techniques; it chiefly involves natural-language understanding of the question and knowledge matching. The system first trains a multi-document re-ranking system. In the first step, the documents are split into paragraphs; a pre-trained BERT network encodes the paragraphs and typical answers; the BERT network is trained with a specific loss function; the document paragraphs are text-matched against typical questions; a threshold is set, and paragraph-question pairs with low matching degree are filtered out, forming candidate paragraph-question matching pairs. In the second step, another pre-trained BERT network is designed to encode the candidate paragraph-question matching pairs, and this network is trained with another, cross-entropy-based loss function to predict the start and end positions of the exact answer segment contained in each paragraph, that is, to predict from the characters of the matched paragraph the answer that exactly matches the question. This training process is completed offline in advance.
Online, the trained system ranks the candidate answers to a user question. The ranking criterion combines the results of the two steps, namely the matching degree between the user question and each candidate paragraph and the matching degree between the user question and each candidate answer; the latter is log-smoothed and then multiplied by the former, all candidate answers are ranked by the result, and the first N answers in the ranking are returned.
According to an aspect of the present invention, there is provided a method for content matching based on matching degree, the method including:
acquiring original query content input by a user, and performing format processing on the original query content to acquire the query content subjected to format processing;
determining a candidate-paragraph matching degree between the format-processed query content and each text paragraph in a plurality of text paragraphs in a text content library, and determining text paragraphs whose candidate-paragraph matching degree is greater than a first matching-degree threshold as candidate paragraphs;
selecting an answer segment associated with the format-processed query content in each candidate paragraph, and determining the matching degree of the format-processed query content and the answer segment of each answer segment;
determining the matching degree of the query content subjected to format processing and the answer segment based on the matching degree of the candidate paragraphs and the matching degree of the answer segment; and
selecting at least one target sub-paragraph associated with the formatted query content from a plurality of answer segments based on a degree of matching of the formatted query content with an answer segment.
Preferably, the formatting the original query content to obtain formatted query content includes:
acquiring a content processing rule for performing format processing on original query content;
and performing format processing on the original query content based on a content processing rule to obtain the query content subjected to format processing.
Preferably, before obtaining the original query content input by the user,
segmenting each document in the plurality of documents in the text content library according to a natural segment to obtain a plurality of natural segments;
a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
Preferably, the method also comprises the following steps of,
determining the number of characters in each text paragraph;
determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed;
and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is smaller than or equal to a character number threshold value.
Preferably, determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph in the plurality of text paragraphs in the text content library includes:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

calculating the candidate-paragraph matching degree s_j^p of the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library as the similarity of the two encodings:

s_j^p = sim(u_q, u_{p_j})

wherein 0 < j ≤ na, j is a natural number, na is the number of text paragraphs in the text content library, and sim(·,·) denotes a vector similarity (e.g. the cosine similarity) between the two semantic feature encodings.
Preferably, when determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph in the text content library, and determining text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold as candidate paragraphs, a loss function of the following margin-ranking form is involved:

L_match = Σ_{p+ ∈ Ω+} Σ_{p− ∈ Ω−} max(0, λ − s^p(query, p+) + s^p(query, p−))

wherein λ is a hyper-parameter (the margin), Ω− is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query; the loss encourages every relevant paragraph to score at least λ higher than every irrelevant one.
Preferably, after determining the text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold as candidate paragraphs, the candidate paragraphs are formed into a candidate paragraph set:

P = { p_j | s_j^p > t_1 }

wherein t_1 is the first matching-degree threshold.
preferably, the selecting an answer segment associated with the formatted query content in each candidate passage comprises:
language characterization model Bert pre-trained using Bert 2 Determining semantic feature encodings u for answer fragments associated with the format-processed query content qj
u qj =Bert 2 (concat(query,p j ))
Determining the starting position I of the answer segment in the candidate paragraph start And an end position I end
Figure BDA0003830862700000046
Figure BDA0003830862700000047
Figure BDA0003830862700000051
Figure BDA0003830862700000052
Wherein,
Figure BDA0003830862700000053
is a weight matrix of the starting position,
Figure BDA0003830862700000054
to weight matrix with end position, softmax is the activation function, P start As starting position probability, P end To end position probability, len (p) j ) Is p j The character length of (2);
based on the starting position I start And an end position I end At each candidate paragraph p j To select an answer segment associated with the formatted query content.
Preferably, in selecting the answer segment associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

wherein CE denotes the cross-entropy loss function, Label_start is the start position of the standard answer label, Label_end is the end position of the standard answer label, Label_span is the answer segment of the standard answer label from start position to end position, and α, β, γ are hyper-parameters.
Preferably, determining the answer-segment matching degree of the format-processed query content and each answer segment comprises:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{a_j} of the answer segment of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

determining the answer-segment matching degree s_j^a of the format-processed query content encoding u_q and the j-th answer segment:

s_j^a = sim(u_q, u_{a_j})

wherein a_j is the answer segment of the j-th candidate paragraph and sim(·,·) is the same vector similarity used for paragraph matching.
Preferably, the determining the matching degree of the format-processed query content and the answer segment based on the candidate-paragraph matching degree and the answer-segment matching degree includes:

performing logarithmic smoothing on the answer-segment matching degree s_j^a to obtain the smoothed matching degree f(s_j^a); and

determining, based on the candidate-paragraph matching degree s_j^p and the smoothed matching degree f(s_j^a), the matching degree s of the format-processed query content and the answer segment:

s = s_j^p · f(s_j^a)

wherein f is a logarithmic smoothing function.
Preferably, selecting at least one target sub-paragraph associated with the formatted query content from a plurality of answer segments based on a degree of matching of the formatted query content with an answer segment comprises:
sorting the answer fragments according to the descending order of the matching degree of the query contents and the answer fragments after format processing so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N answer fragments with the maximum matching degree from the sorted list;
and determining, among the N answer segments with the greatest matching degree, at least one answer segment whose matching degree is greater than a second matching-degree threshold as a target sub-paragraph.
According to another aspect of the present invention, there is provided an apparatus for content matching based on a degree of matching, the apparatus including:
the processing unit is used for acquiring original query content input by a user and carrying out format processing on the original query content to acquire the query content subjected to format processing;
a first determining unit, configured to determine a matching degree between the query content subjected to format processing and a candidate paragraph of each text paragraph in a plurality of text paragraphs in a text content library, and determine a text paragraph with the matching degree greater than a first matching degree threshold as a candidate paragraph;
a second determining unit, configured to select an answer segment associated with the query content subjected to format processing in each candidate paragraph, and determine a matching degree of the query content subjected to format processing and the answer segment of each answer segment;
a third determining unit, configured to determine, based on the candidate paragraph matching degree and the answer segment matching degree, a matching degree between the query content subjected to format processing and the answer segment; and
and the selecting unit is used for selecting at least one target sub-paragraph associated with the query content subjected to format processing from a plurality of answer segments based on the matching degree of the query content subjected to format processing and the answer segments.
According to another aspect of the present invention, there is provided a computer-readable storage medium, wherein the storage medium stores a computer program for executing the method according to any of the above embodiments.
According to another aspect of the present invention, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instructions from the memory and executing the instructions to realize the method of any one of the above embodiments.
According to yet another aspect of the embodiments of the present disclosure, there is provided a computer program product including computer readable code, when the computer readable code runs on a device, a processor in the device executes a method for implementing any of the embodiments.
The innovation of the invention lies mainly in two points. First, answers to a given user question are screened by combining two similarity computations: the first step mainly performs text matching, and the second step extracts answer segments from the candidate paragraphs with a reading-comprehension algorithm, so the re-ranking considers text matching and reading comprehension simultaneously. Second, a distinctive loss function is adopted to train the matching degree between the question and the candidate paragraphs.
The main advantages of the invention follow from these two innovations. The re-ranking method jointly considers the text matching degree between the user question and each candidate paragraph and the similarity between the user question and the exact answer within the candidate paragraph, which improves the accuracy and stability of answer screening; and the loss function used to train the first-step question-paragraph matching network ensures that paragraphs relevant to the question are selected accurately.
Drawings
A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:
FIG. 1 is a flow chart of a method for matching content based on matching degree according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method for understanding an algorithm based on multiple document re-ranking reads, according to an embodiment of the invention;
FIG. 3 is a model diagram of a multiple document re-ranking based reading understanding algorithm according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus for matching content based on matching degree according to an embodiment of the present invention.
Detailed Description
Fig. 1 is a flowchart of a method for matching content based on matching degree according to an embodiment of the present invention. As shown in fig. 1, the method 100 includes:
step 101, obtaining original query content input by a user, and performing format processing on the original query content to obtain the query content subjected to format processing.
In one embodiment, formatting original query content to obtain formatted query content includes: acquiring a content processing rule for format processing of original query content; and performing format processing on the original query content based on the content processing rule to obtain the query content subjected to format processing.
In one embodiment, before obtaining the original query content input by the user, the method further includes segmenting each document of the plurality of documents in the text content library according to the natural segment to obtain a plurality of natural segments; a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
In one embodiment, further comprising, determining a number of characters in each text passage; determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed; and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is less than or equal to the character number threshold value.
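The character-threshold segmentation described in this embodiment can be sketched as follows. The function names and the midpoint-split strategy are illustrative assumptions; the embodiment only requires that every resulting paragraph fall at or below the character-count threshold.

```python
def split_paragraph(text, max_chars):
    """Recursively split one paragraph until every piece is <= max_chars.

    Splitting at the midpoint is an illustrative choice, not stated in the
    patent text, which only fixes the character-count threshold.
    """
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    return split_paragraph(text[:mid], max_chars) + split_paragraph(text[mid:], max_chars)


def segment_library(paragraphs, max_chars=512):
    """Apply the threshold check to every text paragraph in the library."""
    result = []
    for p in paragraphs:
        result.extend(split_paragraph(p, max_chars))
    return result
```

In practice one would split on sentence or clause boundaries rather than at the raw midpoint, so that no answer segment is cut in half.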
Step 102, determining the matching degree of the query content subjected to format processing and the candidate paragraphs of each of the plurality of text paragraphs in the text content library, and determining the text paragraphs with the matching degree greater than a first matching degree threshold value as the candidate paragraphs.
In one embodiment, determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph in the plurality of text paragraphs within the text content library includes:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

calculating the candidate-paragraph matching degree s_j^p of the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library as the similarity of the two encodings:

s_j^p = sim(u_q, u_{p_j})

wherein 0 < j ≤ na, j is a natural number, na is the number of text paragraphs in the text content library, and sim(·,·) denotes a vector similarity (e.g. the cosine similarity) between the two semantic feature encodings.
In one embodiment, when determining the candidate-paragraph matching degree between the format-processed query content and each text paragraph within the text content library, and determining as candidate paragraphs the text paragraphs whose matching degree is greater than the first matching-degree threshold, a loss function of the following margin-ranking form is involved:

L_match = Σ_{p+ ∈ Ω+} Σ_{p− ∈ Ω−} max(0, λ − s^p(query, p+) + s^p(query, p−))

wherein λ is a hyper-parameter (the margin), Ω− is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query.
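Reading this loss as a margin-ranking objective over the relevant set Ω+ and the irrelevant set Ω− is an assumed interpretation (the patent's exact formula survives only as an image); under that assumption, a minimal sketch:

```python
def margin_ranking_loss(pos_scores, neg_scores, margin):
    """Assumed margin-ranking reading of the paragraph-matching loss:
    every relevant paragraph score should exceed every irrelevant one by at
    least `margin` (the hyper-parameter lambda); shortfalls are penalized.
    """
    loss = 0.0
    for sp in pos_scores:       # s^p over documents in Omega+
        for sn in neg_scores:   # s^p over documents in Omega-
            loss += max(0.0, margin - sp + sn)
    return loss
```

The loss is zero exactly when the margin is respected for every relevant/irrelevant pair, which matches the stated goal of accurately selecting paragraphs relevant to the question.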
In one embodiment, after determining as candidate paragraphs the text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold, the candidate paragraphs are formed into a candidate paragraph set:

P = { p_j | s_j^p > t_1 }

wherein t_1 is the first matching-degree threshold.
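The paragraph-matching and threshold-filtering steps can be sketched as follows. The cosine similarity and the plain-Python vectors are stand-ins: in the patent the encodings u_q and u_{p_j} come from the pre-trained Bert_1 model, and the exact similarity function is not spelled out in this text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def candidate_paragraphs(u_q, paragraph_encodings, first_threshold):
    """Keep (index, s_j^p) for paragraphs whose candidate-paragraph matching
    degree s_j^p = sim(u_q, u_{p_j}) exceeds the first matching-degree
    threshold, forming the candidate paragraph set."""
    scored = [(j, cosine(u_q, u_p)) for j, u_p in enumerate(paragraph_encodings)]
    return [(j, s) for j, s in scored if s > first_threshold]
```

The retained indices are exactly the candidate paragraph set P described above; the scores are reused later when the final ranking is computed.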
step 103, selecting an answer segment associated with the query content subjected to format processing in each candidate paragraph, and determining the matching degree of the query content subjected to format processing and the answer segment of each answer segment.
In one embodiment, selecting an answer segment associated with the format-processed query content in each candidate paragraph comprises:

using the pre-trained BERT language-representation model Bert_2 to determine the semantic feature encoding u_qj of the answer segment associated with the format-processed query content:

u_qj = Bert_2(concat(query, p_j))

determining the start position I_start and the end position I_end of the answer segment in the candidate paragraph:

P_start = softmax(W_start · u_qj)
P_end = softmax(W_end · u_qj)
I_start = argmax_{1 ≤ i ≤ len(p_j)} P_start(i)
I_end = argmax_{1 ≤ i ≤ len(p_j)} P_end(i)

wherein W_start is the weight matrix of the start position, W_end is the weight matrix of the end position, softmax is the activation function, P_start is the start-position probability, P_end is the end-position probability, and len(p_j) is the character length of p_j;

selecting, based on the start position I_start and the end position I_end, the answer segment associated with the format-processed query content in each candidate paragraph p_j.
In one embodiment, in selecting the answer segment associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

wherein CE denotes the cross-entropy loss function, Label_start is the start position of the standard answer label, Label_end is the end position of the standard answer label, Label_span is the answer segment of the standard answer label from start position to end position, and α, β, γ are hyper-parameters.
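The start/end prediction can be sketched as below, assuming the products W_start·u_qj and W_end·u_qj have already been reduced to one logit per character of the paragraph. Constraining the end index to lie at or after the start index is an added safeguard for well-formed spans, not something the text above states.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def extract_span(start_logits, end_logits):
    """I_start = argmax of P_start; I_end = argmax of P_end, restricted here
    to positions at or after I_start (an assumption for span validity)."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    i_start = max(range(len(p_start)), key=p_start.__getitem__)
    i_end = max(range(i_start, len(p_end)), key=p_end.__getitem__)
    return i_start, i_end
```

The answer segment a_j is then the substring of the candidate paragraph p_j from I_start to I_end inclusive.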
And step 104, determining the matching degree of the query content subjected to format processing and the answer segment based on the matching degree of the candidate paragraphs and the matching degree of the answer segment.
In one embodiment, determining the answer-segment matching degree of the format-processed query content and each answer segment comprises:

using the pre-trained BERT language-representation model Bert_1 to determine the semantic feature encoding u_{a_j} of the answer segment of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

determining the answer-segment matching degree s_j^a of the format-processed query content encoding u_q and the j-th answer segment:

s_j^a = sim(u_q, u_{a_j})

wherein a_j is the answer segment of the j-th candidate paragraph and sim(·,·) is the same vector similarity used for paragraph matching.
In one embodiment, determining the matching degree of the format-processed query content and the answer segment based on the candidate-paragraph matching degree and the answer-segment matching degree comprises:

performing logarithmic smoothing on the answer-segment matching degree s_j^a to obtain the smoothed matching degree f(s_j^a); and

determining, based on the candidate-paragraph matching degree s_j^p and the smoothed matching degree f(s_j^a), the matching degree s of the format-processed query content and the answer segment:

s = s_j^p · f(s_j^a)

where f is a log smoothing function.
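A sketch of the combination step; log(1 + x) is used here as one plausible logarithmic smoothing function f, since the text only states that f is a log smoothing function.

```python
import math

def overall_matching_degree(s_para, s_answer, f=lambda x: math.log1p(x)):
    """Overall matching degree s = s_j^p * f(s_j^a). The default
    f(x) = log(1 + x) is an assumed instance of the log smoothing function."""
    return s_para * f(s_answer)
```

Because log(1 + x) grows slowly, the smoothing damps the influence of the answer-segment score relative to the paragraph score, so a paragraph-level mismatch cannot be fully compensated by a high answer-segment score.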
And 105, selecting at least one target sub-paragraph associated with the format-processed query content from the plurality of answer segments based on the matching degree of the format-processed query content and the answer segments.
In one embodiment, selecting at least one target sub-paragraph from the plurality of answer segments that is associated with the formatted query content based on a degree of matching of the formatted query content to the answer segments comprises:
sorting the answer fragments according to the descending order of the matching degree of the query contents and the answer fragments after format processing so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N answer fragments with the maximum matching degree from the sorted list;
and determining, among the N answer segments with the greatest matching degree, at least one answer segment whose matching degree is greater than a second matching-degree threshold as a target sub-paragraph.
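The final selection step can be sketched as follows; representing the answer segments and their overall matching degrees as a dict is an illustrative choice.

```python
def select_targets(answer_scores, n, second_threshold):
    """Sort answer segments by matching degree in descending order, keep the
    N best, then keep only those above the second matching-degree threshold.
    `answer_scores` maps each answer segment to its matching degree s."""
    ranked = sorted(answer_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_n = ranked[:n]
    return [seg for seg, score in top_n if score > second_threshold]
```

The second threshold prevents low-quality answers from being returned merely because fewer than N good answers exist.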
According to an alternative, the method comprises: step 1011, obtaining the original query content input by the user, and performing format processing on the original query content to obtain the query content subjected to format processing.
At step 1012, the matching degree of the formatted query content and the text of each text paragraph in the plurality of text paragraphs in the text content library is determined, and the text paragraphs with the matching degree greater than the first threshold matching degree are determined as candidate paragraphs.
Step 1013, selecting a result sub-paragraph associated with the format-processed query content in each candidate paragraph, and determining a result matching degree of the format-processed query content and each result sub-paragraph.
And 1014, determining the matching degree of the query content subjected to format processing and the result subsection based on the text matching degree and the result matching degree.
Step 1015, selecting at least one target sub-paragraph from the plurality of result sub-paragraphs that is associated with the formatted query content based on the matching degree between the formatted query content and the result sub-paragraphs.
The format processing of the original query content to obtain the format-processed query content includes:
acquiring a content processing rule for format processing of original query content;
and performing format processing on the original query content based on the content processing rule to obtain the query content subjected to format processing.
The method further comprises, before the original query content input by the user is obtained,
segmenting each document in a plurality of documents in a text content library according to a natural segment to obtain a plurality of natural segments;
a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
Further comprising, determining the number of characters in each text paragraph;
determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed;
and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is less than or equal to the character number threshold value.
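The splitting rule above can be sketched as follows (a hedged illustration: the source does not say how an over-long paragraph is divided, so this sketch simply bisects recursively until every piece fits under the character-number threshold):

```python
def split_paragraph(text, max_chars):
    """Recursively bisect a paragraph until every piece is at most max_chars
    characters long (the bisection rule is an assumption; the source only
    requires that every resulting paragraph fits under the threshold)."""
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    return split_paragraph(text[:mid], max_chars) + split_paragraph(text[mid:], max_chars)

pieces = split_paragraph("x" * 10, max_chars=4)
# every piece has at most 4 characters and the pieces concatenate back to the input
```

A production implementation would prefer splitting at sentence or punctuation boundaries rather than at the midpoint.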
Determining the text matching degree between the format-processed query content and each of the plurality of text paragraphs in the text content library comprises the following steps:

using a Bert pre-trained language characterization model Bert_1, determining the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

using the language characterization model Bert_1, determining the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

and calculating the text matching degree between the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library:

[matching-degree formula provided as an image in the original]

where 0 < j ≤ na, j is a natural number, and na is the number of text paragraphs in the text content library.
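Since the matching-degree formula itself is reproduced only as an image in the original, the following sketch substitutes cosine similarity between the two encodings, a common choice for BERT-based retrieval but an assumption here; the short vectors are hand-written stand-ins for Bert_1 outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two encoding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def select_candidate_paragraphs(u_q, paragraph_encodings, first_threshold):
    """Score every paragraph encoding against the query encoding and keep the
    indices (with scores) whose matching degree exceeds the first
    matching-degree threshold, i.e. the candidate paragraphs."""
    scored = [(j, cosine(u_q, u_p)) for j, u_p in enumerate(paragraph_encodings)]
    return [(j, s) for j, s in scored if s > first_threshold]

# Hand-written stand-ins for Bert_1 encodings of the query and two paragraphs.
u_q = [1.0, 0.0]
paragraphs = [[1.0, 0.1], [0.0, 1.0]]
candidates = select_candidate_paragraphs(u_q, paragraphs, first_threshold=0.5)
# only paragraph 0 survives the threshold
```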
When determining the text matching degree between the format-processed query content and each of the plurality of text paragraphs in the text content library, and determining text paragraphs whose text matching degree is greater than the first matching degree threshold as candidate paragraphs, the following loss function is involved:

[loss-function formula provided as an image in the original]

where λ is a hyper-parameter, Ω- is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query.

After the text paragraphs whose text matching degree is greater than the first matching degree threshold are determined as candidate paragraphs, the candidate paragraphs form a candidate paragraph set:

[candidate-paragraph-set expression provided as an image in the original]
Selecting, in each candidate paragraph, a result sub-paragraph associated with the format-processed query content comprises:

using a Bert pre-trained language characterization model Bert_2, determining the semantic feature encoding u_qj of the result sub-paragraph associated with the format-processed query content:

u_qj = Bert_2(concat(query, p_j))

determining the starting position I_start and the ending position I_end at which the result sub-paragraph falls within the candidate paragraph:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s is the weight matrix of the starting position, W_e is the weight matrix of the ending position, softmax is the activation function, P_start is the starting-position probability, P_end is the ending-position probability, and len(p_j) is the character length of p_j; and

based on the starting position I_start and the ending position I_end, selecting, in each candidate paragraph p_j, the result sub-paragraph associated with the format-processed query content.
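A minimal sketch of the start/end decoding step follows; the logits are hand-written stand-ins for W_s·u_qj and W_e·u_qj, and a real system would additionally constrain I_end ≥ I_start, which the source does not detail:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def extract_span(start_logits, end_logits, paragraph):
    """Turn per-character start/end logits into an answer span via
    softmax + argmax, the standard extractive-QA decoding step."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    i_start = p_start.index(max(p_start))
    i_end = p_end.index(max(p_end))
    return paragraph[i_start:i_end + 1], p_start[i_start], p_end[i_end]

paragraph = "rain tomorrow"
# Peaks at character 5 ("t") and the final character mark the intended span.
span, p_s, p_e = extract_span(
    [0.1, 0.2, 0.1, 0.1, 0.1, 3.0, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1] * 12 + [3.0],
    paragraph,
)
# span -> "tomorrow"
```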
In selecting the result sub-paragraph associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

where CE denotes the cross-entropy loss function, Label_start is the starting position of the standard answer label, Label_end is the ending position of the standard answer label, and Label_span denotes the answer segment of the standard answer label from the starting position to the ending position; α, β, and γ are hyper-parameters.
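The composite loss can be illustrated with a tiny numeric example (the probability distributions and gold labels are hypothetical; in training, P_start, P_end, and P_span would come from the model):

```python
import math

def cross_entropy(probs, gold_index):
    """Cross entropy for one example: negative log-probability of the gold index."""
    return -math.log(probs[gold_index])

def span_loss(p_start, p_end, p_span, label_start, label_end, label_span,
              alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the start-, end-, and span-level cross entropies,
    mirroring L = alpha*CE + beta*CE + gamma*CE."""
    return (alpha * cross_entropy(p_start, label_start)
            + beta * cross_entropy(p_end, label_end)
            + gamma * cross_entropy(p_span, label_span))

# Hypothetical predicted distributions and gold labels.
loss = span_loss([0.7, 0.3], [0.2, 0.8], [0.9, 0.1],
                 label_start=0, label_end=1, label_span=0)
# loss = -(ln 0.7 + ln 0.8 + ln 0.9) ≈ 0.685
```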
In one embodiment, the method further comprises filtering the candidate answers (e.g., the candidate answers or answer snippets in the candidate paragraphs). Specifically, the matching degree of each candidate answer (or its score, matching value, matching score, and the like) is determined according to the formulas above for the starting position I_start and the ending position I_end at which the result sub-paragraph falls within the candidate paragraph; the matching degree takes the average of the starting-position and ending-position probabilities, and the filtered set of relevant candidate answers meeting a specified threshold t2 is retained.
In one embodiment, the method further comprises predicting the association probability between the query content query and the candidate answers: after the answer segments of the candidate paragraphs are predicted, the association probability between the query content query and each candidate answer segment is predicted.
Semantic feature encodings are determined (e.g., using a Bert pre-trained language characterization model Bert_3, the global semantic feature encoding and the character-level semantic feature encoding are determined):

[H_cls, H_tokens] = Bert_3([query, answer])

where H_cls is the global semantic feature encoding (Global Embedding); H_tokens is the character-level semantic feature encoding (token-level Embedding); answer is a candidate answer taken from the candidate answer set; and query is the query content.
In one embodiment, the method further comprises performing a feature-hierarchy decomposition (the feature hierarchy of the model for predicting the association probability between the query content query and the candidate answers): the features are hierarchically decomposed into an intent layer, a core entity layer, and a relation layer. The core entity layer masks the character encodings of the non-core entities in the query content query and the candidate answer; the relation layer masks the character encodings of the core entities in the query content query and the candidate answer; the intent layer retains the full character encoding. On this basis, the three layers of character-string encodings are subjected to matrix transformation and, after an average pooling layer, are respectively expressed as h_i for the intent layer, h_ce for the core entity layer, and h_r for the relation layer:

[transformation formulas provided as an image in the original]
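The three-layer masking can be sketched as follows (a hedged illustration: the source does not say how the mask is realized, so zero vectors are used here, and the per-token `core_entity_flags` annotation is hypothetical):

```python
def mask_encodings(token_encodings, core_entity_flags):
    """Build the three feature layers by masking character-level encodings:
    the core entity layer masks non-core-entity tokens, the relation layer
    masks core-entity tokens, and the intent layer keeps every token.
    Zero vectors serve as the mask here (an assumption)."""
    dim = len(token_encodings[0])
    zero = [0.0] * dim
    intent = [list(enc) for enc in token_encodings]
    core = [list(enc) if flag else list(zero)
            for enc, flag in zip(token_encodings, core_entity_flags)]
    relation = [list(zero) if flag else list(enc)
                for enc, flag in zip(token_encodings, core_entity_flags)]
    return intent, core, relation

# Three toy token encodings; only the middle token is a core entity.
encodings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
flags = [False, True, False]
intent, core, relation = mask_encodings(encodings, flags)
```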
determining the probability distribution of the hierarchical features:

[probability-distribution formula provided as an image in the original]

where σ is the activation function, W_1 is a trainable parameter, and y_u is the label of the probability distribution;
determining the global feature probability distribution:

[global-feature probability-distribution formula provided as an image in the original]

where W_2 is a trainable parameter; y_g is the label of the global probability distribution; query is the query content, a query sentence, or query keywords; and answer is a candidate answer;
determining the loss function (the loss function used in predicting the association probability between the query content and the candidate answers), which includes:

determining a global loss function:

L_G = -log P(y_g | query, answer)

determining a distribution-difference loss function:

L_D = F(P(y_u | query, answer), P(y_g | query, answer)) + F(P(y_g | query, answer), P(y_u | query, answer))

where

[the definition of F is provided as an image in the original]

and determining the overall loss function:

L = L_G + λ·L_D

where λ is a hyper-parameter.
Prediction of the association probability:

[association-probability formula provided as an image in the original]

Re-ranking to obtain the answers (the answers matching or corresponding to the query content query; for example, selecting the topN answers from the candidate answers): all answer candidate sets are re-ranked according to the matching degree between the query content query and the candidate answers (for example, the association probability predicted above), and the topN answers are selected as the final result, where N is a natural number (i.e., the N answers with the largest matching degree are selected).
In one embodiment, determining the result matching degree between the format-processed query content and each result sub-paragraph comprises:

using the Bert pre-trained language characterization model Bert_1, determining the semantic feature encoding u_{a_j} of the result sub-paragraph of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

and determining the result matching degree between the format-processed query content u_q and the j-th result sub-paragraph:

[result-matching-degree formula provided as an image in the original]

where a_j is the result sub-paragraph of the j-th candidate paragraph.
Determining the matching degree between the format-processed query content and the result sub-paragraphs based on the text matching degree and the result matching degree comprises:

performing logarithmic smoothing on the result matching degree to obtain a smoothed matching degree; and

determining, based on the text matching degree and the smoothed matching degree, the matching degree s between the format-processed query content and the result sub-paragraph:

[the smoothing and combination formulas are provided as images in the original]

where f is a log smoothing function.
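The combination step can be illustrated as below; since the actual formulas appear only as images in the original, both the smoothing function f(x) = log(1 + x) and the multiplicative combination are assumptions chosen to match the stated structure (a text matching degree combined with a log-smoothed result matching degree):

```python
import math

def combined_matching_degree(text_degree, result_degree):
    """Log-smooth the result matching degree with f(x) = log(1 + x) and
    weight it by the text matching degree. Both choices are assumptions;
    the source states only that a log smoothing function f is applied
    before the two degrees are combined into s."""
    return text_degree * math.log1p(result_degree)

# Rank two hypothetical result sub-paragraphs: (name, text degree, result degree).
candidates = [("answer A", 0.9, 0.7), ("answer B", 0.6, 0.95)]
ranked = sorted(candidates, key=lambda t: combined_matching_degree(t[1], t[2]),
                reverse=True)
```

Log smoothing dampens very large result matching degrees so that a strong paragraph-level match is not overwhelmed by a single confident span score.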
Selecting at least one target sub-paragraph from the plurality of result sub-paragraphs that is associated with the formatted query content based on a degree of matching of the formatted query content to the result sub-paragraph, including:
sorting the result sub-paragraphs in descending order of their matching degree with the format-processed query content, so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N result subsections with the maximum matching degree from the sorted list;
and determining at least one result sub-paragraph of the N result sub-paragraphs with the maximum matching degree, wherein the matching degree of the at least one result sub-paragraph is greater than a second matching degree threshold value, as the target sub-paragraph.
FIG. 2 is a flow diagram of a reading comprehension algorithm based on multi-document re-ranking according to an embodiment of the present invention.
Typically, document preprocessing is performed first: the natural segments of all candidate documents are subjected to a preliminary segmentation. Since the headings contain clear service information and are highly relevant to the question, the multi-level headings and each segment of content are concatenated with special symbols; if the resulting paragraph does not exceed the preset maximum length, it is taken as the preprocessing result; otherwise, it is split further. Finally, a paragraph candidate set over the multiple documents is obtained:

[paragraph-candidate-set expression provided as an image in the original]
Then, the relevancy matching between the query and the candidate paragraphs is determined. As shown in FIG. 2, the query and paragraph_1, paragraph_2, …, paragraph_N are encoded at the semantic coding layer. For example, the semantic feature encodings are determined: the query and all candidate paragraphs are semantically encoded with a Bert pre-trained language characterization model, and the encoding results are expressed as:

u_q = Bert_1(query)

u_{p_j} = Bert_1(p_j)
Determining the matching-degree prediction for the query and the candidate paragraphs:

[matching-degree formula provided as an image in the original]
Determining the loss function:

[loss-function formula provided as an image in the original]

where λ denotes a hyper-parameter; Ω- denotes the set of candidate documents irrelevant to the query; and Ω+ denotes the set of candidate documents relevant to the query.
Filtering the candidate paragraphs: the candidate paragraphs are scored with the matching-degree formula above (e.g., a matching value or matching score is determined for each candidate paragraph), and the filtered set of relevant candidate paragraphs satisfying threshold t1 is retained, expressed as:

[filtered candidate-paragraph-set expression provided as an image in the original]
Next, at the semantic matching and answer extraction layer, the answer segments of the candidate paragraphs are predicted. As shown in FIG. 2, matching pairs of the query with paragraph_1, paragraph_2, …, paragraph_N are constructed; then matching pairs of the answer with paragraph_1, paragraph_2, …, paragraph_N are constructed; and finally the matching degrees of the query with the paragraph-answer matching pairs are determined.
Semantic feature coding: semantic coding is carried out on the query and the related candidate paragraphs, a Bert pre-trained language representation model is still adopted, and the coding result is expressed as:
u qj =Bert 2 (concat(query,p j ))
Answer start and end position prediction:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s and W_e are the weight matrices of the start position and the end position, respectively.
Loss function:
L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)
where CE represents the cross entropy loss function, label is the standard answer Label, and span represents the segment from the start to the end position.
At the re-ranking layer, the optimal answer is obtained by re-ranking. Predicting the matching degree between the query and the answer:

u_{a_j} = Bert_1(a_j)

[matching-degree formula provided as an image in the original]

where a_j is the answer fragment of candidate paragraph j predicted in step 2.
Combining the matching degree between the query and the candidate paragraphs with the matching degree between the query and the answer, the final answer is predicted:

[combination formula provided as an image in the original]

where f is a log smoothing function.
At the answer prediction layer, the answers in the candidate paragraphs are sorted by s, and the topN answers whose s is greater than threshold t2 are returned.
Fig. 3 is a model diagram of the reading comprehension algorithm based on multi-document re-ranking according to an embodiment of the present invention. During operation, the model based on the multi-document re-ranking reading comprehension algorithm implements the following:
step one, preprocessing a document.
First, a preliminary segmentation is performed on the natural segments of all candidate documents. Since the headings contain clear service information and are highly relevant to the question, the multi-level headings and each segment of content are concatenated with special symbols; if the resulting paragraph does not exceed the preset maximum length, it is taken as the preprocessing result; otherwise, it is split further. Finally, a paragraph candidate set over the multiple documents is obtained:

[paragraph-candidate-set expression provided as an image in the original]
and (II) building a model and determining an answer corresponding to the query content by using the model.
1. Determining the association degree matching of the query and the candidate paragraphs:
(1.1) Performing semantic feature encoding: the query and all candidate paragraphs are semantically encoded with a Bert pre-trained language characterization model, and the encoding results are expressed as:

u_q = Bert_1(query)

u_{p_j} = Bert_1(p_j)
(1.2) Determining the matching-degree prediction for the query and the candidate paragraphs:

[matching-degree formula provided as an image in the original]
(1.3) Determining the loss function:

[loss-function formula provided as an image in the original]

where λ denotes a hyper-parameter; Ω- denotes the set of candidate documents irrelevant to the query; and Ω+ denotes the set of candidate documents relevant to the query.
(1.4) Filtering the candidate paragraphs: the matching degree of each candidate paragraph is determined according to the formula of step (1.2) (for example, by scoring the candidate paragraphs to identify the matching degree), and the filtered set of relevant candidate paragraphs satisfying threshold t1 is retained, denoted as:

[filtered candidate-paragraph-set expression provided as an image in the original]
2. Predicting answer segments of candidate paragraphs:
and (2.1) carrying out semantic feature coding: semantic coding is carried out on the query and the related candidate paragraphs, a Bert pre-trained language characterization model is still adopted, and the coding result is expressed as follows:
u qj =Bert 2 (concat(query,p j ))
(2.2) Predicting the start position and the end position of the answer:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s and W_e are the weight matrices of the start position and the end position, respectively.
(2.3) determining a loss function:
L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)
where CE represents the cross entropy loss function, label is the standard answer Label, and span represents the segment from the start to the end position.
(2.4) Filtering the candidate answers: the score of each candidate answer (e.g., a matching-degree score) is determined according to step (2.2), the score being the average of the starting-position probability and the ending-position probability, and the filtered set of relevant candidate answers satisfying threshold t2 is retained.
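Step (2.4) can be sketched directly (the candidate tuples and the threshold value are hypothetical; the score is the stated average of the start- and end-position probabilities):

```python
def filter_candidate_answers(candidates, t2):
    """Score each candidate answer as the average of its start-position and
    end-position probabilities, and keep those meeting threshold t2."""
    kept = []
    for answer, p_start, p_end in candidates:
        score = (p_start + p_end) / 2.0
        if score >= t2:
            kept.append((answer, score))
    return kept

# Hypothetical candidate answers with their start/end position probabilities.
candidates = [("heavy rain", 0.9, 0.7), ("light snow", 0.3, 0.2)]
kept = filter_candidate_answers(candidates, t2=0.5)
# only "heavy rain" (score ≈ 0.8) survives
```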
3. Predicting the association probability between the query and the candidate answers (e.g., after the answer segments of the candidate paragraphs are predicted, the association probability between the query and each candidate answer segment is predicted):
(3.1) semantic feature coding:
[H_cls, H_tokens] = Bert_3([query, answer])

where H_cls is the global semantic feature encoding (Global Embedding); H_tokens is the character-level semantic feature encoding (token-level Embedding); answer is a candidate answer taken from the candidate answer set; and query is the query content.
(3.2) Feature-hierarchy decomposition (e.g., the feature hierarchy of the model for predicting the association probability between the query and the candidate answers): the features are hierarchically decomposed into an intent layer, a core entity layer, and a relation layer. The core entity layer masks the character encodings of the non-core entities in the query and the answer; the relation layer masks the character encodings of the core entities in the query and the answer; the intent layer retains the full character encoding. On this basis, the three layers of character-string encodings are subjected to matrix transformation and, after an average pooling layer, are respectively expressed as h_i for the intent layer, h_ce for the core entity layer, and h_r for the relation layer:

[transformation formulas provided as an image in the original]
(3.3) Joint probability distribution of the hierarchical features:

[joint probability-distribution formula provided as an image in the original]

where σ is the activation function; W_1 is a trainable parameter; and y_u is the label of the joint probability distribution.
(3.4) Global feature probability distribution:

[global-feature probability-distribution formula provided as an image in the original]

where W_2 is a trainable parameter and y_g is the label of the global probability distribution.
(3.5) determining a loss function (the loss function in predicting the association probability of the query with the candidate answer):
(3.5.1) Global loss function:

L_G = -log P(y_g | query, answer)

(3.5.2) Distribution-difference loss function:

L_D = F(P(y_u | query, answer), P(y_g | query, answer)) + F(P(y_g | query, answer), P(y_u | query, answer))

where

[the definition of F is provided as an image in the original]
(3.6) Joint loss function:

L = L_G + λ·L_D

where λ is a hyper-parameter.
(3.7) Association probability prediction:

[association-probability formula provided as an image in the original]

Re-ranking to obtain the answers: all answer candidate sets are re-ranked according to the matching degree between the query and the answers determined in step (3.7), and the topN answers are selected as the final result, where N is a natural number, such as 5 or 10.
Fig. 4 is a schematic structural diagram of an apparatus for matching content based on matching degree according to an embodiment of the present invention. The apparatus 400 comprises: a processing unit 401, a first determining unit 402, a second determining unit 403, a third determining unit 404, and a selecting unit 405.
The processing unit 401 is configured to obtain an original query content input by a user, and perform format processing on the original query content to obtain a query content subjected to format processing. The processing unit 401 is specifically configured to obtain a content processing rule for performing format processing on original query content; and performing format processing on the original query content based on the content processing rule to obtain the query content subjected to format processing.
The apparatus also comprises a preprocessing unit, configured to segment each document in the plurality of documents in the text content library according to natural segments to obtain a plurality of natural segments; determine a plurality of levels of headings in each document; and form each level of heading and at least one natural segment associated with the heading into a text paragraph. The preprocessing unit is further configured to determine the number of characters in each text paragraph; determine text paragraphs whose number of characters is greater than the character-number threshold as text paragraphs to be processed; and split the text paragraphs to be processed until the number of characters of any text paragraph obtained by splitting is less than or equal to the character-number threshold.
A first determining unit 402, configured to determine the candidate-paragraph matching degree between the format-processed query content and each of a plurality of text paragraphs in the text content library, and determine a text paragraph whose matching degree is greater than a first matching degree threshold as a candidate paragraph.
The first determination unit 402 is specifically configured to: use a Bert pre-trained language characterization model Bert_1 to determine the semantic feature encoding u_q of the format-processed query content query:

u_q = Bert_1(query)

use the language characterization model Bert_1 to determine the semantic feature encoding u_{p_j} of each text paragraph p_j:

u_{p_j} = Bert_1(p_j)

and calculate the candidate-paragraph matching degree between the format-processed query content and the j-th text paragraph among the plurality of text paragraphs in the text content library:

[matching-degree formula provided as an image in the original]

where 0 < j ≤ na, j is a natural number, and na is the number of text paragraphs in the text content library.
The first determining unit 402 is specifically configured such that, when the candidate-paragraph matching degree between the format-processed query content and each of the plurality of text paragraphs in the text content library is determined and text paragraphs whose candidate-paragraph matching degree is greater than the first matching degree threshold are determined as candidate paragraphs, the following loss function is involved:

[loss-function formula provided as an image in the original]

where λ is a hyper-parameter, Ω- is the set of documents irrelevant to the format-processed query content query, and Ω+ is the set of documents relevant to the format-processed query content query.

The first determining unit 402 is further configured to, after determining text paragraphs whose candidate-paragraph matching degree is greater than the first matching degree threshold as candidate paragraphs, form the candidate paragraphs into a candidate paragraph set:

[candidate-paragraph-set expression provided as an image in the original]
A second determining unit 403, configured to select, in each candidate paragraph, an answer segment associated with the format-processed query content, and determine the answer-segment matching degree between the format-processed query content and each answer segment.
The second determination unit 403 is specifically configured to: use a Bert pre-trained language characterization model Bert_2 to determine the semantic feature encoding u_qj of the answer segment associated with the format-processed query content:

u_qj = Bert_2(concat(query, p_j))

determine the starting position I_start and the ending position I_end of the answer segment within the candidate paragraph:

P_start = softmax(W_s · u_qj)
I_start = argmax(P_start)
P_end = softmax(W_e · u_qj)
I_end = argmax(P_end)

where W_s is the weight matrix of the starting position, W_e is the weight matrix of the ending position, softmax is the activation function, P_start is the starting-position probability, P_end is the ending-position probability, and len(p_j) is the character length of p_j; and

based on the starting position I_start and the ending position I_end, select, in each candidate paragraph p_j, the answer segment associated with the format-processed query content.
In selecting the answer segment associated with the format-processed query content in each candidate paragraph, the following loss function is involved:

L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)

where CE denotes the cross-entropy loss function, Label_start is the starting position of the standard answer label, Label_end is the ending position of the standard answer label, and Label_span denotes the answer segment of the standard answer label from the starting position to the ending position; α, β, and γ are hyper-parameters.
A third determining unit 404, configured to determine the matching degree between the format-processed query content and the answer segments based on the candidate-paragraph matching degree and the answer-segment matching degree.
The third determining unit 404 is specifically configured to:
use the Bert pre-trained language characterization model Bert_1 to determine the semantic feature encoding u_{a_j} of the answer segment of the j-th candidate paragraph:

u_{a_j} = Bert_1(a_j)

and determine the answer-segment matching degree between the format-processed query content u_q and the j-th answer segment:

[answer-segment matching-degree formula provided as an image in the original]

where a_j is the answer segment of the j-th candidate paragraph.
The third determining unit 404 is specifically configured to: perform logarithmic smoothing on the answer-segment matching degree to obtain a smoothed matching degree; and determine, based on the candidate-paragraph matching degree and the smoothed matching degree, the matching degree s between the format-processed query content and the answer fragment:

[the smoothing and combination formulas are provided as images in the original]

where f is a log smoothing function.
A selecting unit 405, configured to select at least one target sub-paragraph associated with the format-processed query content from the multiple answer segments based on a matching degree between the format-processed query content and the answer segments.
The selecting unit 405 is specifically configured to sort the answer fragments in descending order of their matching degree with the format-processed query content, so as to generate a sorted list;
acquiring a preset extraction parameter N, and selecting N answer segments with the maximum matching degree from the sorted list;
and determining at least one answer segment with the matching degree larger than a second matching degree threshold value in the N answer segments with the maximum matching degree as a target subsection.

Claims (15)

1. A method for content matching based on a degree of matching, the method comprising:
acquiring original query content input by a user, and performing format processing on the original query content to acquire the query content subjected to format processing;
determining the candidate-paragraph matching degree between the format-processed query content and each of a plurality of text paragraphs in a text content library, and determining a text paragraph whose candidate-paragraph matching degree is greater than a first matching degree threshold as a candidate paragraph;
selecting an answer segment associated with the format-processed query content in each candidate paragraph, and determining the matching degree of the format-processed query content and the answer segment of each answer segment;
determining the matching degree of the query content subjected to format processing and the answer segment based on the matching degree of the candidate paragraphs and the matching degree of the answer segment; and
selecting at least one target sub-paragraph associated with the formatted query content from a plurality of answer segments based on a degree of matching of the formatted query content to an answer segment.
2. The method of claim 1, the formatting the original query content to obtain formatted query content, comprising:
acquiring a content processing rule for performing format processing on original query content;
and performing format processing on the original query content based on a content processing rule to obtain the query content subjected to format processing.
3. The method of claim 1, further comprising, prior to obtaining original query content entered by a user,
segmenting each document in the plurality of documents in the text content library according to a natural segment to obtain a plurality of natural segments;
a plurality of levels of headings in each document are determined, and each level of headings and at least one natural segment associated with the headings are formed into a text paragraph.
4. The method of claim 3, further comprising,
determining the number of characters in each text paragraph;
determining the text paragraphs with the number of characters larger than the character number threshold value as text paragraphs to be processed;
and segmenting the text paragraphs to be processed until the number of characters of any text paragraphs obtained through segmentation is smaller than or equal to a character number threshold value.
5. The method of claim 1, wherein determining the matching degree between the formatted query content and a candidate paragraph for each of a plurality of text paragraphs in the text content library comprises:
determining a semantic feature encoding u_q of the formatted query content query using the pre-trained Bert language representation model Bert_1:
u_q = Bert_1(query)
determining a semantic feature encoding u_pj of each text paragraph p_j using the pre-trained language representation model Bert_1:
u_pj = Bert_1(p_j)
and calculating the candidate-paragraph matching degree s_pj between the formatted query content and the j-th text paragraph of the plurality of text paragraphs in the text content library as the similarity of the two semantic feature encodings:
s_pj = sim(u_q, u_pj)
where 0 < j ≤ na, j is a natural number, and na is the number of text paragraphs in the text content library.
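A minimal sketch of the retrieval step in claim 5, with a toy bag-of-characters encoder standing in for Bert_1 and cosine similarity as an assumed form of the matching degree (the patent's exact similarity formula is not given here):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Similarity between two semantic feature encodings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def encode(text: str) -> list[float]:
    """Toy stand-in for Bert_1: bag-of-characters counts."""
    vocab = "abcdefghijklmnopqrstuvwxyz "
    return [float(text.lower().count(c)) for c in vocab]

u_q = encode("weather forecast for tomorrow")
paragraphs = ["tomorrow's weather forecast is sunny",
              "quarterly stock prices fell today"]
# one candidate-paragraph matching degree per text paragraph
scores = [cosine(u_q, encode(p)) for p in paragraphs]
```

In a real system, paragraphs whose score exceeds the first matching-degree threshold become the candidate paragraphs.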
6. The method of claim 5, wherein determining the text paragraphs whose candidate-paragraph matching degree is greater than a first matching-degree threshold as candidate paragraphs involves a training loss function, wherein λ is a hyper-parameter, Ω− is the set of documents irrelevant to the formatted query content query, and Ω+ is the set of documents relevant to the formatted query content query.
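The loss formula itself appears only as an image in the source; a common temperature-scaled contrastive loss over a relevant set Ω+ and an irrelevant set Ω− (an assumption on our part, not necessarily the patented form) looks like:

```python
import math

def contrastive_loss(sim_pos: float, sims_neg: list[float], lam: float = 1.0) -> float:
    """-log( e^{lam*s+} / (e^{lam*s+} + sum over negatives of e^{lam*s-}) )."""
    numerator = math.exp(lam * sim_pos)
    denominator = numerator + sum(math.exp(lam * s) for s in sims_neg)
    return -math.log(numerator / denominator)
```

Raising the similarity of the relevant paragraph lowers the loss, which is the behavior the claim's Ω+/Ω− split implies.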
7. The method of claim 5, further comprising, after determining the text paragraphs whose candidate-paragraph matching degree is greater than the first matching-degree threshold as candidate paragraphs, forming the candidate paragraphs into a candidate paragraph set.
8. The method of claim 1, wherein selecting, in each candidate paragraph, an answer segment associated with the formatted query content comprises:
determining a semantic feature encoding u_qj of the answer segment associated with the formatted query content using the pre-trained Bert language representation model Bert_2:
u_qj = Bert_2(concat(query, p_j))
determining the starting position I_start and the ending position I_end of the answer segment in the candidate paragraph:
P_start = softmax(u_qj · W_start)
P_end = softmax(u_qj · W_end)
I_start = argmax(P_start)
I_end = argmax(P_end)
where W_start is the weight matrix of the starting position, W_end is the weight matrix of the ending position, softmax is the activation function, P_start is the starting-position probability, P_end is the ending-position probability, and len(p_j) is the character length of p_j; and
selecting, based on the starting position I_start and the ending position I_end, the answer segment associated with the formatted query content in each candidate paragraph p_j.
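The span selection in claim 8 can be sketched as below. Picking the jointly most probable (start, end) pair subject to start ≤ end is a common refinement of independent per-position argmax, and is an assumption here rather than the patented rule:

```python
def extract_span(p_start: list[float], p_end: list[float]) -> tuple[int, int]:
    """Return (I_start, I_end) maximizing P_start[i] * P_end[j] subject to i <= j."""
    best_score, best_span = -1.0, (0, 0)
    for i, ps in enumerate(p_start):
        for j in range(i, len(p_end)):
            score = ps * p_end[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span

# toy position probabilities over a 4-token candidate paragraph
p_start = [0.1, 0.7, 0.1, 0.1]
p_end = [0.05, 0.1, 0.8, 0.05]
```

The answer segment is then the slice of the candidate paragraph between the returned positions.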
9. The method of claim 1, wherein selecting, in each candidate paragraph, the answer segment associated with the formatted query content involves the following loss function:
L = α·CE(P_start, Label_start) + β·CE(P_end, Label_end) + γ·CE(P_span, Label_span)
where CE denotes the cross-entropy loss function, Label_start is the starting position of the standard answer label, Label_end is the ending position of the standard answer label, Label_span denotes the answer segment of the standard answer label from the starting position to the ending position, and α, β and γ are hyper-parameters.
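A sketch of the weighted cross-entropy objective in claim 9; the discrete probability lists and label indices below are illustrative:

```python
import math

def cross_entropy(probs: list[float], label: int) -> float:
    """CE between a predicted position distribution and a one-hot standard label."""
    return -math.log(probs[label])

def answer_loss(p_start, p_end, p_span, labels, alpha=1.0, beta=1.0, gamma=1.0):
    """L = alpha*CE(P_start, Label_start) + beta*CE(P_end, Label_end) + gamma*CE(P_span, Label_span)."""
    return (alpha * cross_entropy(p_start, labels["start"])
            + beta * cross_entropy(p_end, labels["end"])
            + gamma * cross_entropy(p_span, labels["span"]))
```

Predictions concentrated on the standard answer label yield a lower loss than predictions concentrated elsewhere.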
10. The method of claim 1, wherein determining the answer-segment matching degree between the formatted query content and each answer segment comprises:
determining a semantic feature encoding u_aj of the answer segment of the j-th candidate paragraph using the pre-trained Bert language representation model Bert_1:
u_aj = Bert_1(a_j)
and determining the answer-segment matching degree s_aj between the formatted query content encoding u_q and the j-th answer segment as the similarity of the two semantic feature encodings:
s_aj = sim(u_q, u_aj)
where a_j is the answer segment of the j-th candidate paragraph.
11. The method of claim 10, wherein determining the matching degree between the formatted query content and the answer segment based on the candidate-paragraph matching degree and the answer-segment matching degree comprises:
performing logarithmic smoothing on the answer-segment matching degree to obtain a smoothed matching degree; and
determining the matching degree s between the formatted query content and the answer segment based on the candidate-paragraph matching degree and the smoothed matching degree,
where the smoothing is performed by a logarithmic smoothing function f.
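The combining formula in claim 11 appears only as an image in the source. One plausible reading, with f(x) = log(1 + x) as the logarithmic smoothing and an additive combination (both assumptions), is:

```python
import math

def smooth(s_answer: float) -> float:
    """Assumed logarithmic smoothing f: f(x) = log(1 + x)."""
    return math.log1p(s_answer)

def combined_matching_degree(s_paragraph: float, s_answer: float) -> float:
    """Assumed additive combination of the candidate-paragraph matching degree
    and the smoothed answer-segment matching degree."""
    return s_paragraph + smooth(s_answer)
```

The smoothing compresses large answer-segment scores so that neither component dominates the final matching degree s.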
12. The method of claim 11, wherein selecting, from a plurality of answer segments, at least one target sub-paragraph associated with the formatted query content based on the matching degree between the formatted query content and the answer segments comprises:
sorting the answer segments in descending order of the matching degree between the formatted query content and the answer segments to generate a sorted list;
acquiring a preset extraction parameter N and selecting, from the sorted list, the N answer segments with the largest matching degrees; and
determining, among the N answer segments with the largest matching degrees, at least one answer segment whose matching degree is greater than a second matching-degree threshold as a target sub-paragraph.
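The selection in claim 12 reduces to a sort, a top-N cutoff, and a threshold filter; the segment names and scores below are illustrative:

```python
def select_target_subparagraphs(scored_segments, n, threshold):
    """Sort (segment, score) pairs by score descending, keep the top n,
    then keep only those above the second matching-degree threshold."""
    ranked = sorted(scored_segments, key=lambda pair: pair[1], reverse=True)
    return [segment for segment, score in ranked[:n] if score > threshold]

segments = [("a1", 0.42), ("a2", 0.91), ("a3", 0.67), ("a4", 0.15)]
targets = select_target_subparagraphs(segments, n=3, threshold=0.5)
```

Here the top three by score are a2, a3 and a1, and the threshold then drops a1, leaving a2 and a3 as target sub-paragraphs.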
13. An apparatus for content matching based on a matching degree, the apparatus comprising:
a processing unit configured to acquire original query content input by a user and perform format processing on the original query content to obtain formatted query content;
a first determining unit configured to determine a matching degree between the formatted query content and each of a plurality of text paragraphs in a text content library, and to determine text paragraphs whose matching degree is greater than a first matching-degree threshold as candidate paragraphs;
a second determining unit configured to select, in each candidate paragraph, an answer segment associated with the formatted query content, and to determine an answer-segment matching degree between the formatted query content and each answer segment;
a third determining unit configured to determine, based on the candidate-paragraph matching degree and the answer-segment matching degree, a matching degree between the formatted query content and the answer segment; and
a selecting unit configured to select, from a plurality of answer segments, at least one target sub-paragraph associated with the formatted query content based on the matching degree between the formatted query content and the answer segments.
14. A computer-readable storage medium, characterized in that the storage medium stores a computer program for performing the method of any of claims 1-12.
15. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-12.
CN202211074234.1A 2022-09-02 2022-09-02 Intelligent question-answering system for content matching based on matching degree Active CN115470332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211074234.1A CN115470332B (en) 2022-09-02 2022-09-02 Intelligent question-answering system for content matching based on matching degree

Publications (2)

Publication Number Publication Date
CN115470332A true CN115470332A (en) 2022-12-13
CN115470332B CN115470332B (en) 2023-03-31

Family

ID=84368655


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200210489A1 (en) * 2018-12-27 2020-07-02 International Business Machines Corporation Extended query performance prediction framework utilizing passage-level information
CN111460089A (en) * 2020-02-18 2020-07-28 北京邮电大学 Multi-paragraph reading understanding candidate answer sorting method and device
CN112163079A (en) * 2020-09-30 2021-01-01 民生科技有限责任公司 Intelligent conversation method and system based on reading understanding model
CN112417105A (en) * 2020-10-16 2021-02-26 泰康保险集团股份有限公司 Question and answer processing method and device, storage medium and electronic equipment
CN113449754A (en) * 2020-03-26 2021-09-28 百度在线网络技术(北京)有限公司 Method, device, equipment and medium for training and displaying matching model of label

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YOHAN KIM et al.: "Question answering method for infrastructure damage information retrieval from textual data using bidirectional encoder representations from transformers", Elsevier *
HUANG Yong: "Structural function recognition of academic texts: paragraph-based recognition", Journal of the China Society for Scientific and Technical Information *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant