CN113779218B

CN113779218B - Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium

Info

Publication number: CN113779218B
Application number: CN202111051968.3A
Authority: CN
Inventors: 朱前威; 谢春禾
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2021-09-08
Filing date: 2021-09-08
Publication date: 2023-10-27
Anticipated expiration: 2041-09-08
Also published as: CN113779218A

Abstract

The application relates to a question-answer pair construction method, a question-answer pair construction device, computer equipment and a storage medium. The method comprises the following steps: splitting a document into paragraphs; judging whether any of the split paragraphs contains both a title and text; if such a paragraph exists, splitting it so that the title and the text fall into different paragraphs; and constructing question-answer pairs in the document according to all paragraphs in the document. Because paragraphs in which a title and text coexist are split into different paragraphs, a title and text that share a paragraph can be identified and constructed as a question-answer pair, which widens the range of application. In addition, question-answer pairs can be constructed without missing content, which improves the accuracy of subsequent automatic answering.

Description

Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to a method and apparatus for constructing question-answer pairs, a computer device, and a storage medium.

Background

With the application of knowledge graphs and intelligent customer service in various industries, the use of information extraction technology to mine knowledge from documents has become a research hotspot. Automatically obtaining question-answer pairs from a document has long been a recognized difficulty. A question-answer pair consists of a question text and an answer text matched with the question text, and the document can be a product specification document or a regulation document. In practice, question-answer pairs are automatically acquired from a document mainly by extracting the titles in the document.

In the related art, a document is divided into paragraphs according to the line feed characteristics of the paragraphs in the document, the titles among the divided paragraphs are determined, the text paragraphs under each title are determined, and finally each title is used as the question text of a question-answer pair and the text paragraphs under that title are used as its answer text. However, a title and text may appear in the same paragraph, and this situation cannot be identified from line feed characteristics alone. As a result, a title and text in the same paragraph cannot be formed into a question-answer pair, the constructed question-answer pairs miss content, and the accuracy of subsequent automatic replies is affected.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a question-answer pair construction method, apparatus, computer device, and storage medium capable of improving the accuracy of construction of question-answer pairs.

A question-answer pair construction method comprises the following steps:

splitting a document into paragraphs;

judging whether a paragraph with a title and a text coexisting exists in the split paragraphs;

if there is a paragraph with the title and the text coexisting, dividing the paragraph with the title and the text into different paragraphs according to the title and the text respectively;

constructing question-answer pairs in the document according to all paragraphs in the document.

In one embodiment, splitting a document into paragraphs includes:

if the document is a text file, splitting the document into paragraphs according to paragraph identifiers in the document;

if the document is a text image, character recognition is carried out on the document, the position information of each character in the document is determined, the position information of each text line in the document is determined according to the position information of each character, and different text lines are combined according to the position information of each text line, so that paragraphs in the document are obtained.

In one embodiment, constructing question-answer pairs in a document from all paragraphs in the document includes:

selecting, from all the paragraphs, paragraphs meeting a first preset condition as candidate title paragraphs, wherein the first preset condition is used for measuring how likely a paragraph is to be a title paragraph;

and constructing question-answer pairs in the document according to all the candidate title paragraphs.

In one embodiment, the first preset condition includes at least one of the following conditions: the paragraph sentence length is a first preset threshold value, the total word number of the paragraphs is smaller than a second preset threshold value, the total punctuation number of the paragraphs is smaller than a third preset threshold value, and the paragraph format satisfies the preset format.
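A candidate-title filter of this kind can be sketched as follows. This is only an illustration under stated assumptions: the threshold values, the whitespace tokenization, the punctuation set, and the choice to require both sub-conditions are all hypothetical, and only the word-count and punctuation-count sub-conditions are shown.

```python
# Hypothetical values; the source only calls these thresholds "preset".
MAX_WORDS = 12        # stands in for the second preset threshold
MAX_PUNCTUATION = 3   # stands in for the third preset threshold

# Assumed punctuation set (ASCII plus common full-width marks).
PUNCTUATION = set(",.;:!?，。；：！？、")


def is_candidate_title(paragraph: str) -> bool:
    """Return True when the paragraph is short and lightly punctuated.

    Assumption: words are separated by whitespace. Text without spaces
    (for example Chinese) would need a word segmenter instead.
    """
    word_count = len(paragraph.split())
    punctuation_count = sum(ch in PUNCTUATION for ch in paragraph)
    return word_count < MAX_WORDS and punctuation_count < MAX_PUNCTUATION
```

Short, lightly punctuated paragraphs pass the filter, while long body paragraphs with many punctuation marks are rejected.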

In one embodiment, constructing question-answer pairs in a document from all candidate heading paragraphs includes:

determining a title paragraph from all candidate title paragraphs;

determining a hierarchical structure of the document according to the title paragraph and the frame template, wherein the hierarchical structure is used for representing the hierarchical relationship between hierarchical titles in the document;

according to the hierarchical structure, question-answer pairs in the document are constructed.

In one embodiment, determining a title paragraph from all candidate title paragraphs includes:

obtaining a semantic association score of each candidate title paragraph, wherein the semantic association score is used for representing the semantic association degree between text content in the candidate title paragraph and text content in a scope of the candidate title paragraph, and the scope is used for representing the paragraph range covered by the candidate title paragraph;

and selecting the candidate title paragraphs with semantic association scores meeting the second preset condition from all the candidate title paragraphs as title paragraphs.

In one embodiment, the semantic association score comprises at least one of the following scores, and the second preset condition is determined by the following sub-conditions;

the scores are: a word co-occurrence score, a paragraph semantic score, and a sentence semantic score; the sub-conditions are: the word co-occurrence score is greater than a fourth preset threshold, the paragraph semantic score is greater than a fifth preset threshold, and the sentence semantic score is greater than a sixth preset threshold; and the score types included in the semantic association score correspond to the sub-conditions included in the second preset condition.

In one embodiment, the semantic association scores comprise word co-occurrence scores; accordingly, obtaining the semantic association score for each candidate title paragraph includes:

for any candidate title paragraph, calculating the word segmentation similarity between each word segment in the candidate title paragraph and each word segment in the scope of the candidate title paragraph;

determining the word co-occurrence score of each word segment in the candidate title paragraph according to all the word segmentation similarities corresponding to that word segment;

and determining the word co-occurrence score of the candidate title paragraph according to the word co-occurrence score of each word segment in the candidate title paragraph.

In one embodiment, determining the word co-occurrence score of any candidate title paragraph according to the word co-occurrence score of each word segment in the candidate title paragraph includes:

selecting, from the candidate title paragraph, the word segments whose word co-occurrence scores are greater than a seventh preset threshold;

summing the word co-occurrence scores of the selected word segments to obtain a sum value;

and calculating the ratio between the sum value and the total word score in the candidate title paragraph, and taking the ratio as the word co-occurrence score of the candidate title paragraph.
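The word co-occurrence computation can be sketched as follows. Several points are assumptions rather than statements from the source: the similarity function `char_jaccard` is a hypothetical stand-in, the per-word score is taken as the best similarity against any word in the scope, the threshold value is arbitrary, and the denominator reads "total word score" as the total number of word segments in the candidate title paragraph.

```python
from typing import Callable, Sequence


def char_jaccard(a: str, b: str) -> float:
    """Hypothetical word similarity: Jaccard overlap of the two character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def word_cooccurrence_score(
    title_words: Sequence[str],
    scope_words: Sequence[str],
    threshold: float = 0.5,  # stands in for the seventh preset threshold
    similarity: Callable[[str, str], float] = char_jaccard,
) -> float:
    """Score how strongly the title's words recur in the title's scope."""
    if not title_words:
        return 0.0
    # Assumed aggregation: a word's score is its best match in the scope.
    per_word = [
        max((similarity(word, other) for other in scope_words), default=0.0)
        for word in title_words
    ]
    # Keep only the word scores above the threshold, then sum them.
    kept_sum = sum(score for score in per_word if score > threshold)
    # Assumed denominator: number of word segments in the title.
    return kept_sum / len(title_words)
```

With these assumptions, a title whose words mostly reappear in the paragraphs it covers scores close to 1, and an unrelated title scores close to 0.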

In one embodiment, the semantic association scores comprise paragraph semantic scores; accordingly, obtaining the semantic association score for each candidate title paragraph includes:

for any candidate title paragraph, calculating the paragraph similarity between the candidate title paragraph and each paragraph in the scope of the candidate title paragraph;

and determining the paragraph semantic score of the candidate title paragraph according to each paragraph similarity corresponding to the candidate title paragraph.

In one embodiment, the semantic association scores comprise sentence semantic scores; accordingly, obtaining the semantic association score for each candidate title paragraph includes:

for any candidate title paragraph, calculating the sentence similarity between each sentence in the candidate title paragraph and each sentence in the scope of the candidate title paragraph;

and determining the sentence semantic score of the candidate title paragraph according to the sentence similarities corresponding to each sentence in the candidate title paragraph.

and selecting, from the split paragraphs, paragraphs meeting a third preset condition as title paragraphs, wherein the third preset condition is set based on catalog title characteristics.

In one embodiment, before determining the hierarchical structure of the document according to the title paragraph and the frame template, the method further comprises:

determining the hierarchical title type corresponding to the title item in each title paragraph, and counting the total occurrence times of each hierarchical title type;

and screening out the title paragraphs corresponding to the hierarchical title types with the total times smaller than the eighth preset threshold value from all the title paragraphs.

In one embodiment, determining a hierarchical structure of a document from a title paragraph and a frame template includes:

taking all the title paragraphs as a title paragraph set, determining a frame template matched with the title paragraph set according to the title item of each title paragraph and the title item in each frame template, and taking the frame template as a target frame template;

filling the title item of each title paragraph into a matched target frame template to form a preliminary frame; wherein, the title items filled into the target frame template are used as the hierarchical titles in the preliminary frame;

from each preliminary frame, a hierarchical structure of the document is determined.

In one embodiment, before determining the hierarchical structure of the document according to each preliminary frame, the method further comprises:

and verifying the hierarchical titles in each preliminary frame based on a preset mode, wherein the preset mode comprises at least one of cutting, supplementing and correcting respectively.

In one embodiment, determining a hierarchical structure of a document from each preliminary frame includes:

determining a hierarchy corresponding to each hierarchy title in each preliminary frame;

determining a preliminary frame needing to be adjusted in level based on the overall level distribution of all the preliminary frames, and adjusting the level;

and generating a hierarchical structure of the document based on the overall frames obtained after all the preliminary frames are spliced.

A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:

splitting a document into paragraphs;

A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

splitting a document into paragraphs;

A computer program product comprising a computer program which, when executed by a processor, performs the steps of:

splitting a document into paragraphs;

According to the question-answer pair construction method, the question-answer pair construction device, the computer equipment and the storage medium, the document is split into paragraphs, and it is judged whether any of the split paragraphs contains both a title and text. If such a paragraph exists, it is split so that the title and the text fall into different paragraphs. Question-answer pairs in the document are then constructed according to all paragraphs in the document. Because paragraphs in which a title and text coexist are split into different paragraphs, a title and text that share a paragraph can be identified and constructed as a question-answer pair, which widens the range of application. In addition, question-answer pairs can be constructed without missing content, which improves the accuracy of subsequent automatic answering.

Drawings

FIG. 1 is a flow chart of a method for constructing question-answer pairs in one embodiment;

FIG. 2 is a schematic diagram of a title and text coexisting in the same paragraph in one embodiment;

FIG. 3 is a schematic diagram of a title and text coexisting in the same paragraph in one embodiment;

FIG. 4 is a schematic diagram of a title and text coexisting in the same paragraph in one embodiment;

FIG. 5 is a code schematic for character format information in a document in one embodiment;

FIG. 6 is a flow chart of a method for constructing question-answer pairs in another embodiment;

FIG. 7 is a schematic diagram of a digital hierarchy header in one embodiment;

FIG. 8 is a schematic diagram of a digital hierarchy header in another embodiment;

FIG. 9 is a schematic diagram of a chapter level title in one embodiment;

FIG. 10 is a schematic diagram of an ordinal hierarchy header in one embodiment;

FIG. 11 is a schematic diagram of a frame template in one embodiment;

FIG. 12 is a schematic view of a frame template in yet another embodiment;

FIG. 13 is a schematic view of a frame template in another embodiment;

FIG. 14 is a flow chart of a method of constructing question-answer pairs in yet another embodiment;

FIG. 15 is a schematic diagram of the filling of a title paragraph into a target frame template in one embodiment;

FIG. 16 is a schematic diagram of a hierarchy in one embodiment;

FIG. 17 is a flow diagram of a title paragraph determination process in one embodiment;

FIG. 18 is a schematic diagram of scope ranges in one embodiment;

FIG. 19 is a flow diagram of a process for word co-occurrence score computation in one embodiment;

FIG. 20 is a flow chart of a word co-occurrence score calculation process in yet another embodiment;

FIG. 21 is a flow diagram of a paragraph semantic score calculation process in one embodiment;

FIG. 22 is a flow diagram of a process for semantic score computation of a sentence in one embodiment;

FIG. 23 is a flow diagram of a further screening process for all title paragraphs in one embodiment;

FIG. 24 is a schematic diagram of a title paragraph screened in one embodiment;

FIG. 25 is a flow diagram of a process for determining a hierarchy in one embodiment;

FIG. 26 is a schematic view of a frame template in yet another embodiment;

FIG. 27 is a schematic view of a frame template in yet another embodiment;

FIG. 28 is a schematic diagram of filling a title paragraph into a target frame template in another embodiment;

FIG. 29 is a schematic diagram of a supplemented preliminary framework in one embodiment;

FIG. 30 is a schematic diagram of a preliminary frame after cropping in one embodiment;

FIG. 31 is a schematic view of a preliminary frame after cropping in another embodiment;

FIG. 32 is a flow diagram of a process for determining a hierarchical structure of a document in one embodiment;

FIG. 33 is a schematic illustration of a hierarchical annotation in one embodiment;

FIG. 34 is a schematic illustration of a hierarchical annotation in another embodiment;

FIG. 35 is a schematic diagram of a level adjustment in one embodiment;

FIG. 36 is a schematic view of an overall frame in one embodiment;

FIG. 37 is a block diagram showing the construction of a question-answer pair construction device in one embodiment;

FIG. 38 is an internal structural view of the computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various terms, but are not limited by these terms unless otherwise specified. These terms are only used to distinguish one term from another. For example, the third and fourth preset thresholds may be the same or different without departing from the scope of the application.

With the application of knowledge graphs and intelligent customer service in various industries, the use of information extraction technology to mine knowledge from documents has become a research hotspot. Automatically obtaining question-answer pairs from a document has long been a recognized difficulty. A question-answer pair consists of a question text and an answer text matched with the question text, and the document can be a product specification document or a regulation document. In practice, question-answer pairs are automatically acquired from a document mainly by extracting the titles at different levels of the document. This extraction involves two key tasks. The first is obtaining the title levels from the document; before the levels (that is, how many levels of catalog there are) can be obtained, the title content at all levels of the document must also be obtained. The second is analyzing the document structure by using the title levels, so as to obtain the attribute-attribute value pairs of the document. Here the attribute is a hierarchical title and the attribute value is the text content corresponding to that hierarchical title, so the attribute-attribute value pairs are taken as question-answer pairs.

From the above, title level extraction and hierarchical analysis of the document structure are the key and difficult points of the whole process. In the related art, a document is divided into paragraphs according to the line feed characteristics of the paragraphs in the document, the titles among the divided paragraphs are determined, the text paragraphs under each title are determined, and finally each title is used as the question text of a question-answer pair and the text paragraphs under that title are used as its answer text. However, a title and text may appear in the same paragraph, and this situation cannot be identified from line feed characteristics alone. As a result, a title and text in the same paragraph cannot be formed into a question-answer pair, the constructed question-answer pairs miss content, and the accuracy of subsequent automatic replies is affected.

To address the problems in the related art, the embodiment of the application provides a method for constructing question-answer pairs. The method can be applied to question-answer pair construction scenarios. The execution body of the method may be a local server or a cloud server with a data processing function, which is not particularly limited in the embodiment of the present application. It should be noted that, in each embodiment of the present application, "a plurality of" and similar expressions mean "at least two". Referring to fig. 1, the method includes:

101. splitting a document into paragraphs;

102. judging whether a paragraph with a title and a text coexisting exists in the split paragraphs;

103. if there is a paragraph with the title and the text coexisting, dividing the paragraph with the title and the text into different paragraphs according to the title and the text respectively;

104. according to all paragraphs in the document, question-answer pairs in the document are constructed.

In step 101, the document serves as the content source of the question-answer pairs, and may specifically be a product specification document or a regulation document, which is not limited in particular in the embodiment of the present application. Documents are typically composed of paragraphs that follow a preset page layout format; for example, the beginning of each paragraph is typically indented, and the last line of each paragraph typically ends with a period, with no content following the period on that line. Thus, by detecting whether a text region conforming to the preset page layout format exists in the document, such a text region can be taken as an identified paragraph and stored as a basic unit.

The main idea of constructing question-answer pairs in the embodiment of the invention is to identify the titles in the document and the text under each title, so the titles and the text in the document need to be distinguished as clearly as possible. Thus, step 102 mainly identifies paragraphs in which a title and text coexist; examples of a title and text coexisting in the same paragraph are shown in fig. 2, fig. 3 and fig. 4. In fig. 2, "fifth item [settlement funds independence and bankruptcy isolation]" is distinct from "financial product sales settlement funds belong to financial product investors …": the former is clearly the title of the latter, and the latter is the text of the former. In fig. 3, "five (portfolio lifetime management)" is the title, and content such as "commercial bank, bank financing subsidiaries each …" is the text. In fig. 4, "5. Create dominant industry cluster" is the title, and content such as "fully implement production base standardization, process marketing …" is the text.

In fig. 2, fig. 3 and fig. 4, the title and the text clearly coexist in the same paragraph, but their formats still differ. In fig. 2, the title is enclosed in brackets. In fig. 3, the title is enclosed in brackets and set in bold. In fig. 4, the title is set in bold and separated from the text by a period. Based on the practical situations listed above, for any paragraph obtained by splitting, the embodiment of the present invention does not specifically limit the manner of identifying whether a title and text coexist in the paragraph. One such manner is to judge whether the paragraph conforms to a preset segmentation rule and, if so, to determine that the paragraph is one in which a title and text coexist. The preset segmentation rule may check, for example, whether parentheses "()" or square brackets "[]" appear, whether a bold font appears, and whether a digit appears followed by a pause mark, which is not limited in the embodiment of the present invention.

In step 103, for a paragraph in which a title and text coexist, the title and the text may be split into different paragraphs. Specifically, for a paragraph that conforms to the preset segmentation rule, it may be determined which content in the paragraph conforms to the rule, and the title in the paragraph is determined on that basis. For example, if brackets and a bold font are recognized in the paragraph based on the preset segmentation rule, as shown in fig. 2, it may be determined that the two brackets, the content enclosed between them, and the bold "fifth item" together form the title, and that the remaining content of the paragraph is the text; the title and the text are then taken as two separate paragraphs. Specifically, in connection with fig. 2, "fifth item [settlement funds independence and bankruptcy isolation]" may be taken as one paragraph, and "financial product sales settlement funds belong to financial product investors …" and the following content may be taken as another paragraph.
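Steps 102 and 103 can be sketched with a few regular expressions. The rules below are illustrative assumptions, not the rule set of the application: they cover a bracketed title, a parenthesized title, and a numbered title ending with a period, and they cannot see bold fonts because they work on plain text only. The names `SPLIT_RULES` and `split_title_and_text` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Illustrative segmentation rules (assumptions). Each rule captures a
# leading title and the body text that follows it in the same paragraph.
SPLIT_RULES = [
    # Short prefix plus a square-bracketed title, e.g. "fifth item [ ... ] body"
    re.compile(r"^(?P<title>[^\[\]]{0,12}\[[^\]]+\])\s*(?P<body>\S.*)$", re.S),
    # Short prefix plus a parenthesized title, e.g. "five ( ... ) body"
    re.compile(r"^(?P<title>[^()]{0,12}\([^)]+\))\s*(?P<body>\S.*)$", re.S),
    # Numbered title closed by a period, e.g. "5. Title. body"
    re.compile(r"^(?P<title>\d+[.、]\s*[^.。]+[.。])\s*(?P<body>\S.*)$", re.S),
]


def split_title_and_text(paragraph: str) -> Optional[Tuple[str, str]]:
    """Return (title, text) if a rule matches, otherwise None."""
    stripped = paragraph.strip()
    for rule in SPLIT_RULES:
        match = rule.match(stripped)
        if match:
            return match.group("title").strip(), match.group("body").strip()
    return None
```

A paragraph for which the function returns `None` is treated as not containing a coexisting title and text and is left unsplit.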

Steps 101 to 103 mainly preprocess the document and serve two purposes: first, splitting the content of the document into paragraph units while preserving context; second, splitting the paragraphs in which a title and text coexist. After all of the paragraphs in the document are obtained, the title paragraphs among them may be further identified in preparation for the subsequent construction of question-answer pairs.

In step 103, a paragraph in which a title and text coexist is split into a title paragraph and a text paragraph. The title paragraphs obtained by this splitting are therefore one source of the title paragraphs among all paragraphs. In addition, some title paragraphs stand alone from the outset rather than being obtained by splitting a paragraph in which a title and text coexist. Thus, after the title paragraphs determined above are set aside, the remaining paragraphs can be examined to identify which of them are title paragraphs. Specifically, the similarity between each paragraph and an existing title template can be calculated, and a paragraph whose similarity is greater than a preset threshold is taken as a title paragraph. After the title paragraphs among all paragraphs are determined, question-answer pairs in the document may be constructed. Specifically, two adjacent title paragraphs may be recorded as the previous title paragraph and the next title paragraph according to their order.

Because paragraphs are continuous, the text paragraphs subordinate to a title paragraph typically follow it in order. Thus, the text paragraphs between two adjacent title paragraphs can be regarded as subordinate to the previous title paragraph. After the text paragraphs following each title paragraph are determined, the title paragraph may be taken as the question in a question-answer pair and the text paragraphs following it as the answer, so that the question-answer pairs in the document can be constructed.
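The pairing of each title paragraph with the text paragraphs that follow it can be sketched as below. The input representation (a list of `(text, is_title)` tuples in document order) and the decision to skip titles with no following text are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def build_qa_pairs(paragraphs: Iterable[Tuple[str, bool]]) -> List[Tuple[str, str]]:
    """Pair every title paragraph with the text paragraphs up to the next title.

    `paragraphs` holds (text, is_title) tuples in document order.
    """
    pairs: List[Tuple[str, str]] = []
    question: Optional[str] = None
    answer_parts: List[str] = []
    for text, is_title in paragraphs:
        if is_title:
            # A new title closes the scope of the previous one.
            if question is not None and answer_parts:
                pairs.append((question, "\n".join(answer_parts)))
            question, answer_parts = text, []
        elif question is not None:
            answer_parts.append(text)
    # Flush the last title and its text paragraphs.
    if question is not None and answer_parts:
        pairs.append((question, "\n".join(answer_parts)))
    return pairs
```

Text that appears before the first title has no question to attach to and is ignored in this sketch.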

According to the method provided by the embodiment of the invention, the document is split into paragraphs, and it is judged whether any of the split paragraphs contains both a title and text. If such a paragraph exists, it is split so that the title and the text fall into different paragraphs. Question-answer pairs in the document are then constructed according to all paragraphs in the document. Because paragraphs in which a title and text coexist are split into different paragraphs, a title and text that share a paragraph can be identified and constructed as a question-answer pair, which widens the range of application. In addition, question-answer pairs can be constructed without missing content, which improves the accuracy of subsequent automatic answering.

Documents typically exist in different file types, such as text files or text images, which may have different splitting processes when split into paragraphs. In combination with the foregoing embodiments, in one embodiment, the manner of splitting a document into paragraphs is not specifically limited by the embodiment of the present invention, including but not limited to the following two ways:

(1) The first way is: if the document is a text file, the document is split into paragraphs according to paragraph identifiers in the document.

The text file is usually a Word document (doc, docx) or a text-based PDF. Such semi-structured documents can be read directly with tools such as Python XML libraries, the PyMuPDF library, or Java Apache packages. That is, the text content in the text file may be read directly. The readable content includes the characters, tables, pictures and the like in the text file. Since characters can be read, paragraph identifiers such as indents or line breaks can also be read, so the document can be split into paragraphs based on the paragraph identifiers. Specifically, taking the line break as the paragraph identifier for example, since each paragraph usually ends with a line break, the text content between two line breaks can be regarded as one paragraph.
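Splitting on line breaks can be sketched in a few lines. The sketch assumes the document text has already been extracted into a single string; the function name `split_into_paragraphs` is hypothetical, and empty lines are dropped as an illustrative choice.

```python
from typing import List


def split_into_paragraphs(text: str) -> List[str]:
    """Treat the content between two line breaks as one paragraph."""
    # str.splitlines() handles "\n", "\r\n" and other line boundaries.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Note that stripping whitespace discards the indentation, so a caller that wants to use indents as an additional paragraph identifier would need to keep the raw lines.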

It should be noted that the characters in a title paragraph generally have a certain format, such as a larger or bold font, and this format can serve as a reference for identifying title paragraphs when the text file is read. The format information read can therefore be retained during reading for the further identification of title paragraphs. The code storing the format information is shown in fig. 5. In fig. 5, "section: [24, 41]" represents the coordinate information of a character, "font: song-like" represents the font information of the character, "size:16" indicates the font size of the character, and "bold" indicates that the character is bold.

(2) The second way is: if the document is a text image, character recognition is carried out on the document, the position information of each character in the document is determined, the position information of each text line in the document is determined according to the position information of each character, and different text lines are combined according to the position information of each text line, so that paragraphs in the document are obtained.

Unlike the semi-structured documents described above, a text image cannot be read directly. The characters in the text image may therefore be read by means of character recognition, such as OCR (Optical Character Recognition), and the paragraphs in the document are determined by combining the recognized characters. OCR can recognize not only the characters in a text image but also the position information of each character.

Specifically, paragraphs typically follow a paragraph format; for example, the beginning of each paragraph is typically indented, so if a text line is the first line of a paragraph, the beginning of that text line is also indented. Whether there is an indentation at the beginning of a text line can be determined from the position information of the text line, which can be derived from the position information of the characters in the text line. Therefore, the position information of each text line in the document can be obtained first, and the text lines belonging to the same paragraph can then be determined according to the position information of each text line.

To make the position information of the characters concrete, a coordinate system may be established on the text image. Characters sharing the same ordinate may be determined to be characters in the same text line. On this principle, each text line in the text image, and the position information of each text line, can be determined from the position information of each character in the text image. Each text line may be represented by a box surrounding the text content of that line, so the position information of each text line can be represented by its left, upper, lower, and right boundary coordinates.
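
The grouping of characters into text lines described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the character record layout (`text`, `left`, `top`, `right`, `bottom`), the function name `group_chars_into_lines`, and the tolerance `y_tolerance` are hypothetical, and the ordinate is assumed to increase down the page.

```python
def group_chars_into_lines(chars, y_tolerance=2):
    """Group OCR characters into text lines by their ordinate.

    chars: list of dicts with keys text, left, top, right, bottom
    (a hypothetical layout for the per-character position information).
    """
    lines = []
    # Characters whose top ordinate lies within y_tolerance of the line's
    # first character are treated as sharing the same ordinate.
    for ch in sorted(chars, key=lambda c: c["top"]):
        if lines and abs(ch["top"] - lines[-1]["top"]) <= y_tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"top": ch["top"], "chars": [ch]})

    result = []
    for line in lines:
        members = sorted(line["chars"], key=lambda c: c["left"])
        # The text line is represented by the box surrounding its characters.
        result.append({
            "text": "".join(c["text"] for c in members),
            "left": min(c["left"] for c in members),
            "top": min(c["top"] for c in members),
            "right": max(c["right"] for c in members),
            "bottom": max(c["bottom"] for c in members),
        })
    return result
```

The tolerance is there only because recognized characters on one line rarely share exactly the same ordinate; a value of 0 reproduces the strict "same ordinate" rule.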

After the position information of each text line is determined, it may then be determined which text lines belong to the same paragraph. As noted above, a paragraph usually follows a paragraph format, and the position information of the text lines indicates which text lines can be aggregated to fit that format, and hence which text lines form a paragraph. For a given page, the maximum text line length in the page can be determined from the left and right boundary coordinates of the text lines in the page. According to the paragraph format, every text line in a paragraph other than the first and last usually has the maximum length, whereas the first line is usually shorter because of its indent and the last line is usually shorter because its text does not fill the entire line.

Based on the above principle, it can be preliminarily determined from the lengths of the text lines which text lines may be the first or last line of a paragraph. Then, since the beginning of a paragraph is usually indented, i.e. the first line of a paragraph usually starts at some distance from the left boundary of the page, such as one indent width, the left boundary coordinates of the text lines can be used to further determine which of the preliminarily selected text lines may be the first line of a paragraph. Since the last line of a paragraph usually ends with a period, checking the last character of a text line can further determine which of the preliminarily selected text lines may be the last line of a paragraph.

Based on the coordinate system established on the text image, the text lines can be ordered by their ordinates. Once it has been determined which text lines may be the first and last lines of paragraphs, and the order of the text lines is known, the text lines can be traversed one by one in that order. If a text line marked as a "first line" is reached, the traversal continues until a text line marked as a "last line" is reached. The "first line", the "last line", and the text lines between them may then be considered one paragraph. After a paragraph is determined, the traversal continues from the "last line" and the process is repeated until all paragraphs in the document are determined.
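
The first-line and last-line heuristics and the traversal can be sketched as follows. The sketch assumes text lines are given as dicts with `text`, `left`, `right`, and `top`. The parameters `indent` and `tol`, and the handling of lines that fall outside any first/last pair (emitted as single-line paragraphs) or of a paragraph that is never closed (closed when the next "first line" appears), are assumptions of this illustration that the description does not specify.

```python
def merge_lines_into_paragraphs(lines, indent=20, tol=2):
    """Merge text lines into paragraphs using first-line/last-line marks."""
    if not lines:
        return []
    lines = sorted(lines, key=lambda l: l["top"])            # order by ordinate
    page_left = min(l["left"] for l in lines)
    max_len = max(l["right"] - l["left"] for l in lines)     # maximum line length

    def is_short(l):
        return (l["right"] - l["left"]) < max_len - tol

    def is_first(l):
        # Shorter than the maximum length and indented from the left boundary.
        return is_short(l) and l["left"] - page_left >= indent

    def is_last(l):
        # Shorter than the maximum length and ending with a period.
        return is_short(l) and l["text"].rstrip().endswith(("。", "."))

    paragraphs, current = [], None
    for line in lines:
        if is_first(line):
            if current:                    # assumption: close an unterminated paragraph
                paragraphs.append(current)
            current = [line]
        elif current is not None:
            current.append(line)
        else:
            paragraphs.append([line])      # assumption: stray line stands alone
            continue
        if is_last(line):
            paragraphs.append(current)
            current = None
    if current:
        paragraphs.append(current)
    return ["".join(l["text"] for l in p) for p in paragraphs]
```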

It should be noted that, as in the first way, the format information of the characters, such as an enlarged or bold font, may also be retained in the second way for later identification of title paragraphs. In either way, the document may in practice contain images and tables. So that these images and tables can each be treated as a paragraph in their own right, an image can be converted into binary form and encoded in a preset encoding format, such as base64, which converts the image into a character string and allows the character string to be converted back into the image later. Converting an image into a character string makes it possible to process the image as a normal paragraph.
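
As a minimal sketch of the image conversion, the following uses Python's standard `base64` module to turn image bytes into a character string and back. The function names `image_to_paragraph` and `paragraph_to_image` are hypothetical.

```python
import base64


def image_to_paragraph(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 character string."""
    return base64.b64encode(image_bytes).decode("ascii")


def paragraph_to_image(text: str) -> bytes:
    """Convert the character string back into the original image bytes."""
    return base64.b64decode(text)
```

The round trip is lossless, so the string can stand in for the image wherever paragraphs are handled and be restored when the image itself is needed.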

For a table, since a table is in effect text content arranged in an array, and the position of each piece of text content in the array can be expressed by coordinates, the table can be processed into the form "coordinates + value". For example, {x: 0, y: 0, value: hello} indicates that the value in the first row and first column of the table is "hello". Since a table can be expressed in this form, its contents can be converted and aggregated accordingly, and the aggregated whole can be handled as a normal paragraph.
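
The "coordinates + value" form can be illustrated with a short sketch. It assumes that `x` is the column index and `y` the row index, both starting at 0, and that the aggregated whole is serialized as a JSON string. Neither choice is fixed by the description, and the function names are hypothetical.

```python
import json


def table_to_cells(rows):
    """Flatten a table (list of rows) into coordinate + value records."""
    return [
        {"x": x, "y": y, "value": value}
        for y, row in enumerate(rows)
        for x, value in enumerate(row)
    ]


def table_to_paragraph(rows):
    """Aggregate all cells into one string that can be handled as a paragraph."""
    return json.dumps(table_to_cells(rows), ensure_ascii=False)
```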

According to the method provided by the embodiment of the invention, since the document can be split into paragraphs in a way adapted to each document type, the application range of the method is broadened.

As can be seen from the above embodiments, when a document is split into paragraphs, the format information of the characters can be retained for later identification of title paragraphs. Whether a paragraph is likely to be a title may be determined from the format of the paragraph, which is in turn determined by the format of the characters in the paragraph. Thus, after the document is split into paragraphs and before question-answer pairs are constructed from all paragraphs in the document, all paragraphs may be screened according to the formats a title paragraph may have. Based on this, in combination with the above embodiments, in one embodiment, referring to fig. 6, the embodiment of the present invention does not specifically limit the manner of constructing question-answer pairs in a document from all paragraphs in the document, which includes but is not limited to:

601. Selecting, from all the paragraphs, paragraphs meeting a first preset condition as candidate title paragraphs, wherein the first preset condition is used for measuring the likelihood that a paragraph is a title paragraph;

602. Constructing question-answer pairs in the document according to all the candidate title paragraphs.

In step 601, the first preset condition is used to measure the likelihood that a paragraph is a title paragraph. That is, a paragraph that meets the first preset condition is more likely to be a title paragraph. The first preset condition is set with two main goals: first, no real title paragraph should be missed after all paragraphs are screened by the first preset condition; second, as few candidate title paragraphs as possible should be selected, i.e. text paragraphs should be excluded as far as possible. The first preset condition may be set according to the formats a title paragraph may have, for example a bold or enlarged font, which is not specifically limited in the embodiment of the present invention.

After the candidate title paragraphs are screened out in step 601, the remaining paragraphs may be taken as text paragraphs in step 602. Once the title paragraphs and text paragraphs among all paragraphs, and the order of the paragraphs, have been determined, the question-answer pairs in the document can be constructed based on the method provided in the embodiment corresponding to fig. 1, which is not described again here.

According to the method provided by the embodiment of the invention, the paragraphs meeting the first preset condition are screened out from all the paragraphs and used as candidate title paragraphs. And constructing question-answer pairs in the document according to all the candidate title paragraphs. Since the first preset condition can be set based on the possible format of the title paragraph and the candidate title paragraph is screened out by the first preset condition, the candidate title paragraph can be screened out as accurately as possible.

As can be seen from the foregoing embodiments, the first preset condition may be set according to a possible format of the title paragraph. Based on this, in combination with the content of the above embodiments, in one embodiment, the first preset condition includes at least one of the following conditions: the paragraph sentence length is a first preset threshold value, the total word number of the paragraphs is smaller than a second preset threshold value, the total punctuation number of the paragraphs is smaller than a third preset threshold value, and the paragraph format satisfies the preset format.

Specifically, the paragraph sentence length is used as one item of the first preset condition because a title paragraph is usually short, such as a single sentence (i.e. the paragraph contains only one period or no period, so its sentence length is 1). Whether a paragraph is a candidate title paragraph can therefore be determined from its sentence length (i.e. the number of sentences in the paragraph). The first preset threshold may be 1, which is not specifically limited in the embodiment of the present invention.

The total word count of a paragraph is used as one item of the first preset condition because a title paragraph generally serves as an outline, so its total word count is not large. Whether a paragraph is a candidate title paragraph can therefore be determined by judging whether its total word count is smaller than a second preset threshold. The second preset threshold may be 40, which is not specifically limited in the embodiment of the present invention. The value 40 is chosen because a line in a page that is filled with characters of a common font size (e.g. size-5 characters) generally holds 40 characters.

The total punctuation count of a paragraph is used as one item of the first preset condition because a title paragraph usually contains few or no punctuation marks, so its total punctuation count is not large. Whether a paragraph is a candidate title paragraph can therefore be determined by judging whether its total punctuation count is smaller than a third preset threshold. The third preset threshold may be 3, which is not specifically limited in the embodiment of the present invention.

The paragraph format is checked against the preset format because title paragraphs usually have certain fixed formats, such as a centered paragraph, a fully bold font, or an enlarged font. These formats can be summarized as the preset format used to judge which paragraphs have the format of a title paragraph, so that paragraphs whose format satisfies the preset format can be taken as candidate title paragraphs. It should be noted that the four items given above for the first preset condition are set only on the basis of the known possible formats of title paragraphs, and may be set according to actual requirements in practice, which is not specifically limited in the embodiment of the present invention.
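
A candidate screen built from these four items might look like the sketch below. It uses the example thresholds from the text (1, 40, and 3). Everything else is an assumption of the illustration: the `Paragraph` record, counting non-whitespace characters as the word count, counting sentences by the marks in `SENTENCE_END` (the ASCII period is left out because it occurs in title items such as "1.1"), requiring all text checks to hold together, and making the format check optional through `require_format`.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    bold: bool = False
    centered: bool = False
    font_size: float = 10.5   # hypothetical body font size


SENTENCE_END = "。！？!?"       # assumed sentence terminators


def sentence_count(text):
    # One period or no period both count as a sentence length of 1.
    return max(1, sum(text.count(ch) for ch in SENTENCE_END))


def punctuation_count(text):
    # Unicode general categories starting with "P" are punctuation.
    return sum(1 for ch in text if unicodedata.category(ch).startswith("P"))


def is_candidate_title(p, sentence_len=1, max_chars=40, max_punct=3,
                       body_font_size=10.5, require_format=False):
    char_total = sum(1 for ch in p.text if not ch.isspace())
    text_ok = (
        sentence_count(p.text) == sentence_len
        and char_total < max_chars
        and punctuation_count(p.text) < max_punct
    )
    format_ok = p.bold or p.centered or p.font_size > body_font_size
    return text_ok and (format_ok or not require_format)
```

Because the first preset condition may include any subset of the four items, a real system would likely expose each check as a separate switch; the single `require_format` flag only hints at that.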

The types of hierarchical titles in a document can be summarized and classified into three types, namely, digital type, chapter type and ordinal type. The digital type mainly means that the title item contains explicit numbers, and the numbers increase iteratively. Fig. 7 and fig. 8 show two specific forms of digital-type hierarchical titles. It can be seen from the figures that even digital-type hierarchical titles can be further divided into different forms, such as "digital, digital" or "digital, digital".

If a document contains chapter-type hierarchical titles, the document typically has a distinct hierarchical structure. The chapter type mainly means that the hierarchy of the document is built from chapters, under which paragraph levels may be further divided by digital type. Fig. 9 shows a specific form of chapter-type hierarchical title. The regulation-type hierarchical title is commonly found in legal regulations and government documents, and paragraph levels under a regulation are further divided by digital type. Fig. 10 shows a specific form of regulation-type hierarchical title.

Based on the three types of hierarchical titles described above, different types of framework templates may be constructed. Examples of the frame template are shown in fig. 11, 12 and 13. In fig. 11, "first chapter XXX" is a title paragraph, and "first chapter" therein is a title item as a chapter-type hierarchical title. "1.1XXX" is also a title paragraph, and "1.1" therein is a title item, which is a digital hierarchy title. In fig. 12, "first section XXX" is a title paragraph, and "first section" therein is a title item as a chapter-type hierarchical title. In fig. 13, there is no hierarchical title, and "XXX" is text. That is, a title paragraph may be composed of title items and/or title text, and the title items are in the title frame, i.e., as hierarchical titles. For example, "first chapter" in the title paragraph "first chapter XXX" is a title item, which is a chapter-type hierarchical title as a hierarchical title, and "XXX" in "first chapter XXX" is a title body.

Based on the foregoing, in conjunction with the foregoing embodiments, in one embodiment, referring to fig. 14, embodiments of the present invention do not specifically limit the manner in which question-answer pairs in a document are constructed from all candidate title paragraphs, including, but not limited to:

1401. Determining a title paragraph from all candidate title paragraphs;

1402. Determining a hierarchical structure of the document according to the title paragraph and the frame template, wherein the hierarchical structure is used for representing the hierarchical relationship between hierarchical titles in the document;

1403. Constructing question-answer pairs in the document according to the hierarchical structure.

In step 1401 described above, the candidate title paragraph may be directly taken as the title paragraph. Of course, the candidate title paragraphs may be further screened, for example, by manual screening, and the candidate title paragraphs remaining after screening are used as title paragraphs, which is not limited in detail in the embodiment of the present invention. Wherein the number of title paragraphs may be plural.

In step 1402 described above, a set of heading paragraphs may be first composed of all heading paragraphs. The embodiment of the invention does not specifically limit the manner of determining the hierarchical structure of the document according to the title paragraph and the frame template, and comprises but is not limited to: determining a frame template which is most matched with the title paragraph set, and taking the frame template as a target frame template; filling the title items in the title paragraphs in the title paragraph set into a target frame template to form a preliminary frame; and filling other paragraphs except the title paragraph in the split paragraphs into a preliminary frame to obtain a hierarchical structure of the document.

Specifically, a first set may be formed from the title items of the title paragraphs, and a corresponding second set is determined for each frame template: for a given frame template, the second set contains the title items included in that frame template. The intersection between the first set and each second set is determined, and the frame template whose intersection is largest is taken as the frame template that best matches the title paragraph set, i.e. the target frame template. After the target frame template is determined, since the title items of the title paragraphs and the title items contained in the target frame template are both known, the two can be compared. For each title paragraph whose title item matches a title item in the target frame template, the title item of that title paragraph may be filled into the target frame template.
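
The template selection and filling can be sketched with plain set operations. In this illustration a frame template is assumed to be a flat list of title items keyed by a template name; the tree shape of a real template, the function names, and the sample title items are hypothetical. When two templates tie on intersection size, `max` returns the first one encountered, which is a property of the sketch and not a rule from the description.

```python
def pick_target_template(title_items, templates):
    """Return the name of the template whose title items overlap most.

    title_items: title items found in the title paragraph set (first set).
    templates:   dict mapping template name -> list of title items (second sets).
    """
    first_set = set(title_items)
    return max(templates, key=lambda name: len(first_set & set(templates[name])))


def fill_template(title_items, template_items):
    """Mark which positions of the target template are filled by a matching title item."""
    first_set = set(title_items)
    return {item: item in first_set for item in template_items}
```

For example, with `templates = {"chapter": ["Chapter 1", "1.1", "1.2", "Chapter 2", "2.1"], "section": ["Section 1", "Section 2"]}` and title items `["Chapter 1", "1.1", "Chapter 2", "Appendix"]`, the "chapter" template is selected, and "Appendix", which matches no template item, is simply not filled in.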

It should be noted that the main purpose of filling the target frame template is to represent the hierarchical relationship between titles. To simplify the filling result, only the title items of the title paragraphs whose title items match may therefore be filled into the target frame template. The remaining title paragraphs need not be filled in, since they are usually titles with no obvious hierarchy, or titles for which building a hierarchical structure has little value.

For ease of understanding, the filling process is now explained. The preliminary frame formed by the filling process is shown in fig. 15. In fig. 15, the tree structure is the target frame template. The box containing "1.1" originally represents a hierarchical title in the target frame template. If the title paragraph set contains a title paragraph whose title item is "1.1", that title paragraph is considered a title paragraph with a matching title item, so the title item "1.1" can be filled into the target frame template. Filled positions may be marked: as shown in fig. 15, the "1.1" in that box is color-marked, indicating that it was filled with the title item of a title paragraph, and the other color-marked areas in fig. 15 were likewise filled with title items of title paragraphs.

Through the above process, a preliminary frame can be obtained. As can be seen from the foregoing embodiments, two adjacent title paragraphs may be denoted, in ranking order, the previous title paragraph and the next title paragraph. Since the text paragraphs subordinate to a title paragraph generally follow it, the text paragraphs between the two title paragraphs are subordinate to the previous title paragraph.

Thus, after the preliminary frame is determined, the paragraphs remaining after the title paragraphs are removed from the split paragraphs can be considered text paragraphs, and the title paragraph to which each belongs can be determined as described above. The hierarchical structure of the document is then obtained by placing each of these paragraphs under the title paragraph to which it belongs. The hierarchical structure characterizes not only the hierarchical relationship between the titles in the document but also the hierarchical title to which each text paragraph belongs, as shown in fig. 16. In fig. 16, "title 1", "title 2" and "title 6" are hierarchical titles, and "paragraph 4", "paragraph 5" and "paragraph 7" are text paragraphs. Both "paragraph 4" and "paragraph 5" are text paragraphs subordinate to "title 2", and "paragraph 7" is a text paragraph subordinate to "title 6". The indentation in fig. 16 represents the hierarchical relationship between hierarchical titles: "title 2" and "title 6" are at the same level, and "title 1" is at a higher level than both.

After the hierarchical structure is obtained through the above process, in step 1403 a hierarchical title in the hierarchical structure may be used as the question of a question-answer pair and the text paragraphs following that hierarchical title may be used as the answer, so that the question-answer pairs in the document can be constructed.
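
A minimal sketch of step 1403, assuming the document has already been reduced to an ordered list of `(text, is_title)` pairs: each title becomes a question, and the text paragraphs that follow it, up to the next title, become its answer. Skipping titles with no directly subordinate text and joining answer paragraphs with a newline are assumptions of the sketch. The sample names below echo those mentioned for fig. 16 but are not a reproduction of the figure.

```python
def build_qa_pairs(paragraphs):
    """Build (question, answer) pairs from ordered (text, is_title) tuples."""
    pairs, question, answer = [], None, []
    for text, is_title in paragraphs:
        if is_title:
            # A new title closes the previous title's answer, if it has one.
            if question is not None and answer:
                pairs.append((question, "\n".join(answer)))
            question, answer = text, []
        elif question is not None:
            # Text between two titles belongs to the previous title.
            answer.append(text)
    if question is not None and answer:
        pairs.append((question, "\n".join(answer)))
    return pairs


doc = [
    ("title 1", True),
    ("title 2", True),
    ("paragraph 4", False),
    ("paragraph 5", False),
    ("title 6", True),
    ("paragraph 7", False),
]
qa_pairs = build_qa_pairs(doc)
```

Here `qa_pairs` holds one pair for "title 2" (answer: "paragraph 4" and "paragraph 5") and one for "title 6" (answer: "paragraph 7").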

According to the method provided by the embodiment of the invention, the title paragraph is determined from all the candidate title paragraphs, and the hierarchical structure of the document is determined according to the title paragraph and the frame template. According to the hierarchical structure, question-answer pairs in the document are constructed. The framework template is obtained by summarizing the existing hierarchical structure between the title and the text in the document, so that the hierarchical structure of the document can be accurately determined based on the framework template, and further, the question-answer pairs in the subsequently constructed document are more accurate.

As can be seen from the above process, two key tasks are mainly involved in extracting titles from documents: firstly, acquiring a hierarchical title in a document; wherein, before obtaining the hierarchical title (namely, several levels of catalogues), the title content of all the levels in the document also needs to be obtained; secondly, analyzing the document structure by using the hierarchical title, so as to obtain attribute-attribute value pairs of the document; the attribute is a hierarchical title, and the attribute value is a text corresponding to the title, so that the attribute-attribute value pair is taken as a question-answer pair. From the above, the extraction of the hierarchical title and the hierarchical analysis of the document structure in the whole process are important and difficult.

In the related art, there are three parsing methods for extracting hierarchical titles in a document and parsing a document structure. The first analysis mode is mainly to establish a title rule and a hierarchy rule, traverse each paragraph in the document based on the title rule, determine the title according to the paragraph identification of the title rule, and determine the hierarchy of each title based on the hierarchy rule, thereby realizing the identification and association of the title and the hierarchy in the document.

The second analysis mode is mainly to match the text characteristics of each paragraph in the document to be processed with the paragraph characteristics in the predefined rule; under the condition that the feature matching is successful, determining paragraph levels of all paragraphs in the document to be processed according to a rule matching result; under the condition that rule matching fails, determining paragraph levels of all paragraphs in the document to be processed by using a machine learning model; a document title tree of the document to be processed is constructed based on the paragraph levels of the respective paragraphs.

The third analysis mode is mainly to divide the document to be structured into a plurality of single chapter documents according to a text structure recognition model; calculate the similarity between the chapter title of each chapter document and each template name in the structured template to obtain an adaptive template name; calculate the similarity between the elements corresponding to the adaptive template name and the sentences subordinate to the corresponding chapter title to obtain adaptive sentences; and fill the adaptive sentences of all the single chapter documents into the corresponding fillable areas of the structured template to obtain the structured document. The third parsing method needs to obtain the chapter titles in advance and is limited to parsing documents with a chapter format, so the method lacks generality.

In addition, all three parsing methods start by screening hierarchical titles out of the paragraphs of the document, i.e. extracting titles from the paragraphs, and then build the title hierarchy using rules, title features, or machine learning to obtain the hierarchical titles. When obtaining hierarchical titles, these methods consider only the content of the paragraph itself in judging how likely the paragraph is to be a title, and ignore the semantic link that usually exists between a title and the text that follows it. Ignoring this semantic link may make the determined title paragraphs insufficiently accurate, which in turn affects the accuracy of the question-answer pairs constructed subsequently. The semantic relationship should therefore also serve as a basis for judging whether a paragraph is a title, and the embodiment of the present invention may also take the semantic relationship between paragraphs into account when determining title paragraphs.

Based on the foregoing statements, in conjunction with the content of the foregoing embodiments, in one embodiment, referring to FIG. 17, embodiments of the present invention do not specifically limit the manner in which question-answer pairs in a document are constructed from all candidate heading paragraphs, including, but not limited to:

1701. Obtaining a semantic association score of each candidate title paragraph, wherein the semantic association score is used for representing the semantic association degree between text content in the candidate title paragraph and text content in a scope of the candidate title paragraph, and the scope is used for representing the paragraph range covered by the candidate title paragraph;

1702. And selecting the candidate title paragraphs with semantic association scores meeting the second preset condition from all the candidate title paragraphs as title paragraphs.

In step 1701, the scope is mainly used to preliminarily determine the paragraph range covered by a candidate title paragraph. It will be appreciated that the scope of a candidate title paragraph depends on the order of paragraphs in the document. Based on this, the embodiment of the present invention does not specifically limit the manner of determining the scope of a candidate title paragraph, which includes but is not limited to: for any candidate title paragraph, if it is not directly followed by another candidate title paragraph, taking the paragraphs between it and the next candidate title paragraph as its scope; if it is directly followed by another candidate title paragraph, taking that directly following candidate title paragraph, together with the paragraphs between the directly following candidate title paragraph and the candidate title paragraph after it, as its scope. Defining the scope in this way preliminarily determines the paragraph range covered by the candidate title paragraph, and thereby delimits the range over which the semantic association between the candidate title paragraph and its subsequent paragraphs is calculated.

For example, as shown in fig. 18, the scope of "candidate title paragraph 1" is "paragraph 1", the scope of "candidate title paragraph 2" is "candidate title paragraph 3", "paragraph 2" and "paragraph 3", and the scope of "candidate title paragraph 3" is "paragraph 2" and "paragraph 3".
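
The scope rule can be sketched as below. The paragraph order used in the example (candidate 1, paragraph 1, candidate 2, candidate 3, paragraph 2, paragraph 3) is inferred from the scopes stated for fig. 18 and is an assumption, as are the function name `scope_of` and the `(text, is_candidate)` representation.

```python
def scope_of(index, paragraphs):
    """Return the scope of the candidate title paragraph at `index`.

    paragraphs: ordered list of (text, is_candidate) tuples.
    """
    n = len(paragraphs)
    scope = []
    j = index + 1
    # A directly following candidate title paragraph is itself part of the scope.
    if j < n and paragraphs[j][1]:
        scope.append(paragraphs[j][0])
        j += 1
    # Then take the paragraphs up to (not including) the next candidate.
    while j < n and not paragraphs[j][1]:
        scope.append(paragraphs[j][0])
        j += 1
    return scope


doc = [
    ("candidate title paragraph 1", True),
    ("paragraph 1", False),
    ("candidate title paragraph 2", True),
    ("candidate title paragraph 3", True),
    ("paragraph 2", False),
    ("paragraph 3", False),
]
scopes = {text: scope_of(i, doc) for i, (text, is_cand) in enumerate(doc) if is_cand}
```

With this ordering, `scopes` matches the three scopes listed in the example above.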

After determining the scope of the candidate title paragraph, the semantic association score of the candidate title paragraph may be obtained. The embodiment of the invention does not specifically limit the manner of obtaining the semantic association scores of the candidate title paragraphs, and includes but is not limited to: for any candidate title paragraph, integrating the word vector of each word in the candidate title paragraph to obtain the word vector corresponding to the candidate title paragraph; integrating the word vectors of each word in the scope of the candidate title paragraph to obtain the word vector corresponding to the scope of the candidate title paragraph; and calculating the similarity between the word vector corresponding to the candidate title paragraph and the word vector corresponding to the scope of the candidate title paragraph, and taking the similarity as the semantic association score of the candidate title paragraph. The dimensions of the word vectors of each word segment may be the same, and the integration manner may be vector superposition, which is not particularly limited in the embodiment of the present invention.
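
One way to realize this calculation is sketched below, with vector superposition as the integration step and cosine similarity as the similarity measure. The description leaves the similarity measure open, so cosine similarity is an assumption, as are the toy `embeddings` table, the skipping of words without a vector, and the function names.

```python
import math


def sum_vectors(tokens, embeddings, dim):
    """Integrate word vectors by superposition (element-wise sum)."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:          # assumption: words without a vector are skipped
            continue
        for i, v in enumerate(vec):
            total[i] += v
    return total


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_association_score(title_tokens, scope_tokens, embeddings, dim):
    """Similarity between the integrated title vector and the integrated scope vector."""
    return cosine(
        sum_vectors(title_tokens, embeddings, dim),
        sum_vectors(scope_tokens, embeddings, dim),
    )
```

In practice the `embeddings` lookup would come from a trained word-vector model; a two-dimensional toy table is enough to exercise the arithmetic.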

It should be noted that, in practice, images and tables are treated as paragraphs and may fall within the scope of a candidate title paragraph. The above process calculates the semantic association score of a candidate title paragraph from the text content in its scope, so it must be decided how images and tables in the scope participate in the calculation. In the embodiment of the invention, if an image has been converted into a character string on a binary basis, the character string is in fact just an arrangement of pixels and carries no semantics. Thus, if the scope of a candidate title paragraph contains a paragraph corresponding to an image, that paragraph may be excluded from the calculation of the semantic association score. Of course, in practice the image may instead be converted into text that carries semantics. For example, an encoder-decoder model may be used to describe the image in words (image captioning), converting the image into text that carries semantics. The resulting text can then participate in the calculation of the semantic association score as an ordinary text paragraph.

For a table, since the table is processed into the "coordinates + value" form and converted into a character string, the character string can retain the semantics originally carried by the table. Thus, the paragraph corresponding to a table can participate in the calculation of the semantic association score as an ordinary text paragraph.

After the semantic association score of each candidate title paragraph is obtained through the above process, step 1702 may be performed. The way the second preset condition is set depends on the type and content of the semantic association score. For example, if there is only one type of semantic association score, namely the similarity between word vectors mentioned in the above process, the second preset condition may be that the semantic association score is greater than a preset threshold.

For another example, if there are two types of semantic association score, specifically the similarities computed from two kinds of word vectors, such as the similarity between one-hot vectors and the similarity between word2vec vectors, the second preset condition may be divided into two sub-conditions: the similarity between one-hot vectors needs to be greater than one preset threshold, and the similarity between word2vec vectors needs to be greater than another preset threshold.

According to the method provided by the embodiment of the invention, the semantic association score of each candidate title paragraph is obtained, and the candidate title paragraphs with the semantic association scores meeting the second preset condition are selected from all the candidate title paragraphs and used as the title paragraphs. Because the candidate title paragraphs can be screened based on semantic relation between the candidate title paragraphs and text contents under the scope of the candidate title paragraphs, screening accuracy can be improved, and further accuracy of subsequent construction of question-answer pairs is improved.

A single type of semantic association score may not fully reflect the degree of semantic association between a candidate title paragraph and its scope. Based on this, in combination with the content of the above embodiments, in one embodiment, the semantic association score includes at least one of the following scores, and the second preset condition is determined by the following sub-conditions. The scores are: a word co-occurrence score, a paragraph semantic score, and a sentence semantic score. The sub-conditions are: the word co-occurrence score is greater than a fourth preset threshold, the paragraph semantic score is greater than a fifth preset threshold, and the sentence semantic score is greater than a sixth preset threshold. The score types included in the semantic association score match the sub-conditions included in the second preset condition.

The words appearing in a candidate title paragraph usually also appear in the paragraphs within its scope, that is, the words of a title usually appear in the text that follows it; this phenomenon may be referred to as word co-occurrence for the candidate title paragraph. Generally, word co-occurrence yields a quantized score by calculating the frequency with which word segments co-occur in two texts, and can be used for text comparison, text summarization and similar tasks. In the embodiment of the invention, in order to improve the redundancy of the word co-occurrence calculation, the word co-occurrence score may use the semantic association score between word segments instead of the frequency of their co-occurrence. Thus, the word co-occurrence score mainly characterizes the degree of semantic association between the word segments in the scope of the candidate title paragraph and the word segments in the candidate title paragraph, that is, the semantic association between the two at the word-segment level.

Similarly, the paragraph semantic score mainly characterizes the degree of semantic association between the paragraphs in the scope of the candidate title paragraph and the candidate title paragraph, that is, the semantic association between the two at the paragraph level. The sentence semantic score mainly characterizes the degree of semantic association between the sentences in the scope of the candidate title paragraph and the sentences in the candidate title paragraph, that is, the semantic association between the two at the sentence level.

In addition, the "the types of the scores included in the semantic association scores are matched with the sub-conditions included in the second preset conditions" mentioned in the second preset conditions means that the total number of the types of the scores included in the semantic association scores is the same as the total number of the sub-conditions included in the second preset conditions and corresponds to each other one by one. For example, if the semantic association score includes a word co-occurrence score, the sub-condition included in the second preset condition may be only "the word co-occurrence score is greater than the fourth preset threshold". If the semantic association score includes a word co-occurrence score and a paragraph semantic score, the sub-condition included in the second preset condition may be that the word co-occurrence score is greater than a fourth preset threshold and the paragraph semantic score is greater than a fifth preset threshold. If the semantic association score includes a word co-occurrence score, a paragraph semantic score and a sentence semantic score, the sub-condition included in the second preset condition may be that the word co-occurrence score is greater than a fourth preset threshold value, the paragraph semantic score is greater than a fifth preset threshold value, and the sentence semantic score is greater than a sixth preset threshold value.
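The one-to-one match between score types and sub-conditions can be sketched as below. The function name, the dictionary keys and the threshold values are hypothetical and chosen only for illustration; the sketch combines the sub-conditions with a logical AND, as in the examples just described.

```python
def meets_second_condition(scores, thresholds):
    """Return True if every score exceeds its matching threshold.

    scores and thresholds are dicts keyed by score type. Their key sets
    must be identical, mirroring the one-to-one match between the score
    types and the sub-conditions.
    """
    if scores.keys() != thresholds.keys():
        raise ValueError("score types and sub-conditions must match one to one")
    return all(scores[name] > thresholds[name] for name in scores)


# Assumed thresholds: "fourth" for word co-occurrence, "fifth" for paragraph.
thresholds = {"word_cooccurrence": 0.8, "paragraph": 0.7}
print(meets_second_condition({"word_cooccurrence": 0.9, "paragraph": 0.75}, thresholds))
```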

It should be appreciated that the more score types the semantic association score includes, the more accurate the screening of candidate title paragraphs. Preferably, in the actual implementation process, the semantic association score may include all three scores, and the second preset condition may include all three sub-conditions.

Furthermore, a vector expression is required to calculate similarity. Different word segments can be represented by word vectors of the same dimension. Because sentences are composed of word segments, integrating the word vectors of the word segments in a sentence gives different sentences vector expressions of the same dimension. Likewise, because paragraphs are composed of sentences, integrating the sentence vectors in a paragraph gives different paragraphs vector expressions of the same dimension. The integration manner may be vector superposition, which is not specifically limited in the embodiment of the present invention.
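A minimal sketch of vector superposition, assuming it means element-wise addition, is given below. The helper name `superpose` and the three-dimensional toy vectors are made up for illustration and are not real embeddings.

```python
def superpose(vectors):
    """Element-wise sum of equal-length vectors (vector superposition)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    return [sum(column) for column in zip(*vectors)]


# Toy word vectors for two sentences of one paragraph.
sentence_1 = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
sentence_2 = [[0.0, 1.0, 1.0]]

# Word vectors -> sentence vectors -> paragraph vector, all of dimension 3.
sentence_vectors = [superpose(sentence_1), superpose(sentence_2)]
paragraph_vector = superpose(sentence_vectors)
print(paragraph_vector)  # [1.5, 2.0, 3.0]
```

Whatever the number of word segments or sentences, the result keeps the dimension of the underlying word vectors, which is what makes the later similarity calculations possible.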

Based on the above-mentioned vector expression mode, when calculating the word co-occurrence score of the candidate heading paragraph, the word vector of each word in the candidate heading paragraph can be integrated first to obtain the word vector corresponding to the candidate heading paragraph. And integrating the word vectors of each word in the scope of the candidate title paragraph to obtain the word vector corresponding to the scope of the candidate title paragraph. Finally, the similarity between the word vector corresponding to the candidate title paragraph and the word vector corresponding to the scope of the candidate title paragraph is calculated and used as the word co-occurrence score of the candidate title paragraph.

When the sentence semantic score of the candidate title paragraph is calculated, one sentence can be selected from the sentences in the candidate title paragraph, and one sentence can be selected from the scope of the candidate title paragraph. The similarity between the sentence vectors of the two selected sentences is then calculated and used as the sentence semantic score of the candidate title paragraph. The sentence vector of a sentence may be obtained by superposing the word vectors of the word segments in the sentence.

When calculating the paragraph semantic score of the candidate title paragraph, the sentence vectors of the sentences in the candidate title paragraph can first be integrated to obtain the paragraph vector corresponding to the candidate title paragraph. The sentence vectors of the sentences in the scope of the candidate title paragraph are then integrated to obtain the paragraph vector corresponding to the scope. Finally, the similarity between the two paragraph vectors is calculated and used as the paragraph semantic score of the candidate title paragraph. The sentence vectors may be integrated by vector superposition. Of course, in the actual implementation process, the word co-occurrence score, the sentence semantic score and the paragraph semantic score of the candidate title paragraph may also be calculated in manners other than those listed above, which is not specifically limited in the embodiment of the present invention.

According to the method provided by the embodiment of the invention, as the semantic association score can comprise a plurality of score types and the second preset condition can correspondingly comprise a plurality of sub-conditions, compared with a single judgment basis, the accuracy of the title paragraphs obtained by screening can be improved, and the accuracy of the subsequent question-answer pair construction is further improved.

In connection with the above embodiments, in one embodiment, referring to FIG. 19, the semantic association scores include word co-occurrence scores; accordingly, embodiments of the present invention are not particularly limited in the manner in which the semantic association score for each candidate title paragraph is obtained, including, but not limited to:

1901. for any candidate title paragraph, calculating the word segmentation similarity between each word segment in the candidate title paragraph and each word segment in the scope of the candidate title paragraph;

1902. determining the word co-occurrence score of each word segment in the candidate title paragraph according to all the word segmentation similarities corresponding to that word segment;

1903. determining the word co-occurrence score of the candidate title paragraph according to the word co-occurrence score of each word segment in the candidate title paragraph.

In step 1901, before the word segmentation similarity is calculated, the candidate title paragraph and its scope may be segmented into words. It should be noted that, because stop words and punctuation marks carry no semantics, they can be deleted during word segmentation, so as to save storage space and improve word segmentation efficiency. A manually entered stop word list can be used to determine which words are stop words.
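The removal of stop words and punctuation can be sketched as follows. This is a simplified illustration: whitespace splitting stands in for a real word segmenter, `string.punctuation` covers ASCII punctuation only, and the stop word list is a hypothetical manually entered one.

```python
import string

# Hypothetical, manually entered stop word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "is"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Split on whitespace, then drop punctuation and stop words."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)  # trim leading/trailing punctuation
        if token and token not in stop_words:
            tokens.append(token)
    return tokens


print(clean_tokens("The warranty of the battery, explained."))
# ['warranty', 'battery', 'explained']
```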

After the candidate title paragraph and its scope are segmented, the word segments can be converted into word vectors through a preset model, for example into word2vec vectors through a word2vec model. The preset model can be obtained by training on a corpus of sample documents from the same field. After the word segments are converted into word vectors, the similarity between the word vectors can be calculated and used as the word segmentation similarity between the word segments.

In step 1901, since the word segmentation similarity needs to be calculated between each word segment in the candidate title paragraph and each word segment in the scope, each word segment in the candidate title paragraph obtains a preset number of word segmentation similarities, where the preset number equals the total number of word segments in the scope. In step 1902, for any word segment in the candidate title paragraph, all the word segmentation similarities corresponding to that word segment (that is, the preset number of word segmentation similarities) may be integrated into one value, and the value may be used as the word co-occurrence score of the word segment. The integration may select the maximum of the preset number of word segmentation similarities, or may average them, which is not specifically limited in the embodiment of the present invention.

Thus, each word segment in the candidate heading section may obtain a word co-occurrence score. For any candidate title, in step 1903, the word co-occurrence scores of all the segmented words in the candidate title are integrated into a numerical value and used as the word co-occurrence scores of the candidate title. Specifically, an average value corresponding to the word co-occurrence score of all the segmented words in the candidate title paragraph may be calculated, or a maximum value corresponding to the word co-occurrence score of all the segmented words in the candidate title paragraph may be selected as the word co-occurrence score of the candidate title paragraph, which is not specifically limited in the embodiment of the present invention. Through steps 1901 to 1903 described above, a word co-occurrence score for each candidate heading section may be obtained.
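Steps 1901 to 1903 can be sketched as below. The sketch assumes cosine similarity as the similarity measure (the embodiment only speaks of "similarity"), takes the maximum in step 1902 and the average in step 1903, and uses made-up two-dimensional vectors; all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_scores(title_vecs, scope_vecs, integrate=max):
    """Steps 1901-1902: one word co-occurrence score per title word segment."""
    return [integrate([cosine(t, s) for s in scope_vecs]) for t in title_vecs]


def title_score(title_vecs, scope_vecs, integrate=max):
    """Step 1903: average the per-segment scores (assumes non-empty input)."""
    scores = segment_scores(title_vecs, scope_vecs, integrate)
    return sum(scores) / len(scores)


# Toy word vectors: two word segments in the title, two in the scope.
title_vecs = [[1.0, 0.0], [0.0, 1.0]]
scope_vecs = [[1.0, 0.0], [1.0, 1.0]]
print(segment_scores(title_vecs, scope_vecs))
print(title_score(title_vecs, scope_vecs))
```

Passing a different `integrate` function (for example `statistics.mean`) switches step 1902 from the maximum to the average.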

According to the method provided by the embodiment of the invention, for any candidate title paragraph, the word segmentation similarity between each word segmentation in the candidate title paragraph and each word segmentation in the action domain of the candidate title paragraph is calculated. And determining the word co-occurrence score of each word segment in the candidate title segment according to the similarity of all the word segments corresponding to each word segment in the candidate title segment. And determining the word co-occurrence score of the candidate title according to the word co-occurrence score of each word in the candidate title. Because the word co-occurrence score of the candidate title paragraph is calculated by fully combining word segmentation semantic relation between the word segmentation in the candidate title paragraph and the word segmentation in the action domain, the word co-occurrence score of the candidate title paragraph is used as a judging basis for judging whether the candidate title paragraph is a title paragraph, the accuracy degree of the title paragraph obtained by screening can be improved, and the accuracy of the follow-up construction question-answer pair is improved.

In combination with the foregoing embodiments, in one embodiment, referring to fig. 20, embodiments of the present invention do not specifically limit the manner in which the word co-occurrence score of any candidate heading segment is determined based on the word co-occurrence score of each segmented word in any candidate heading segment, including but not limited to:

2001. selecting word segments with word co-occurrence scores greater than a seventh preset threshold value from the candidate title segments;

2002. summing the word co-occurrence scores corresponding to each screened word to obtain a sum value;

2003. calculating the ratio between the sum value and the total number of word segments in the candidate title paragraph, and taking the ratio as the word co-occurrence score of the candidate title paragraph.

Specifically, for any candidate title paragraph, denote the word co-occurrence score of the i-th word segment in the candidate title paragraph as $P_i$, the total number of word segments in the candidate title paragraph as $n$, the seventh preset threshold as $m$, and the word co-occurrence score of the candidate title paragraph as $P_{appear}$. The process of calculating $P_{appear}$ in steps 2001 to 2003 can then be expressed as the following formula (1):

$$P_{appear} = \frac{1}{n} \sum_{i:\, P_i > m} P_i \tag{1}$$

The value of m may be set as required, for example 0.5, which is not specifically limited in the embodiment of the present invention. Through the above formula (1), the word co-occurrence score of each candidate title paragraph can be calculated. It should be noted that, in formula (1), the word co-occurrence scores of the word segments are filtered by the seventh preset threshold. This is mainly because the word co-occurrence score of the candidate title paragraph serves as a basis for judging whether the candidate title paragraph is a title paragraph, and a word segment with a low word co-occurrence score has low confidence. If the word co-occurrence scores of such word segments were introduced into the calculation of the judgment basis, the accuracy of the judgment basis would be reduced.

According to the method provided by the embodiment of the invention, for any candidate title paragraph, the word segments whose word co-occurrence scores are greater than the seventh preset threshold are screened from the candidate title paragraph. The word co-occurrence scores of the screened word segments are summed to obtain a sum value. The ratio between the sum value and the total number of word segments in the candidate title paragraph is calculated and used as the word co-occurrence score of the candidate title paragraph. Because the word co-occurrence scores of the word segments are filtered by the seventh preset threshold, word segments with low word co-occurrence scores are kept out of the calculation of the word co-occurrence score of the candidate title paragraph, where they would otherwise reduce the accuracy of the result. Therefore, the accuracy of the title paragraphs obtained by screening can be improved, and the accuracy of the subsequent construction of question-answer pairs is improved.
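Steps 2001 to 2003 can be sketched as below, reading n as the total number of word segments in the candidate title paragraph (including those filtered out). The function name and the example scores are made up for illustration.

```python
def word_cooccurrence_score(segment_scores, m=0.5):
    """Sum the per-segment scores above threshold m, then divide by n.

    n counts every word segment in the candidate title paragraph, so
    filtered-out segments still lower the ratio.
    """
    n = len(segment_scores)
    if n == 0:
        return 0.0
    kept = [p for p in segment_scores if p > m]  # step 2001
    return sum(kept) / n                         # steps 2002 and 2003


# Made-up scores: 0.3 is filtered out, the other three are summed.
print(word_cooccurrence_score([0.9, 0.8, 0.3, 0.6]))  # about 0.575
```

Because the divisor stays at n, a title paragraph in which few word segments pass the threshold receives a low score even if the surviving scores are individually high.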

In combination with the foregoing embodiments, in one embodiment, referring to FIG. 21, the semantic association scores include paragraph semantic scores; accordingly, embodiments of the present invention are not particularly limited in the manner in which the semantic association score for each candidate title paragraph is obtained, including, but not limited to:

2101. for any candidate title paragraph, calculating the paragraph similarity between the candidate title paragraph and each paragraph in the scope of the candidate title paragraph;

2102. determining the paragraph semantic score of the candidate title paragraph according to each paragraph similarity corresponding to the candidate title paragraph.

In step 2101, before the paragraph similarity is calculated, the paragraph vector of the candidate title paragraph and the paragraph vector of each paragraph in the scope may be obtained. Since word vectors are available, sentences are composed of word segments and paragraphs are composed of sentences, the paragraph vector of a paragraph can be obtained by the superposition manner mentioned above. It should be noted that, in the above embodiments, the paragraphs in the document may be filtered by the first preset condition to obtain the candidate title paragraphs. One item of the first preset condition is that the number of sentences in the paragraph is a first preset threshold; when the first preset threshold is 1, each candidate title paragraph in fact contains exactly one sentence. In that case, the sentence vector of that sentence can be used directly as the paragraph vector of the candidate title paragraph, without superposing sentence vectors.

After the paragraph vectors are obtained, the paragraph similarity between paragraphs can be calculated. For example, if the paragraph vector is represented by a word2vec vector, the similarity between the word2vec vectors of two paragraphs may be calculated and used as the paragraph similarity. Considering that each type of vector expression may introduce its own error, in the actual implementation process the errors can be averaged out as far as possible by calculating the paragraph similarity between a candidate title paragraph and a paragraph in its scope as follows: calculating the paragraph similarity between the candidate title paragraph and the paragraph under each vector type; then integrating the paragraph similarities corresponding to the vector types, and taking the integrated result as the final paragraph similarity between the candidate title paragraph and the paragraph. The integration manner may be weighted summation, which is not specifically limited in the embodiment of the present invention.

For ease of understanding, the above calculation process is described taking as an example the case in which the total number of sentences in the candidate title paragraph is 1, the integration manner is weighted summation, and there are 2 vector types, namely the word2vec vector and the Bert vector. The calculation process can refer to the following formula (2):

$$\alpha_1 \cdot (W_C, W_{Z_i}) + \beta_1 \cdot (B_C, B_{Z_i}) \tag{2}$$

Here, $\alpha_1$ represents the weight corresponding to the word2vec vector, and $\beta_1$ represents the weight corresponding to the Bert vector. The subscript $C$ denotes the candidate title paragraph: $W_C$ is the word2vec vector of the candidate title paragraph, and $W_{Z_i}$ is the word2vec vector of the i-th paragraph in the scope of the candidate title paragraph. $B_C$ is the Bert vector of the candidate title paragraph, and $B_{Z_i}$ is the Bert vector of the i-th paragraph in the scope of the candidate title paragraph. Each bracketed pair denotes the paragraph similarity between the two vectors under that vector type. The calculation result of formula (2) is the final paragraph similarity between the candidate title paragraph C and the i-th paragraph in the scope.

For any candidate title paragraph, after the paragraph similarity between the candidate title paragraph and each paragraph in the scope is obtained, in step 2102 the average of all the paragraph similarities may be calculated and used as the paragraph semantic score of the candidate title paragraph. Alternatively, the paragraph semantic score of the candidate title paragraph may be determined using the following formula (3):

$$P_{para} = \max_{i}\left(\alpha_1 \cdot (W_C, W_{Z_i}) + \beta_1 \cdot (B_C, B_{Z_i})\right) \tag{3}$$

Here, max denotes taking the maximum value and $P_{para}$ denotes the paragraph semantic score of the candidate title paragraph; that is, the maximum value is selected from all the paragraph similarities corresponding to the candidate title paragraph. Through formula (2) and formula (3), the paragraph semantic score of each candidate title paragraph can be calculated.
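Formulas (2) and (3) can be sketched as follows. The sketch assumes cosine similarity for each bracketed pair and equal weights of 0.5; neither choice is fixed by the embodiment. The dictionary keys `"w2v"` and `"bert"` and the toy vectors are hypothetical stand-ins for real word2vec and Bert paragraph vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def paragraph_similarity(title, para, alpha=0.5, beta=0.5):
    """Formula (2): weighted sum of the similarities under both vector types."""
    return (alpha * cosine(title["w2v"], para["w2v"])
            + beta * cosine(title["bert"], para["bert"]))


def paragraph_semantic_score(title, scope_paragraphs, alpha=0.5, beta=0.5):
    """Formula (3): maximum final similarity over the paragraphs in the scope."""
    return max(paragraph_similarity(title, p, alpha, beta)
               for p in scope_paragraphs)


# Toy vectors for a one-sentence candidate title and a two-paragraph scope.
title = {"w2v": [1.0, 0.0], "bert": [0.0, 1.0]}
scope = [
    {"w2v": [0.0, 1.0], "bert": [1.0, 0.0]},  # unrelated under both types
    {"w2v": [1.0, 0.0], "bert": [1.0, 1.0]},  # close under both types
]
print(paragraph_semantic_score(title, scope))
```

Replacing `max` with an average over the scope gives the alternative mentioned for step 2102.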

According to the method provided by the embodiment of the invention, for any candidate title paragraph, the paragraph similarity between the candidate title paragraph and each paragraph in the action domain of the candidate title paragraph is calculated. And determining the paragraph semantic score of the candidate title according to the similarity of each paragraph corresponding to the candidate title. Because the paragraph semantic scores of the candidate title paragraphs are calculated by fully combining the paragraph semantic relations between the candidate title paragraphs and the paragraphs in the action domain, the paragraph semantic scores of the candidate title paragraphs are used as the judging basis of whether the candidate title paragraphs are title paragraphs, the accuracy degree of the title paragraphs obtained by screening can be improved, and the accuracy of subsequent construction question-answer pairs is improved.

In connection with the above embodiments, in one embodiment, referring to FIG. 22, the semantic association scores include sentence semantic scores; accordingly, embodiments of the present invention are not particularly limited in the manner in which the semantic association score for each candidate title paragraph is obtained, including, but not limited to:

2201. For any candidate heading paragraph, calculating the sentence similarity between each sentence in the candidate heading paragraph and each sentence in the scope of the candidate heading paragraph;

2202. and determining the sentence semantic score of the candidate heading section according to the sentence similarity corresponding to each sentence in the candidate heading section.

In step 2201, before calculating the sentence similarity between the sentences, the sentence vectors of the sentences in the candidate heading paragraph may be obtained, and the sentence vector of each sentence in the scope may be obtained, and the specific process may refer to the description in the corresponding embodiment of fig. 21, which is not described herein. It should be noted that, the total number of sentences in the candidate heading section may be 1 after the first preset condition filtering, so the total number of sentence vectors in the candidate heading section may be only 1.

After the sentence vectors are obtained, the sentence similarity between sentences can be calculated. For example, if the sentence vector is represented by a word2vec vector, the similarity between the word2vec vectors of two sentences may be calculated and used as the sentence similarity. For the same reason as in the embodiment corresponding to fig. 21, the following calculation process may be adopted to average out the errors when calculating the sentence similarity between two sentences. Taking the case in which the candidate title paragraph contains 1 sentence, and taking one sentence in the scope as an example, the calculation process for that sentence can be as follows: calculating the sentence similarity between the candidate title paragraph and the sentence under each vector type; then integrating the sentence similarities corresponding to the vector types, and taking the integrated result as the final sentence similarity between the candidate title paragraph and the sentence. The integration manner may be weighted summation, which is not specifically limited in the embodiment of the present invention.

For ease of understanding, the above calculation process is described taking as an example the case in which the integration manner is weighted summation and there are 2 vector types, namely the word2vec vector and the Bert vector. The calculation process can refer to the following formula (4):

$$\alpha_2 \cdot (W_C, W_{sp_{ij}}) + \beta_2 \cdot (B_C, B_{sp_{ij}}) \tag{4}$$

Here, $\alpha_2$ represents the weight corresponding to the word2vec vector, and $\beta_2$ represents the weight corresponding to the Bert vector. The subscript $C$ denotes the sentence in the candidate title paragraph; since the candidate title paragraph contains only one sentence, $C$ may also denote the candidate title paragraph itself. $W_C$ is the word2vec vector of the candidate title paragraph, and $W_{sp_{ij}}$ is the word2vec vector of the j-th sentence in the i-th paragraph in the scope of the candidate title paragraph. $B_C$ is the Bert vector of the candidate title paragraph, and $B_{sp_{ij}}$ is the Bert vector of the j-th sentence in the i-th paragraph in the scope of the candidate title paragraph. The calculation result of formula (4) is the final sentence similarity between the candidate title paragraph C and the j-th sentence in the i-th paragraph in the scope.

For any candidate title paragraph, after the sentence similarity between the candidate title paragraph and each sentence of each paragraph in the scope is obtained, in step 2202 the average of all the sentence similarities may be calculated and used as the sentence semantic score of the candidate title paragraph. Alternatively, the sentence semantic score of the candidate title paragraph may be determined using the following formula (5):

$$P_{sen} = \max_{i,j}\left(\alpha_2 \cdot (W_C, W_{sp_{ij}}) + \beta_2 \cdot (B_C, B_{sp_{ij}})\right) \tag{5}$$

Here, max denotes taking the maximum value and $P_{sen}$ denotes the sentence semantic score of the candidate title paragraph; that is, the maximum value is selected from all the sentence similarities corresponding to the candidate title paragraph. Through formula (4) and formula (5), the sentence semantic score of each candidate title paragraph can be calculated. It should be noted that the weights used in formula (3) and formula (5) may be the same or different, which is not specifically limited in the embodiment of the present invention.

Preferably, suppose the semantic association score includes the word co-occurrence score, the paragraph semantic score and the sentence semantic score at the same time, and the second preset condition is determined by the sub-conditions that the word co-occurrence score is greater than the fourth preset threshold, the paragraph semantic score is greater than the fifth preset threshold, and the sentence semantic score is greater than the sixth preset threshold. Taking the fourth preset threshold as 0.8, the fifth preset threshold as 0.7 and the sixth preset threshold as 0.7 as an example, the process of judging whether a candidate title paragraph is a title paragraph, that is, the specific judgment process corresponding to the second preset condition, may refer to the following formula (6):

$$P_{appear} > 0.8 \;\land\; \left(P_{para} > 0.7 \;\lor\; P_{sen} > 0.7\right) \tag{6}$$

For formula (6) above, the second preset condition is equivalent to: the word co-occurrence score is greater than 0.8 and the paragraph semantic score is greater than 0.7, or the word co-occurrence score is greater than 0.8 and the sentence semantic score is greater than 0.7. Of course, in the actual implementation process, the sub-conditions of the second preset condition may also be combined differently according to actual requirements, which is not specifically limited in the embodiment of the present invention.
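The judgment in formula (6) can be sketched as a single boolean expression. The function and parameter names are hypothetical; the default thresholds are the example values 0.8 and 0.7 used above.

```python
def is_title_paragraph(p_appear, p_para, p_sen,
                       t_word=0.8, t_para=0.7, t_sen=0.7):
    """Formula (6): word score AND (paragraph score OR sentence score)."""
    return p_appear > t_word and (p_para > t_para or p_sen > t_sen)


print(is_title_paragraph(0.85, 0.60, 0.75))  # True: sentence score passes
print(is_title_paragraph(0.85, 0.60, 0.60))  # False: neither semantic score passes
print(is_title_paragraph(0.75, 0.90, 0.90))  # False: word score fails
```

In this combination the word co-occurrence sub-condition is mandatory, while the paragraph and sentence sub-conditions are alternatives to each other.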

It should be noted that the characters in a title are usually centered, bold or enlarged, that is, a title usually has some preset format. The item "paragraph format satisfies the preset format" can therefore also be used to screen title paragraphs, and may be included in the first preset condition. In summary, the first preset condition screens title paragraphs mainly by paragraph format, while the second preset condition screens mainly by the degree of semantic association between a title and the text that follows it. In the actual implementation process, to reach a more accurate judgment, the two angles should not be considered separately but in combination; that is, the first preset condition and the second preset condition may assist each other in the judgment. For example, even if the second preset condition indicates a low likelihood that a paragraph is a title paragraph, if the paragraph conforms to the title format given by the first preset condition, the judgment results of the two conditions can be combined for a comprehensive judgment, so as to obtain a more accurate result.

Therefore, in the actual implementation process, the sequential judgment process mentioned in the above embodiments, in which candidate title paragraphs are first screened based on the first preset condition and title paragraphs are then screened from the candidate title paragraphs based on the second preset condition, need not be followed. The first preset condition and the second preset condition can instead be combined for a comprehensive judgment, in which the calculation process and the judgment basis can still follow the content of the embodiments of the invention. It should be noted that the content that may be included in the first preset condition and the content that may be included in the second preset condition may be complementary. For example, if the first preset condition does not include the item "paragraph format satisfies the preset format", the second preset condition may include it. The definition of the preset format may refer to the content of the above embodiments, and is not repeated here.

According to the method provided by the embodiment of the invention, for any candidate title paragraph, the sentence similarity between each sentence in the candidate title paragraph and each sentence in the acting domain of the candidate title paragraph is calculated. And determining the sentence semantic score of the candidate heading section according to the sentence similarity corresponding to each sentence in the candidate heading section. Because the sentence semantic score of the candidate heading paragraph is calculated by fully combining the sentence semantic relation between the sentences in the candidate heading paragraph and the sentences in the action domain, the sentence semantic score of the candidate heading paragraph is used as the judging basis of whether the candidate heading paragraph is the heading paragraph, the accuracy degree of the heading paragraph obtained by screening can be improved, and the accuracy of the subsequent construction question-answer pair is further improved.

The process of screening the title paragraphs from the candidate title paragraphs described in the above embodiments is mainly based on the degree of semantic association. In actual implementation, some paragraphs may be titles and yet have no semantic association with the content of the paragraphs that follow them. Examples are titles of a catalog nature such as "abstract", "brief introduction", "material" and "conclusion". It is therefore difficult to screen these titles from the candidate title paragraphs through the semantic-association screening process described above.

To address this issue, in combination with the above embodiments, in one embodiment, the manner in which title paragraphs are determined is not specifically limited and includes, but is not limited to: screening out, from the paragraphs obtained by splitting, the paragraphs that satisfy a third preset condition, and taking them as title paragraphs, wherein the third preset condition is set based on catalog title features.

In particular, titles of a catalog nature such as "abstract", "brief introduction", "material" and "conclusion" generally satisfy a preset format. The preset format may be that the paragraph is centered, that all fonts in the paragraph are bold, that the fonts in the paragraph are enlarged, and so on; for a specific description of the preset format, reference may be made to the above embodiments, and details are not repeated here. Thus, the third preset condition may include "the paragraph format satisfies the preset format". It should be noted that, as shown in the foregoing embodiments, the first preset condition may also include "the paragraph format satisfies the preset format". The first preset condition includes this content because of a property common to titles, namely that the paragraph format of a title paragraph generally conforms to the preset format. In the embodiment of the invention, a catalog title is treated as a special kind of title that is unrelated to semantics, and its paragraph format usually conforms to the preset format as well. To avoid duplicated filtering conditions, if the third preset condition includes "the paragraph format satisfies the preset format", the first preset condition may omit it.

In addition to having a paragraph format that satisfies the preset format, a catalog title may have other features. Examples include a purely numeric catalog title such as "1.1" and a plain-text catalog title such as "first chapter". Because catalog titles of this kind take only a small number of forms, they are generally easy to summarize into templates or rules, so whether a paragraph is a catalog title can be determined by means of a preset template or a preset rule. The preset rule may be a regular expression; for example, "1.1" can be described by a regular expression of the form "number.number", and catalog titles can then be identified with that regular expression. The preset template may be a character template obtained by summarizing catalog titles, such as a template of number combinations or a template combining characters related to chapters, sections and clauses.

Correspondingly, the third preset condition may further include at least one of the following sub-conditions: the paragraph satisfies a preset rule; and the similarity between the paragraph and a preset template is greater than a preset score threshold. The similarity may be calculated from word2vec vectors, and the preset score threshold may be 0.9. In addition, since the number of characters in a catalog title usually falls within a fixed range, for example within 3 characters or within 6 characters, a paragraph whose actual number of characters differs little from the number expected for a catalog title may also be a title paragraph. Therefore, the sub-conditions may further include: the number of characters is within a preset interval.
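As a minimal sketch of how such rules might be preset, the following Python example checks a paragraph against two hypothetical regular expressions, a small hypothetical word list and an assumed length limit. The names CATALOG_RULES, CATALOG_WORDS, MAX_CHARS and is_catalog_title, the English "Chapter N" pattern and the way the sub-conditions are combined are illustrative assumptions and are not taken from the embodiments; the template-similarity sub-condition is omitted.

```python
import re

# Hypothetical preset rules: "number.number..." numbering and "Chapter N" headings.
CATALOG_RULES = [
    re.compile(r"^\d+(\.\d+)+$"),          # e.g. "1.1", "2.3.4"
    re.compile(r"^Chapter\s+\d+$", re.I),  # e.g. "Chapter 1"
]
# Hypothetical list of catalog-style words that carry no semantic link to the text.
CATALOG_WORDS = {"abstract", "introduction", "materials", "conclusion"}
# Assumed upper bound of the "preset interval" for the number of characters.
MAX_CHARS = 12


def is_catalog_title(paragraph: str) -> bool:
    """Return True if the paragraph looks like a catalog title under the assumed rules."""
    text = paragraph.strip()
    # Length gate: catalog titles are assumed to be short.
    if not text or len(text) > MAX_CHARS:
        return False
    if text.lower() in CATALOG_WORDS:
        return True
    return any(rule.match(text) for rule in CATALOG_RULES)
```

Here the length check is applied as a gate before the other checks; an implementation could equally treat each sub-condition as independent evidence.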

It should be noted that the title paragraphs screened out by the third preset condition are mainly used as hierarchical titles for constructing the hierarchical structure. Because such a title is merely of a catalog nature, it has no semantic association with the paragraphs that follow it and therefore cannot serve as a question in a question-answer pair. Therefore, in actual implementation, the screening of catalog titles may optionally be omitted.

It should be noted that, as can be seen from the above embodiments, the first preset condition and the second preset condition may assist each other in the judgment. Similarly, in actual implementation the third preset condition may assist the judgment made with the first preset condition and the second preset condition. In practice, some title paragraphs may first be screened out by semantic association through the second preset condition, and catalog titles may then be screened out from the paragraphs obtained by splitting through the third preset condition, so as to avoid missing catalog titles that are unrelated to semantics and would be overlooked by semantic-association screening alone. Finally, it should be noted that the title paragraphs obtained through the first preset condition, the second preset condition and the third preset condition may contain duplicates. Therefore, duplicate title paragraphs can be removed from the title paragraphs obtained through the above three conditions.

According to the method provided by the embodiment of the invention, the paragraphs meeting the third preset condition are screened out from the split paragraphs and are used as the title paragraphs. Because the third preset condition is set based on the characteristic of the catalog title, the catalog title can be screened based on the third preset condition, so that a complete hierarchical structure can be constructed later.

As can be seen from the above embodiments, a hierarchical title mainly refers to a title item, and hierarchical titles can be classified into chapter types, digital types, and organized types. In general, a given type of hierarchical title does not appear only once but appears multiple times. For example, if a chapter-type hierarchical title such as "first chapter" appears in a document, chapter-type hierarchical titles such as "second chapter" and "third chapter" usually exist as well. If a hierarchical title of a certain type occurs only rarely, it can be regarded as a false positive of the previous screening process. Based on this, in combination with the above embodiments, in one embodiment, all the title paragraphs may be further filtered before the hierarchical structure of the document is determined based on the title paragraphs and the frame templates. Referring to fig. 23, the manner in which the title paragraphs are further filtered is not specifically limited in the embodiment of the invention and includes, but is not limited to:

2301. Determining the hierarchical title type corresponding to the title item in each title paragraph, and counting the total occurrence times of each hierarchical title type;

2302. Filtering out, from all the title paragraphs, the title paragraphs corresponding to any hierarchical title type whose total number of occurrences is smaller than an eighth preset threshold.

In step 2301, for the definitions of the title item and the hierarchical title type, reference may be made to the above embodiments, and details are not repeated here. For example, if the title paragraphs determined for a certain document are as shown in fig. 24, the number of occurrences of the chapter-type hierarchical title is 2, namely "first chapter" and "second chapter". The number of occurrences of the digital hierarchical title in the form of "number.number" is 3, namely "1.1", "1.2" and "2.1". Similarly, the number of occurrences of the digital hierarchy header in the form of "number.

In step 2302 above, the eighth preset threshold may be set as required, for example to 3, and is not specifically limited in the embodiment of the invention. For example, if the set of title paragraphs of a document is {1.1xxx, 1.2xxxx, 1.4xxxx, (2)xxx, 1.5xxxx}, the total number of occurrences of the hierarchical title type of the form "(number)" is 1. If the eighth preset threshold is 3, the title paragraph "(2)xxx" is filtered out of the title paragraphs because this total is less than 3.

According to the method provided by the embodiment of the invention, the hierarchical title type corresponding to the title item in each title paragraph is determined, and the total number of occurrences of each hierarchical title type is counted. The title paragraphs corresponding to any hierarchical title type whose total number of occurrences is smaller than the eighth preset threshold are then filtered out of the title paragraphs. Because title paragraphs that were screened in by mistake can be filtered out in this way, the accuracy of the title paragraphs obtained by screening is improved, and in turn the accuracy of the subsequently constructed question-answer pairs.
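Steps 2301 and 2302 can be sketched as follows. This is an illustrative example only: the type patterns in TYPE_PATTERNS, the English "Chapter N" form and the function names are assumptions, and paragraphs whose title item matches no known type are simply kept, which is also an assumption.

```python
import re
from collections import Counter

# Hypothetical patterns mapping a title item to a hierarchical title type.
TYPE_PATTERNS = [
    ("number.number", re.compile(r"^\d+\.\d+(?![.\d])")),
    ("(number)", re.compile(r"^\(\d+\)")),
    ("chapter", re.compile(r"^Chapter\s+\d+", re.I)),
]


def title_type(paragraph):
    """Return the hierarchical title type of a title paragraph, or None if unknown."""
    text = paragraph.strip()
    for name, pattern in TYPE_PATTERNS:
        if pattern.match(text):
            return name
    return None


def filter_rare_types(title_paragraphs, threshold):
    """Drop title paragraphs whose type occurs fewer than `threshold` times."""
    types = [title_type(p) for p in title_paragraphs]
    counts = Counter(t for t in types if t is not None)
    return [
        p for p, t in zip(title_paragraphs, types)
        if t is None or counts[t] >= threshold
    ]


titles = ["1.1xxx", "1.2xxxx", "1.4xxxx", "(2)xxx", "1.5xxxx"]
kept = filter_rare_types(titles, threshold=3)
```

With the title paragraph set from the example and a threshold of 3, the "(number)" type occurs once and "(2)xxx" is removed, while the four "number.number" titles are kept.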

In connection with the foregoing embodiments, in one embodiment, referring to FIG. 25, embodiments of the present invention do not specifically limit the manner in which the hierarchical structure of a document is determined from the title paragraphs and the frame templates, including, but not limited to:

2501. Taking all the title paragraphs as a title paragraph set, determining a frame template matched with the title paragraph set according to the title item of each title paragraph and the title items in each frame template, and taking that frame template as a target frame template;

2502. Filling the title item of each title paragraph into the matched target frame template to form a preliminary frame, wherein the title items filled into the target frame template serve as the hierarchical titles in the preliminary frame;

2503. From each preliminary frame, a hierarchical structure of the document is determined.

In step 2501, a "frame template matched with the title paragraph set" refers to a frame template that contains title items of the title paragraphs. Take the two frame templates shown in fig. 26 and fig. 27 as an example. If the title paragraphs included in the title paragraph set are as shown in fig. 24, the frame template of fig. 26 is determined to match the title paragraph set because it contains the title item "first chapter" of the title paragraph "first chapter XXXXX" in the set. Similarly, since the frame template of fig. 27 contains the title item "1.1" of the title paragraph "1.1XXXX" in the set, the frame template of fig. 27 is also determined to match the title paragraph set.

In step 2502, for the filling process, reference may be made to fig. 28 and fig. 15, which show the two preliminary frames formed after the filling is completed.
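The matching and filling of steps 2501 and 2502 might look like the sketch below. The frame templates of fig. 26 and fig. 27 are not reproduced here; FRAME_TEMPLATES, ITEM_PATTERN, title_item and fill_templates are hypothetical names, the templates are tiny stand-ins, and an English "Chapter N" form is assumed in place of "first chapter".

```python
import re

# Hypothetical frame templates: each is an ordered list of title items.
FRAME_TEMPLATES = {
    "chapter": [f"Chapter {i}" for i in range(1, 4)],
    "numeric": [f"{i}.{j}" for i in range(1, 4) for j in range(1, 4)],
}

# Assumed pattern for extracting the title item at the start of a title paragraph.
ITEM_PATTERN = re.compile(r"^(Chapter\s+\d+|\d+(?:\.\d+)*)")


def title_item(paragraph):
    """Return the leading title item of a paragraph, or None if there is none."""
    match = ITEM_PATTERN.match(paragraph.strip())
    return match.group(1) if match else None


def fill_templates(title_paragraphs, templates):
    """Return {template name: {title item: filling paragraph or None}}.

    A template is selected as a target template as soon as it contains the
    title item of at least one title paragraph; unfilled items stay None.
    """
    frames = {}
    for paragraph in title_paragraphs:
        item = title_item(paragraph)
        for name, items in templates.items():
            if item in items:
                frame = frames.setdefault(name, dict.fromkeys(items))
                frame[item] = paragraph
    return frames


paragraphs = ["Chapter 1 XXXXX", "1.1 XXXX", "1.2 XXXX", "Chapter 2 XXXXX", "2.1 XXXX"]
frames = fill_templates(paragraphs, FRAME_TEMPLATES)
```

Each resulting dictionary plays the role of a preliminary frame: filled entries are hierarchical titles, and entries left as None are title items that were never filled.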

In step 2503 above, the preliminary frames may be merged to obtain the hierarchical structure of the document. It should be noted that, after the title items of the title paragraphs are filled into the target frame templates to form the preliminary frames, the title items exist as hierarchical titles in the preliminary frames, and the hierarchical titles themselves may differ in level. For example, a chapter-type hierarchical title is generally higher in level than a digital-type hierarchical title. Thus, hierarchical titles may also stand in a containing and contained relationship to one another. Summarizing how hierarchical titles occur in documents shows that, following the order of the title paragraphs, the hierarchical title that occurs first is typically a high-level hierarchical title, and the hierarchical titles lying between two hierarchical titles of the same type and level are typically subordinate to whichever of those two comes first. For example, "first chapter" and "second chapter" are two hierarchical titles of the same type and level, "first chapter" comes first, and the hierarchical title "1.1" between them should be subordinate to "first chapter". Based on this rule, the levels of the hierarchical titles relative to one another can be further determined.

Based on the above description of the levels of hierarchical titles, if a preliminary frame is regarded as a tree structure, a lower-level hierarchical title may be attached as a child node of a higher-level hierarchical title, thereby splicing the hierarchical titles together. Once the hierarchical titles have been spliced, the merging of the preliminary frames is complete. The tree structure formed by merging the preliminary frames is the hierarchical structure.
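The ordering rule described above can be illustrated with a short stack-based sketch. It assumes that each title arrives as a (type, text) pair in document order and that a type seen earlier ranks higher than a type seen later; the function name nest_by_order and the dictionary node layout are hypothetical.

```python
def nest_by_order(titles):
    """Build a tree from (type, text) pairs given in document order.

    Assumption: the type that appears first is the highest level, and a title
    is subordinate to the nearest preceding title of a higher-ranked type.
    """
    rank = {}
    for kind, _ in titles:
        rank.setdefault(kind, len(rank))  # first appearance fixes the rank

    roots, stack = [], []  # stack holds (type, node) along the current path
    for kind, text in titles:
        node = {"title": text, "children": []}
        # Pop titles of the same or a lower level; they cannot be the parent.
        while stack and rank[stack[-1][0]] >= rank[kind]:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)
        else:
            roots.append(node)
        stack.append((kind, node))
    return roots


ordered = [
    ("chapter", "Chapter 1"),
    ("numeric", "1.1"),
    ("numeric", "1.2"),
    ("chapter", "Chapter 2"),
    ("numeric", "2.1"),
]
tree = nest_by_order(ordered)
```

In this example "1.1" and "1.2" lie between the two chapter titles and therefore become children of "Chapter 1", while "2.1" becomes a child of "Chapter 2".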

According to the method provided by the embodiment of the invention, all the title paragraphs are used as the title paragraph sets, and the frame templates matched with the title paragraph sets are determined according to the title item of each title paragraph and the title item in each frame template and are used as target frame templates. The title item of each title paragraph is filled into the matched target frame template to form a preliminary frame. From each preliminary frame, a hierarchical structure of the document is determined. The framework template is obtained by summarizing the existing hierarchical structure between the title and the text in the document, so that the hierarchical structure of the document can be accurately determined based on the framework template, and further, the question-answer pairs in the subsequently constructed document are more accurate.

In practical implementation, document structures are usually complex and not uniform. In the related art, parsing of a document structure is usually considered complete once each title has been assigned to a level in the document. However, when a document structure is parsed by rules or similar means in the related art, erroneous parsing and missed parsing still occur.

For example, in the embodiment of the invention, suppose that a hierarchical title "2.2" in the preliminary frame was obtained by filling in the title paragraph "2.2 jin apple price is too cheap", while there actually exists a title paragraph "2.2 accepted and applied method". Clearly, the paragraph "2.2 jin apple price is too cheap" should not have been filled in as the hierarchical title "2.2"; this is an erroneous parsing (filling) case of "the dove occupying the magpie's nest". Furthermore, if the hierarchical titles formed after filling include "1.1.1" and "1.1.3", then according to the usual structure of documents a hierarchical title "1.1.2" should also exist, and its absence may be regarded as a case of missed parsing (filling).

Based on the above problems, in combination with the above embodiments, in one embodiment, before the hierarchical structure of the document is determined according to each preliminary frame, the method further includes: verifying the hierarchical titles in each preliminary frame in a preset mode, wherein the preset mode includes at least one of clipping, supplementing and correcting.

The correcting mode mainly corrects erroneous filling of the "dove occupying the magpie's nest" kind. It is therefore necessary first to identify which hierarchical titles were filled erroneously. As can be seen from the above example, the title paragraph corresponding to a genuine hierarchical title typically has the following features: the number in the title paragraph is followed by a space, the title paragraph is short, and it contains few punctuation marks. Accordingly, a fourth preset condition may be set for identifying erroneously filled hierarchical titles.

Specifically, for any preliminary frame, it may be determined whether each hierarchical title formed by filling in the preliminary frame satisfies the fourth preset condition. If a hierarchical title does not satisfy the fourth preset condition, a title paragraph having the same title item is searched for among all the title paragraphs according to the title item corresponding to that hierarchical title, and the preliminary frame is refilled with the title paragraph found. It should be noted that, since the title items are the same and a hierarchical title is mainly presented externally as its title item, the refilling process may consist of modifying the correspondence between the hierarchical title and the title paragraph, that is, replacing the correspondence established by the earlier erroneous filling with the correspondence after refilling. The fourth preset condition may be set as required and may include at least one of the following conditions: a space exists between the number and the text in the title paragraph corresponding to the hierarchical title; the total number of words in the title paragraph corresponding to the hierarchical title is smaller than a ninth preset threshold; and the total number of punctuation marks in the title paragraph corresponding to the hierarchical title is smaller than a tenth preset threshold.
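A possible form of this check and refill is sketched below. It assumes a preliminary frame stored as a mapping from title item to the paragraph that filled it, applies all three sub-conditions together (the embodiment only requires at least one), and uses placeholder values for the ninth and tenth preset thresholds; all names and the example paragraphs are hypothetical.

```python
import re

NUMBER_PREFIX = re.compile(r"^\d+(?:\.\d+)*")
PUNCTUATION = re.compile(r"[,;:!?，；：！？。]")


def title_item(paragraph):
    """Return the leading number of a paragraph, e.g. "2.2", or None."""
    match = NUMBER_PREFIX.match(paragraph)
    return match.group(0) if match else None


def looks_like_real_title(paragraph, max_words=8, max_punct=1):
    """Assumed fourth preset condition: space after number, short, little punctuation."""
    has_space = re.match(r"^\d+(?:\.\d+)*\s+\S", paragraph) is not None
    few_words = len(paragraph.split()) < max_words          # "ninth preset threshold"
    few_punct = len(PUNCTUATION.findall(paragraph)) < max_punct  # "tenth preset threshold"
    return has_space and few_words and few_punct


def correct_frame(frame, title_paragraphs):
    """Refill hierarchical titles whose filling paragraph fails the condition."""
    corrected = dict(frame)
    for item, paragraph in frame.items():
        if paragraph is None or looks_like_real_title(paragraph):
            continue
        for candidate in title_paragraphs:
            if (candidate != paragraph
                    and title_item(candidate) == item
                    and looks_like_real_title(candidate)):
                corrected[item] = candidate  # only the correspondence changes
                break
    return corrected


wrong = "2.2kg of apples is too cheap, really"
right = "2.2 Acceptance and application method"
fixed = correct_frame({"2.2": wrong}, [wrong, right])
```

The hierarchical title "2.2" itself is unchanged; only the title paragraph it points to is replaced.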

The supplementing mode is mainly based on continuity of the hierarchy titles, and supplements the hierarchy titles which may be missing. For specific procedures, reference is made to fig. 29, and the hierarchical titles in the dashed boxes in fig. 29 are hierarchical titles that are complemented based on the continuity of the hierarchical titles.
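A minimal sketch of continuity-based supplementing for digital-type title items follows. It assumes that gaps are filled only between the smallest and largest sibling number actually present under the same parent; whether fig. 29 fills gaps in exactly this way is not confirmed here, and the function name is hypothetical.

```python
def supplement_missing(items):
    """Add numeric title items missing between existing siblings.

    `items` are strings such as "1.1.1"; siblings share the same prefix.
    """
    def key(item):
        return [int(part) for part in item.split(".")]

    # Group the last number of each item by its parent prefix.
    by_parent = {}
    for item in items:
        parts = key(item)
        by_parent.setdefault(tuple(parts[:-1]), set()).add(parts[-1])

    completed = set(items)
    for parent, lasts in by_parent.items():
        for n in range(min(lasts), max(lasts) + 1):
            completed.add(".".join(str(p) for p in (*parent, n)))
    return sorted(completed, key=key)


supplemented = supplement_missing(["1.1.1", "1.1.3"])
```

For the example of "1.1.1" and "1.1.3", the missing sibling "1.1.2" is added.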

The clipping mode mainly clips the title items in the preliminary frame that have not been filled. As shown in fig. 29, "1", "2", "1.1.4", "1.3" and "1.4" in the preliminary frame are all unfilled title items and can therefore be clipped. Specifically, the preliminary frame obtained by clipping the preliminary frame of fig. 29 is shown in fig. 30, and the preliminary frame obtained by clipping the preliminary frame of fig. 28 is shown in fig. 31.

According to the method provided by the embodiment of the invention, each preliminary frame is verified in a preset mode, and the preset mode includes at least one of clipping, supplementing and correcting. Because the hierarchical titles in the preliminary frames can be verified in these three modes, the hierarchical structure of the document can be determined accurately, and the question-answer pairs in the document can in turn be constructed accurately.

In connection with the foregoing embodiments, in one embodiment, and referring to FIG. 32, embodiments of the present invention do not specifically limit the manner in which the hierarchical structure of a document is determined from each preliminary frame, including, but not limited to:

3201. Determining a hierarchy corresponding to each hierarchy title in each preliminary frame;

3202. Determining, based on the overall level distribution of all the preliminary frames, the preliminary frames whose levels need to be adjusted, and adjusting their levels;

3203. Generating a hierarchical structure of the document based on the overall frame obtained after all the preliminary frames are spliced.

Since the preliminary frame may be represented in the form of a tree structure, in the above step 3201, the hierarchy corresponding to each hierarchy header may be sequentially marked according to the subordinate relations between the parent node and the child nodes in the tree structure. Specifically, the hierarchy corresponding to the hierarchy header may refer to the labeling diagrams of fig. 33 and 34. The numbers near the title of each level in fig. 33 and 34 represent the corresponding level, and the smaller the number, the higher the level.

As can be seen from the above embodiments, hierarchical titles may themselves differ in level; for example, chapter-type hierarchical titles are generally higher in level than digital-type hierarchical titles. Thus, in step 3202 above, the overall level distribution of all the preliminary frames may be adjusted based on this inherent ranking; specifically, the levels corresponding to the hierarchical titles in some of the preliminary frames are adjusted. Whether a preliminary frame needs to be adjusted may be judged as follows: for any preliminary frame, determine the highest level of the hierarchical titles in that frame; if another preliminary frame contains a level higher than this highest level, perform level adjustment on the preliminary frame.

For example, for the two preliminary frames of fig. 33 and fig. 34, since chapter-type hierarchical titles are generally higher in level than digital-type hierarchical titles, the levels in the preliminary frame of fig. 34 can be adjusted based on the overall level distribution of the two frames. The highest level of the hierarchical titles in fig. 34 is 1, for example that of the hierarchical title "1.1". Only one level is higher than it, namely the level corresponding to the chapter-type hierarchical titles in fig. 33, so the levels of the hierarchical titles in fig. 34 only need to be increased by 1. The preliminary frame after level adjustment is shown in fig. 35.
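The level adjustment in this example can be sketched as follows. The sketch assumes that each preliminary frame is a mapping from hierarchical title to level (with 1 as the highest level) and that the ranking of title types is supplied as TYPE_PRIORITY; both the data layout and the names are hypothetical, and the English "Chapter N" titles stand in for "first chapter" and "second chapter".

```python
# Assumed ranking of hierarchical title types, highest first.
TYPE_PRIORITY = ["chapter", "numeric"]


def adjust_levels(frames):
    """Shift the levels of lower-ranked frames below all higher-ranked frames.

    `frames` maps a type name to {hierarchical title: level}, where each
    frame's own levels start at 1.
    """
    adjusted = {}
    offset = 0
    for name in TYPE_PRIORITY:
        levels = frames.get(name)
        if not levels:
            continue
        adjusted[name] = {title: level + offset for title, level in levels.items()}
        # Every level used by this frame sits above the next frame's levels.
        offset += max(levels.values())
    return adjusted


frames = {
    "chapter": {"Chapter 1": 1, "Chapter 2": 1},
    "numeric": {"1.1": 1, "1.2": 1, "2.1": 1},
}
adjusted = adjust_levels(frames)
```

Here the chapter frame occupies one level, so every level in the numeric frame is increased by 1, mirroring the adjustment from fig. 34 to fig. 35.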

Besides the levels corresponding to the hierarchical titles, the title items, which are the external presentation of the hierarchical titles, also stand in subordinate relationships to one another through their content. For example, in fig. 33 the chapter-type hierarchical title "first chapter" is a title item, and "first chapter" is typically followed by a digital-type title item such as "1" or "1.1", where the "first" in "first chapter" corresponds to the number "1" and thereby indicates the subordinate relationship. A digital-type title item such as "2" or "2.1", by contrast, will typically not follow "first chapter" but may follow "second chapter". Based on this, in step 3203 above, all the preliminary frames may be spliced according to the subordinate relationships between the title items. For example, the overall frame formed by splicing the two preliminary frames of fig. 33 and fig. 35 is shown in fig. 36. It should be noted that, in actual implementation, steps 3202 and 3203 may be performed simultaneously.

After the overall frame is obtained through the above process, the hierarchical structure can be obtained based on it. As described in detail in the above embodiments, the paragraphs other than title paragraphs among the paragraphs obtained by splitting are placed under the title paragraphs to which they belong, thereby yielding the hierarchical structure of the document. Each title paragraph corresponds to a hierarchical title and is placed at the position of that hierarchical title. For an example of a hierarchical structure, reference may be made to fig. 16, in which the indentation embodies the hierarchical relationship between the hierarchical titles.

After the hierarchical structure is obtained, the question-answer pairs in the document can be constructed according to it. As can be seen from the above embodiments, a title may be taken as the question in a question-answer pair, and the text that follows under the title may be taken as the reply. Thus, the hierarchical structure in fig. 16 can yield three question-answer pairs, as follows:

[ { title 1: [ title 2, paragraph 4, paragraph 5, title 6, paragraph 7] },1];

[ { title 2: [ paragraph 4, paragraph 5] },2];

[ { title 6: [ paragraph 7] },2];

wherein the content before the colon, such as title 1, represents the question or content associated with the question, the content after the colon represents the answer or content associated with the answer, and the final numerals "1" and "2" represent the level. The level is retained because it distinguishes the levels of the questions and is therefore still meaningful for constructing question-answer pairs.

It should be noted that, since a title may be of a catalog nature and thus carry no meaningful semantics of its own, such a title need not be rigidly used as a question when question-answer pairs are constructed from the hierarchical structure in actual implementation. Instead, because the level of a title distinguishes the level of the question, it can be used to determine the coverage of a question and/or the search range of an answer, which facilitates management of the dialogue resources of the question-answer pairs. For example, since a high-level title is more likely to be a catalog title, the search range can be narrowed based on the high-level title. For a low-level title, body text usually follows directly under the title, so that text can be used directly as the reply to the question.
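The construction of these three question-answer pairs can be reproduced with a short recursive sketch. Fig. 16 is not reproduced here, so the tree below is an assumed shape that is consistent with the three pairs listed above; the node layout with "text" and "children" keys and the function names are hypothetical.

```python
def flatten(node):
    """Return every title and paragraph beneath a node, in document order."""
    result = []
    for child in node.get("children", []):
        result.append(child["text"])
        result.extend(flatten(child))
    return result


def build_qa_pairs(node, level=1):
    """Return [[{question: answer list}, level], ...] for every title node."""
    pairs = []
    if node.get("children"):  # only titles have content beneath them
        pairs.append([{node["text"]: flatten(node)}, level])
        for child in node["children"]:
            pairs.extend(build_qa_pairs(child, level + 1))
    return pairs


# Assumed hierarchical structure matching the example pairs.
document = {
    "text": "title 1",
    "children": [
        {"text": "title 2",
         "children": [{"text": "paragraph 4"}, {"text": "paragraph 5"}]},
        {"text": "title 6",
         "children": [{"text": "paragraph 7"}]},
    ],
}
qa_pairs = build_qa_pairs(document)
```

Each entry keeps the level next to the pair, so a downstream retrieval step could use it to widen or narrow the search range.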

According to the method provided by the embodiment of the invention, the preliminary frames needing to be adjusted in the hierarchy are determined based on the overall hierarchy distribution of all the preliminary frames by determining the hierarchy corresponding to each hierarchy title in each preliminary frame, and the hierarchy adjustment is performed. And generating a hierarchical structure of the document based on the overall frames obtained after all the preliminary frames are spliced. Because the hierarchical structure is obtained based on the overall hierarchical distribution of all the preliminary frames and the preliminary frames after the hierarchical adjustment, the hierarchical structure of the document can be accurately determined, and further, the question-answer pairs in the subsequent constructed document are more accurate.

It should be understood that, although the steps in the flowcharts of fig. 1, 6, 14, 17, 19, 20, 21, 22, 23, 25 and 32 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of the steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1, 6, 14, 17, 19, 20, 21, 22, 23, 25 and 32 may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

It should be noted that, in actual implementation, the technical solutions described above may be implemented as independent embodiments or combined and implemented as combined embodiments. In addition, the embodiments of the invention are described above in a particular order, such as the order of data flow, merely for convenience of description; this does not limit the execution order between different embodiments. Accordingly, if several embodiments provided by the invention are to be implemented together in practice, the order in which they are set forth need not be followed, and the execution order between the embodiments may be arranged as required.

In combination with the foregoing embodiments, in one embodiment, as shown in fig. 37, there is provided a question-answer pair construction apparatus including: a splitting module 3701, a judging module 3702, a segmentation module 3703 and a construction module 3704, wherein:

a splitting module 3701 for splitting the document into paragraphs;

a judging module 3702, configured to judge whether there is a paragraph in which a title and a text coexist in the split paragraphs;

a segmentation module 3703, configured to segment, when there is a paragraph in which a title and a text coexist, the paragraph in which the title and the text coexist into different paragraphs according to the title and the text, respectively;

a construction module 3704 is configured to construct question-answer pairs in a document according to all paragraphs in the document.

In one embodiment, the splitting module 3701 is configured to: when the document is a text file, split the document into paragraphs according to the paragraph identifiers in the document; and when the document is a text image, perform character recognition on the document, determine the position information of each character in the document, determine the position information of each text line in the document according to the position information of each character, and merge different text lines according to the position information of each text line to obtain the paragraphs in the document.

In one embodiment, building module 3704 includes:

the screening sub-module is used for screening out paragraphs meeting a first preset condition from all the paragraphs, and the paragraphs are used as candidate title paragraphs, and the first preset condition is used for measuring the possibility degree of the paragraphs being the title paragraphs;

and the construction submodule is used for constructing question-answer pairs in the document according to all the candidate title paragraphs.

In one embodiment, constructing a sub-module includes:

a first determining unit configured to determine a caption paragraph from all candidate caption paragraphs;

the second determining unit is used for determining a hierarchical structure of the document according to the title paragraph and the frame template, wherein the hierarchical structure is used for representing the hierarchical relationship between hierarchical titles in the document;

and the construction unit is used for constructing question-answer pairs in the document according to the hierarchical structure.

In one embodiment, the first determining unit includes:

the acquisition subunit is used for acquiring a semantic association score of each candidate title paragraph, wherein the semantic association score is used for representing the semantic association degree between text content in the candidate title paragraph and text content in a scope of the candidate title paragraph, and the scope is used for representing the paragraph range covered by the candidate title paragraph;

And the selecting subunit is used for selecting the candidate title paragraphs with semantic association scores meeting the second preset condition from all the candidate title paragraphs as the title paragraphs.

In one embodiment, the semantic association scores include word co-occurrence scores; correspondingly, an acquisition subunit, configured to calculate, for any candidate title paragraph, a word segmentation similarity between each word segment in the candidate title paragraph and each word segment in a scope of the candidate title paragraph; determining the word co-occurrence score of each word segment in the candidate title segment according to the similarity of all the word segments corresponding to each word segment in the candidate title segment; and determining the word co-occurrence score of the candidate title according to the word co-occurrence score of each word in the candidate title.

In one embodiment, the obtaining subunit is configured to screen out, from the candidate heading section, the word segments having word co-occurrence scores greater than a seventh preset threshold; summing the word co-occurrence scores corresponding to each screened word to obtain a sum value; and calculating the ratio between the sum value and the total word score in the candidate title paragraph, and taking the ratio as the word co-occurrence score of the candidate title paragraph.

In one embodiment, the semantic association score includes a paragraph semantic score; correspondingly, the acquisition subunit is configured to: calculate, for any candidate title paragraph, a paragraph similarity between the candidate title paragraph and each paragraph in the scope of the candidate title paragraph; and determine the paragraph semantic score of the candidate title paragraph according to the paragraph similarities corresponding to the candidate title paragraph.
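A paragraph semantic score of this kind might be computed as in the sketch below. The bag-of-words cosine similarity and the use of the mean as the aggregate are assumptions chosen for illustration; the text does not specify either, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Sequence


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (an assumed stand-in for paragraph similarity)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def paragraph_semantic_score(title_paragraph: str, scope_paragraphs: Sequence[str]) -> float:
    """Average similarity between the candidate title and each paragraph in its scope."""
    if not scope_paragraphs:
        return 0.0
    similarities = [bow_cosine(title_paragraph, p) for p in scope_paragraphs]
    # Aggregating by the mean is an assumption; a maximum or weighted sum would also fit.
    return sum(similarities) / len(similarities)
```

A candidate title whose scope paragraphs share little vocabulary with it receives a score near zero, which is the signal the selecting subunit could compare against the second preset condition.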

In one embodiment, the semantic association score includes a sentence semantic score; correspondingly, the acquisition subunit is configured to: calculate, for any candidate title paragraph, a sentence similarity between each sentence in the candidate title paragraph and each sentence in the scope of the candidate title paragraph; and determine the sentence semantic score of the candidate title paragraph according to the sentence similarities corresponding to each sentence in the candidate title paragraph.
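The sentence-level variant can be sketched in the same spirit. Here the sentence splitter, the Jaccard token overlap used as sentence similarity, and the "best match per title sentence, then average" aggregation are all assumptions made for the example, and the names are hypothetical.

```python
import re
from typing import List, Sequence

# Assumed sentence boundary: Western or Chinese terminal punctuation.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text: str) -> List[str]:
    """Split a paragraph into non-empty sentences."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap (an assumed stand-in for sentence similarity)."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def sentence_semantic_score(title_paragraph: str, scope_paragraphs: Sequence[str]) -> float:
    """Average, over title sentences, of the best similarity to any scope sentence."""
    title_sentences = split_sentences(title_paragraph)
    scope_sentences = [s for p in scope_paragraphs for s in split_sentences(p)]
    if not title_sentences or not scope_sentences:
        return 0.0
    best = [max(jaccard(t, s) for s in scope_sentences) for t in title_sentences]
    return sum(best) / len(best)
```

For the title "How to reset the device?" and a scope whose first sentence shares four of eight distinct tokens with it, the score is 0.5.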

In one embodiment, the first determining unit is configured to screen out, from the split paragraphs, the paragraphs that meet a third preset condition and use them as the title paragraphs, where the third preset condition is set based on catalog title features.

In one embodiment, the construction sub-module further comprises:

a third determining unit, configured to determine a hierarchical title type corresponding to the title item in each title paragraph;

the statistics unit is used for counting the total occurrence times of each hierarchical title type;

and the screening unit is used for screening out the title paragraphs corresponding to the hierarchical title types with the total times smaller than the eighth preset threshold value from all the title paragraphs.
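The three units above can be illustrated with a short sketch. The title-item patterns and type names are hypothetical, and "screening out" is read here as removing the title paragraphs whose hierarchical title type is rare, which is an interpretation rather than something the text states outright.

```python
import re
from collections import Counter
from typing import List

# Hypothetical hierarchical title types, checked in order.
TITLE_TYPE_PATTERNS = [
    ("chapter", re.compile(r"^Chapter\s+\d+")),
    ("decimal_2", re.compile(r"^\d+\.\d+\s")),
    ("decimal_1", re.compile(r"^\d+\.\s")),
    ("parenthesized", re.compile(r"^[(（]\d+[)）]")),
]


def title_type(title_paragraph: str) -> str:
    """Return the first matching hierarchical title type, or "unknown"."""
    for name, pattern in TITLE_TYPE_PATTERNS:
        if pattern.match(title_paragraph):
            return name
    return "unknown"


def drop_rare_title_types(title_paragraphs: List[str], min_count: int) -> List[str]:
    """Remove title paragraphs whose type occurs fewer than min_count times.

    min_count plays the role of the "eighth preset threshold".
    """
    counts = Counter(title_type(p) for p in title_paragraphs)
    return [p for p in title_paragraphs if counts[title_type(p)] >= min_count]
```

For example, with `min_count=2`, a lone "(1) Stray note" among several "1." and "2.1" style titles is dropped, while the numbered titles are kept.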

In one embodiment, the second determining unit includes:

a first determining subunit, configured to determine, according to the title item of each title paragraph and the title item in each frame template, a frame template that matches the title paragraph set, and use the frame template as a target frame template;

a filling subunit, configured to fill the title item of each title paragraph to a matched target frame template to form a preliminary frame; wherein, the title items filled into the target frame template are used as the hierarchical titles in the preliminary frame;

and the second determining subunit is used for determining the hierarchical structure of the document according to each preliminary frame.
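One way to picture the matching and filling steps is sketched below. The representation of a frame template as a mapping from level to a title-item pattern, the two sample templates, and the "most matched title items wins" rule are all assumptions for illustration; the text does not define the template format at this point.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical frame template: hierarchy level -> pattern of the title item at that level.
FrameTemplate = Dict[int, re.Pattern]

TEMPLATES: Dict[str, FrameTemplate] = {
    "decimal": {1: re.compile(r"^\d+\.\s"), 2: re.compile(r"^\d+\.\d+\s")},
    "chapter": {1: re.compile(r"^Chapter\s+\d+"), 2: re.compile(r"^Section\s+\d+")},
}


def level_in(template: FrameTemplate, title: str) -> Optional[int]:
    """Return the level whose title-item pattern matches the title, if any."""
    for level, pattern in sorted(template.items()):
        if pattern.match(title):
            return level
    return None


def choose_template(titles: List[str], templates: Dict[str, FrameTemplate]) -> str:
    """Pick the template that matches the most title items (assumed matching rule)."""
    return max(
        templates,
        key=lambda name: sum(level_in(templates[name], t) is not None for t in titles),
    )


def fill_frame(titles: List[str], template: FrameTemplate) -> List[Tuple[int, str]]:
    """Fill matching titles into the template, yielding (level, title) entries."""
    frame = []
    for title in titles:
        level = level_in(template, title)
        if level is not None:
            frame.append((level, title))
    return frame
```

The list returned by `fill_frame` stands in for a preliminary frame: each filled title item carries the level it would occupy as a hierarchical title.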

In one embodiment, the second determining unit further comprises:

and the verification subunit is used for verifying the hierarchical titles in each preliminary frame based on a preset mode, wherein the preset mode comprises at least one of the following: cutting, supplementing and correcting.

In one embodiment, the second determining subunit is configured to: determine the level corresponding to each hierarchical title in each preliminary frame; determine, based on the overall level distribution of all the preliminary frames, the preliminary frames whose levels need to be adjusted, and adjust their levels; and generate the hierarchical structure of the document based on the overall frame obtained by splicing all the preliminary frames.

According to the device provided by the embodiment of the invention, the document is split into paragraphs, and it is judged whether a paragraph in which a title and text coexist exists among the split paragraphs. If such a paragraph exists, it is segmented into different paragraphs according to the title and the text respectively. Question-answer pairs in the document are then constructed according to all paragraphs in the document. Because paragraphs in which a title and text coexist can be segmented into different paragraphs, a title and text located in the same paragraph can be identified and used to construct question-answer pairs, so the application range is wider. In addition, question-answer pairs can be constructed without deleting content, which improves the accuracy of subsequent automatic answering.
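The overall flow summarized above can be sketched end to end. This is a deliberately simplified illustration under stated assumptions: paragraphs are taken to be newline-separated, a coexisting title and text are detected by a numbered title item followed by a colon, and each title is paired directly with the text that follows it rather than through the hierarchical structure. The patterns and function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed shape of a paragraph in which a title and text coexist: "1.2 Title: text".
TITLE_WITH_TEXT = re.compile(
    r"^(?P<title>\d+(?:\.\d+)*\.?\s+[^:：]{1,40})[:：]\s*(?P<text>\S.*)$"
)
# Assumed shape of a stand-alone title paragraph: numbered and short.
TITLE_ONLY = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S.{0,40}$")


def split_paragraphs(document: str) -> List[str]:
    """Split the document into non-empty paragraphs (one per line, by assumption)."""
    return [p.strip() for p in document.split("\n") if p.strip()]


def separate_title_and_text(paragraphs: List[str]) -> List[str]:
    """Segment each coexisting title-and-text paragraph into two paragraphs."""
    result = []
    for paragraph in paragraphs:
        match = TITLE_WITH_TEXT.match(paragraph)
        if match:
            result.extend([match.group("title").strip(), match.group("text").strip()])
        else:
            result.append(paragraph)
    return result


def build_qa_pairs(paragraphs: List[str]) -> List[Tuple[str, str]]:
    """Pair each title (question) with the text up to the next title (answer)."""
    pairs, question, answer = [], None, []
    for paragraph in paragraphs:
        if TITLE_ONLY.match(paragraph):
            if question is not None and answer:
                pairs.append((question, " ".join(answer)))
            question, answer = paragraph, []
        elif question is not None:
            answer.append(paragraph)
    if question is not None and answer:
        pairs.append((question, " ".join(answer)))
    return pairs
```

Note that no text is discarded when the title and text are separated, which mirrors the point that question-answer pairs can be built without deleting content. The regular expressions are only heuristics and would misclassify body text that happens to begin with a number.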

For specific limitations on the question-answer pair construction device, reference may be made to the above limitations on the question-answer pair construction method, which are not repeated here. The modules in the question-answer pair construction device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in FIG. 38. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store the preset thresholds. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a question-answer pair construction method.

It will be appreciated by those skilled in the art that the structure shown in FIG. 38 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:

splitting a document into paragraphs;

In one embodiment, the processor when executing the computer program further performs the steps of:

In one embodiment, when the processor executes the computer program, the first preset condition includes at least one of the following conditions: the paragraph sentence length is a first preset threshold value, the total word number of the paragraph is smaller than a second preset threshold value, the total punctuation number of the paragraph is smaller than a third preset threshold value, and the paragraph format satisfies the preset format.

determining a title paragraph from all candidate title paragraphs;

In one embodiment, the semantic association score comprises at least one of the following scores when the processor executes the computer program, the second preset condition being determined by the following sub-conditions;

In one embodiment, the semantic association scores include word co-occurrence scores; accordingly, the processor when executing the computer program also performs the steps of:

for any candidate title paragraph, calculating the word segmentation similarity between each word segmentation in the candidate title paragraph and each word segmentation in the scope of the candidate title paragraph;

determining the word co-occurrence score of each word segment in the candidate title paragraph according to all the word segment similarities corresponding to that word segment;

and determining the word co-occurrence score of the candidate title paragraph according to the word co-occurrence score of each word segment in the candidate title paragraph.

selecting word segments with word co-occurrence scores greater than a seventh preset threshold value from the candidate title segments;

and calculating the ratio between the sum value and the total word score in the candidate title paragraph, and taking the ratio as the word co-occurrence score of the candidate title paragraph.

In one embodiment, the semantic association scores comprise paragraph semantic scores; accordingly, the processor when executing the computer program also performs the steps of:

for any candidate title paragraph, calculating the paragraph similarity between the candidate title paragraph and each paragraph in the scope of the candidate title paragraph;

and determining the paragraph semantic score of the candidate title paragraph according to the paragraph similarities corresponding to the candidate title paragraph.

In one embodiment, the semantic association scores comprise sentence semantic scores; accordingly, the processor when executing the computer program also performs the steps of:

for any candidate title paragraph, calculating the sentence similarity between each sentence in the candidate title paragraph and each sentence in the scope of the candidate title paragraph;

and determining the sentence semantic score of the candidate title paragraph according to the sentence similarities corresponding to each sentence in the candidate title paragraph.

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

splitting a document into paragraphs;

In one embodiment, the computer program when executed by the processor further performs the steps of:

In one embodiment, when the computer program is executed by the processor, the first preset condition includes at least one of the following conditions: the paragraph sentence length is a first preset threshold value, the total word number of the paragraph is smaller than a second preset threshold value, the total punctuation number of the paragraph is smaller than a third preset threshold value, and the paragraph format satisfies the preset format.

determining a title paragraph from all candidate title paragraphs;

In one embodiment, the semantic association score comprises at least one of the following scores when the computer program is executed by the processor, the second preset condition being determined by the following sub-conditions;

In one embodiment, the semantic association scores include word co-occurrence scores; accordingly, the computer program when executed by the processor also performs the steps of:

In one embodiment, the semantic association scores comprise paragraph semantic scores; accordingly, the computer program when executed by the processor also performs the steps of:

In one embodiment, the semantic association scores comprise sentence semantic scores; accordingly, the computer program when executed by the processor also performs the steps of:

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the computer program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. The volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples illustrate only a few embodiments of the application. They are described specifically and in detail, but are not thereby to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. The method for constructing the question-answer pair is characterized by comprising the following steps:

splitting a document into paragraphs;

and constructing question-answer pairs in the document according to all paragraphs in the document.

2. The method of claim 1, wherein splitting the document into paragraphs comprises:

3. The method of claim 1, wherein constructing question-answer pairs in the document from all paragraphs in the document comprises:

selecting paragraphs meeting a first preset condition from all the paragraphs, wherein the first preset condition is used for measuring the possibility degree of the paragraphs being the title paragraphs, and the paragraphs are taken as candidate title paragraphs;

4. A method according to claim 3, wherein the first preset condition comprises at least one of the following conditions: the paragraph sentence length is a first preset threshold value, the total word number of the paragraphs is smaller than a second preset threshold value, the total punctuation number of the paragraphs is smaller than a third preset threshold value, and the paragraph format satisfies the preset format.

5. The method of claim 3 or 4, wherein said constructing question-answer pairs in said document from all candidate title paragraphs comprises:

determining a title paragraph from all the candidate title paragraphs;

and constructing question-answer pairs in the document according to the hierarchical structure.

6. The method of claim 5, wherein said determining a title paragraph from said all candidate title paragraphs comprises:

obtaining a semantic association score of each candidate title paragraph, wherein the semantic association score is used for representing the semantic association degree between text content in the candidate title paragraph and text content in a scope of the candidate title paragraph, and the scope is used for representing a paragraph range covered by the candidate title paragraph;

and selecting the candidate title paragraphs with semantic association scores meeting a second preset condition from all the candidate title paragraphs as title paragraphs.

7. The method of claim 6, wherein the semantic association score comprises at least one of the following scores, the second preset condition being determined by the following sub-conditions;

8. The method of claim 7, wherein the semantic association score comprises a word co-occurrence score; accordingly, the obtaining the semantic association score of each candidate title paragraph includes:

for any candidate title, calculating the word segmentation similarity between each word segmentation in the any candidate title and each word segmentation in the scope of the any candidate title;

9. The method of claim 8, wherein the determining the word co-occurrence score for any candidate heading segment based on the word co-occurrence score for each segmented word in the any candidate heading segment comprises:

and calculating the ratio between the sum value and the total word score in any candidate title paragraph, and taking the ratio as the word co-occurrence score of any candidate title paragraph.

10. The method of claim 7, wherein the semantic association score comprises a paragraph semantic score; accordingly, the obtaining the semantic association score of each candidate title paragraph includes:

for any candidate title, calculating the paragraph similarity between the any candidate title and each paragraph in the scope of the any candidate title;

and determining the paragraph semantic score of any candidate title paragraph according to the similarity of each paragraph corresponding to the any candidate title paragraph.

11. The method of claim 7, wherein the semantic association score comprises a sentence semantic score; accordingly, the obtaining the semantic association score of each candidate title paragraph includes:

for any candidate heading section, calculating the sentence similarity between each sentence in the any candidate heading section and each sentence in the scope of the any candidate heading section;

and determining the sentence semantic score of any candidate heading section according to the sentence similarity corresponding to each sentence in the candidate heading section.

12. The method of claim 5, wherein said determining a title paragraph from said all candidate title paragraphs comprises:

and screening out paragraphs meeting a third preset condition from the split paragraphs, and taking the paragraphs as title paragraphs, wherein the third preset condition is set based on the catalog title characteristics.

13. The method of claim 5, wherein prior to determining the hierarchical structure of the document from the title paragraph and frame template, further comprising:

14. The method of claim 5, wherein determining the hierarchical structure of the document from the title paragraph and frame template comprises:

according to each preliminary frame, a hierarchical structure of the document is determined.

15. The method of claim 14, wherein prior to determining the hierarchical structure of the document from each preliminary frame, further comprising:

and verifying the hierarchical titles in each preliminary frame based on a preset mode, wherein the preset mode comprises at least one of the following: cutting, supplementing and correcting.

16. The method of claim 14, wherein said determining a hierarchical structure of the document from each preliminary frame comprises:

17. A question-answer pair construction device, characterized in that the device comprises:

the splitting module is used for splitting the document into paragraphs;

the judging module is used for judging whether a paragraph with a title and a text coexisting exists in the split paragraphs;

the segmentation module is used for segmenting the paragraphs with the coexisting title and text into different paragraphs according to the title and text respectively when the paragraphs with the coexisting title and text exist;

and the construction module is used for constructing question-answer pairs in the document according to all paragraphs in the document.

18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.

19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 16.

20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 16.