CN113641783A

CN113641783A - Key sentence based content block retrieval method, device, equipment and medium

Info

Publication number: CN113641783A
Application number: CN202010345947.1A
Authority: CN
Inventors: 林得苗
Original assignee: Pai Tech Co ltd
Current assignee: Pai Tech Co ltd
Priority date: 2020-04-27
Filing date: 2020-04-27
Publication date: 2021-11-12

Abstract

The invention discloses a key sentence based content block retrieval method, apparatus, device and medium. The method comprises: obtaining, based on a relevance scoring model of a key sentence, a relevance score between each content block of a document to be retrieved and the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; determining, from the content blocks, a target content block related to the key sentence based on the relevance scores between the content blocks and the key sentence; and taking the target content block as the content block retrieval result of the key sentence in the document to be retrieved. The key sentence based content block retrieval method, apparatus, device and medium provided by the embodiments of the invention can improve the retrieval accuracy for the document.

Description

Key sentence based content block retrieval method, device, equipment and medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for retrieving a content block based on a key sentence.

Background

To obtain required content from a document to be retrieved, a user typically has to enter a keyword manually in a document tool, which then locates the positions related to the keyword in the document. Taking a WORD document as an example, the built-in "search" function can locate the search results for a keyword in the document, such as the sentences in which the keyword appears. The retrieval accuracy of this approach is low.

Disclosure of Invention

The content block retrieval method, the content block retrieval device, the content block retrieval equipment and the content block retrieval medium based on the key sentences can improve the content block retrieval accuracy of the document.

In a first aspect, a method for retrieving a content block based on a key sentence is provided, which includes: obtaining, based on a relevance scoring model of the key sentence, a relevance score between each content block of a document to be retrieved and the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; determining, from the content blocks, a target content block related to the key sentence based on the relevance scores between the content blocks and the key sentence; and taking the target content block as the content block retrieval result of the key sentence in the document to be retrieved.

According to the content block retrieval method based on key sentences, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated using the relevance scoring model of the key sentence, and a target content block related to the key sentence can be selected from the document to be retrieved based on the relevance scores. Because the relevance score accurately represents how relevant a content block is to the key sentence, the method improves the retrieval accuracy of content blocks compared with searching for a keyword in the document to be retrieved.

In an optional implementation manner, obtaining a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance scoring model of the key sentence includes: extracting the features of the content blocks of the document to be retrieved; and inputting the features of the content block into the relevance scoring model to obtain the relevance score between the content block and the key sentence.

In this example, by extracting the features of the content block and then calculating the relevancy score between the content block and the key sentence using the features of the content block, the calculation accuracy of the relevancy score can be ensured and the calculation speed can be increased.

In an alternative embodiment, the characteristics of the content block include at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the superior title of the title corresponding to the content block.

In the example, by calculating the word features of the content blocks, the text features of the content blocks can be accurately identified, and the calculation accuracy of the relevancy scores is improved. By using the context word features of the content block, the calculation accuracy of the relevancy score can be improved according to the word features of the surrounding content blocks of the content block. Through the word characteristics of the superior titles of the titles corresponding to the content blocks, the relevancy score can be calculated according to the relevancy between the content blocks and the superior titles, and the calculation accuracy of the relevancy score is improved.

In an alternative embodiment, the extracting the feature of the content block of the document to be retrieved includes: if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation; if the characteristics comprise the contextual word characteristics of the content block, obtaining the contextual word characteristics of the content block based on the content block and the adjacent content blocks of the content block; if the characteristics include word characteristics of the upper level title of the content block, the word characteristics of the upper level title of the content block are obtained based on the upper level title of the content block.

In this example, the features of the content block can be accurately obtained, thereby ensuring the accuracy of the correlation calculation.

In an alternative embodiment, obtaining the word feature of the upper level title of the content block based on the upper level title of the content block includes: determining an upper level title of the content block based on the title logical tree; the word feature of the upper level title of the content block is obtained based on the upper level title of the content block.

In this example, since the title corresponding to any node in the title logical tree is the upper-level title of the titles corresponding to the child nodes of that node, the hierarchical relationship between the titles can be determined by establishing the title logical tree, so that the upper-level title of the content block can be obtained accurately and the calculation accuracy of the relevance score can be improved.

In an optional embodiment, the method further comprises: acquiring a title ordered sequence of the document to be retrieved; taking the titles in the title ordered sequence in turn as a first title; and for each first title, performing the following operations: if the subtree in which the previous title of the first title is located in the title logical tree contains a second title at the same level as the first title, taking the first title as a sibling node of the second title, wherein the parent node of the sibling node is the same as the parent node of the second title; and if the subtree in which the previous title is located contains no such second title, taking the first title as a child node of the previous title.

In prior-art methods that generate a directory structure using a title template, the number of levels of the resulting structure is limited to the number set in the title template. For example, if only 3 title levels are set in the template, at most 3 title levels can be generated. With the method in the embodiment of the invention, each title can be compared with the titles already added to the title logical tree to determine whether they are at the same level; if no title at the same level is found, the title is taken as a child node of the previous title. Therefore, even if the document to be processed has more title levels, for example 8 or 9 levels, the corresponding title levels can be generated. Compared with generating the directory structure using a title template, the method improves the flexibility, accuracy and depth of the generated directory structure.

In an optional embodiment, the method further comprises, for the relevance scoring model of the key sentence, performing the following operations: marking the content block samples related to the key sentence as positive samples, and marking the content block samples unrelated to the key sentence as negative samples; and training the relevance scoring model of the key sentence using the positive samples and the negative samples.

The relevance scoring model is trained by a method of marking the content blocks as positive and negative samples, so that the accuracy of the model can be improved.

In an alternative embodiment, determining a target content block related to a key sentence from the content blocks based on the relevance scores between the content blocks and the key sentence includes: determining the N content blocks with the highest relevance scores as target content blocks.

In this embodiment, since the relevance score accurately represents the relevance between the content block and the sentence to be retrieved, the retrieval accuracy can be improved by calculating the relevance score. In addition, content blocks with low relevance to the sentence to be retrieved are filtered out, so that the content block retrieval efficiency can be improved and the content block retrieval result expected by the user can be retrieved.
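
As an illustration of selecting the N highest-scoring content blocks, the following is a minimal Python sketch. The function name `top_n_blocks` and the score values are assumptions made for illustration only and do not come from the embodiment.

```python
import heapq

def top_n_blocks(scores, n):
    """Return the indices of the n highest-scoring content blocks, best first."""
    # scores[i] is the relevance score of content block i for one key sentence.
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])

# Illustrative relevance scores for five content blocks (hypothetical values).
scores = [0.12, 0.91, 0.40, 0.77, 0.05]
target_blocks = top_n_blocks(scores, 2)
print(target_blocks)
```

Returning indices rather than scores keeps the link back to the content blocks in the document, so the target content blocks can be reported directly as the retrieval result.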

In a second aspect, a content block retrieval apparatus based on a key sentence is provided, including: the calculation module is used for obtaining a relevance score of a content block of a document to be retrieved and the key sentences based on the relevance score model of the key sentences, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; the determining module is used for determining a target content block related to the key sentences from the content blocks based on the relevancy scores of the content blocks and the key sentences; and the result processing module is used for taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.

In a third aspect, a content block retrieval device based on a key sentence is provided, including: a memory for storing a program; a processor, configured to execute a program stored in the memory to execute the method for retrieving a content block based on a key statement provided in the first aspect or any optional implementation manner of the first aspect.

In a fourth aspect, a computer storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the method for retrieving a content block based on a key sentence provided in the first aspect or any optional implementation manner of the first aspect.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart diagram illustrating a key sentence based content block retrieval method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an exemplary title logical tree, in accordance with an embodiment of the present invention;

FIG. 3 is a schematic structural diagram illustrating a content block retrieval apparatus based on a key sentence according to an embodiment of the present invention;

FIG. 4 is a block diagram of an exemplary hardware architecture of a content block retrieval device based on a key sentence in the embodiment of the present invention.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiment of the invention provides a content block retrieval scheme based on key sentences, which is suitable for scenarios in which a sentence to be retrieved is input for a document in order to retrieve the key content blocks of the document. The scheme is particularly suitable for retrieving documents with a specific text structure, for example complex financial information texts such as prospectuses, bond prospectuses, annual reports, financial reports, merger and reorganization reports, rating reports, research reports, legal contract documents, and public opinion news. The method is particularly suitable for searching the content blocks in such documents. After the key sentence is obtained, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated, the target content blocks related to the key sentence are determined from the document to be retrieved according to the relevance scores, and the target content blocks are used as the content block retrieval results of the key sentence in the document to be retrieved.

In the embodiment of the invention, the document to be retrieved refers to an electronic document capable of acquiring the text and diagram information of the document. Specifically, it may be an electronic document in a WORD format, a PDF format, TXT, or the like. In addition, the document to be retrieved can be regarded as being composed of a plurality of paragraphs, wherein a table, a picture, a chart, a title and the like can be regarded as one paragraph respectively. Therefore, the document to be retrieved can be divided into a plurality of content blocks independent of each other in units of paragraphs. That is, the content block of the document to be retrieved includes at least one of a text content paragraph, a title, a table, a chart, and a picture.

A document to be retrieved often contains multi-level titles. In order of hierarchy from high to low, these are the first-level title, the second-level title, the third-level title, and so on. A high-level title often has multiple low-level titles under it, and these low-level titles are subordinate to the high-level title. An L-th level title is subordinate to the L-1 titles above it, and all of those L-1 titles are upper-level titles of the L-th level title. Illustratively, if Chapter 2 of the document to be retrieved contains a fifth-level title "(1) Fixed asset status", its upper-level titles, in order of hierarchy from low to high, are: the fourth-level title "19. Fixed assets", the third-level title "7. Notes to consolidated financial statement items", the second-level title "Section 11. Financial report", and the first-level title "Chapter 2. Fixed assets". For ease of understanding, the following embodiments of the present invention continue to use this fifth-level title as an example.

Since a title tends to be a high-level summary of the content of one or more successive text content paragraphs, each title tends to be followed immediately by one or more successive content blocks, such as text content paragraphs, pictures, charts, tables, and the like. In the embodiment of the present invention, a content block immediately following a certain title may be considered to correspond to that title. Illustratively, suppose the content blocks appear in the document to be retrieved in the following order: third-level title A₃₁, text content paragraph B₂, table C₁, chart D₁, fourth-level title A₄₁, text content paragraph B₃, text content paragraph B₄, third-level title A₃₂. Then text content paragraph B₂, table C₁ and chart D₁ correspond to the third-level title A₃₁, and text content paragraphs B₃ and B₄ correspond to the fourth-level title A₄₁.
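
The correspondence between content blocks and titles described above can be sketched as a single pass over the blocks in document order. The following Python sketch is illustrative only; the function name `assign_titles` and the `(name, kind)` representation of a content block are assumptions, not part of the embodiment.

```python
def assign_titles(blocks):
    """Map each non-title content block to the nearest preceding title.

    blocks: list of (name, kind) pairs in document order. kind == "title"
    marks a title; any other kind is a text paragraph, table, chart or picture.
    """
    mapping = {}
    current_title = None
    for name, kind in blocks:
        if kind == "title":
            current_title = name
        elif current_title is not None:
            mapping[name] = current_title
    return mapping

# The block sequence from the example above.
blocks = [
    ("A31", "title"), ("B2", "text"), ("C1", "table"), ("D1", "chart"),
    ("A41", "title"), ("B3", "text"), ("B4", "text"), ("A32", "title"),
]
mapping = assign_titles(blocks)
print(mapping)
```

Content blocks that appear before the first title are left unmapped in this sketch, which is one possible design choice.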

In order to better understand the technical solution of the embodiment of the present invention, a method, an apparatus, a device, and a medium for retrieving a content block based on a key sentence according to the embodiment of the present invention will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart illustrating a key sentence-based content block retrieval method according to an embodiment of the present invention. As shown in fig. 1, the content block retrieval method 100 based on key sentences in the present embodiment may include the following steps:

S110, obtaining the relevance score between each content block of the document to be retrieved and the key sentence based on the relevance scoring model of the key sentence.

First, regarding the key sentence: before S110, a plurality of key sentences may be preset, and one of them is then selected as the key sentence. A key sentence characterizes the content that the user desires to retrieve from the document to be retrieved. For example, for a prospectus, the key sentence may be "net profit", "main business income", or the like. The key sentence can be set according to actual requirements and the application scenario of the document to be retrieved, and is not limited herein. In addition, the key sentence may consist of at least one complete sentence or at least one word, which is not particularly limited.

Next, regarding the relevance score between a content block and the key sentence: the relevance score indicates the degree of relevance between the content block and the key sentence. The higher the relevance score, the more relevant the content block is to the key sentence. For example, if the document to be retrieved includes M content blocks, the relevance score between each of the M content blocks and the key sentence may be calculated. Optionally, the relevance score takes values in the range [0, 1]. In some embodiments, the relevance score may be determined based on a pre-trained relevance scoring model of the key sentence.

Then, regarding the relevance scoring model of the key sentence: the relevance scoring model may be a Gradient Boosting Decision Tree (GBDT) regression model or a logistic regression model. Preferably, to balance the speed and accuracy of score calculation, the relevance scoring model is a logistic regression model.

In the process of training the relevance scoring model, L content block samples may first be selected; among them, the content block samples related to the key sentence are marked as positive samples, and those unrelated to the key sentence are marked as negative samples. The relevance scoring model of the key sentence is then trained using the positive samples and the negative samples. To obtain the L content block samples, K training documents may be selected, and all content blocks in the K training documents are used as content block samples, so that the total number of content blocks in the K training documents is L. The label of a positive sample may be set to 1 and the label of a negative sample may be set to 0; that is, the expected predicted relevance score is 1 for positive samples and 0 for negative samples. Optionally, if the relevance score is calculated using the features of the content block, and the features include R sub-features, the training data for the relevance scoring model may be implemented as a two-dimensional data matrix with L rows and R columns.
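
To make the training setup concrete, the following is a minimal Python sketch of fitting a logistic regression model on an L x R feature matrix with 0/1 labels. It uses plain batch gradient descent; the optimizer, the learning rate, the number of epochs, the function names and the toy data are all assumptions for illustration, since the embodiment does not specify how the model is fitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    X: L rows (content block samples) by R columns (sub-features).
    y: labels, 1 for positive samples and 0 for negative samples.
    """
    n_rows, n_cols = len(X), len(X[0])
    w, b = [0.0] * n_cols, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_cols, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n_rows for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n_rows
    return w, b

def relevance_score(w, b, features):
    """Predicted relevance score in the open interval (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)

# Toy training matrix with L = 4 samples and R = 2 sub-features (hypothetical).
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print(relevance_score(w, b, [1.0, 0.0]), relevance_score(w, b, [0.0, 1.0]))
```

Because the logistic function maps any input to a value between 0 and 1, the model output can be read directly as a relevance score in the range described above.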

In addition, as noted above, several key sentences may be preset before S110, with one of them selected as the key sentence. To ensure the calculation accuracy and speed of the relevance score, a separate relevance scoring model can be established for each of the preset key sentences. After the key sentence is determined, the relevance scoring model of that key sentence can be selected from the trained relevance scoring models. The relevance scoring model of each key sentence is trained in the same way as described above, which is not repeated herein.

In one example, if financial information texts are to be retrieved using the key sentence based content block retrieval scheme provided by the embodiment of the invention, the documents disclosed by the major stock exchanges can be used as training documents. To verify the accuracy of the relevance scoring model trained by the embodiment of the invention, the relevance scoring model of each key sentence can be evaluated using, as the evaluation index, the recall rate of the 5 content blocks most relevant to the key sentence in the training text. If there are P key sentences in total, the average of the evaluation indexes of the P key sentences may be used as the total evaluation index. When the documents of the Hong Kong Stock Exchange, the Shanghai Stock Exchange and the Shenzhen Stock Exchange are used in turn as training samples, the total evaluation index of the relevance scoring model reaches 88%, 97% and 93% respectively, which indicates that the relevance scoring model is highly accurate.
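
The evaluation index can be sketched as follows. This Python sketch assumes one common reading of the index, namely that the recall for a key sentence is the fraction of its truly relevant content blocks that appear among the 5 highest-scoring blocks; the embodiment does not give an exact formula, and the function names and sample values are hypothetical.

```python
def recall_at_k(scores, relevant, k=5):
    """Fraction of relevant block indices found among the k top-scored blocks."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)

def mean_recall_at_k(per_sentence, k=5):
    """Average recall over P key sentences, used as the total evaluation index."""
    values = [recall_at_k(scores, relevant, k) for scores, relevant in per_sentence]
    return sum(values) / len(values)

# Two key sentences with hypothetical scores and sets of relevant block indices.
per_sentence = [
    ([0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5], {0, 2, 5}),
    ([0.2, 0.9], {1}),
]
total_index = mean_recall_at_k(per_sentence)
print(total_index)
```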

Finally, in order to ensure the calculation accuracy of the correlation score, the features of each content block may be extracted first, and then the correlation score between the content block and the key sentence may be calculated using the features. Accordingly, a specific embodiment of S110 includes:

In the first step, the features of the content blocks of the document to be retrieved are extracted. The features of a content block include at least one of: the word feature of the content block, the context word feature of the content block, and the word feature of the upper-level title of the title corresponding to the content block. Optionally, the features of the content block may also include the position of the content block in the document. Depending on which features are used, the specific implementation of the first step differs. The first step is described below in three examples, one for each kind of feature.

In one example, the specific implementation of extracting the features of the content blocks of the document to be retrieved includes: if the features of the content block include the word feature of the content block, preprocessing the content block, and then acquiring the word feature of the preprocessed content block.

First, regarding the preprocessing operation: if the document to be retrieved is a Chinese document, the preprocessing may include a word segmentation operation. Word segmentation may be performed using the jieba word segmentation technique, or another suitable word segmentation method may be selected according to the specific working scenario and requirements, which is not limited herein. During word segmentation, the set of characters of all cells in a table can be used as the text content of the table, and word segmentation is performed on that text content. It should be noted that, if the document to be retrieved is a Chinese document, the content block may also be split into character sequences instead of being segmented into words. For example, the splitting may be performed using an n-gram model: if n takes a value of 2, the text content of the content block is divided into consecutive units of two adjacent characters each.
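
The n-gram alternative can be illustrated with a short Python sketch. It assumes overlapping character n-grams with whitespace ignored; the function name `char_ngrams` and the sample string are illustrative assumptions rather than details taken from the embodiment.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping n-character units, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

# Illustrative five-character Chinese string, split with n = 2.
units = char_ngrams("内容块检索", 2)
print(units)
```

A text shorter than n characters yields no units in this sketch, so very short content blocks would need separate handling.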

The preprocessing may also include an operation of removing redundant words. Specifically, a lowest word frequency and/or a highest word frequency may be set. With a lowest word frequency, if the frequency with which a word appears in the document to be retrieved is lower than the lowest word frequency, that low-frequency word can be deleted from the content block, so that unusual words, stop words and misspelled words in the document are removed. With a highest word frequency, if the frequency with which a word appears in the document to be retrieved is higher than the highest word frequency, that high-frequency word can be deleted from the content block, so that words without actual meaning, such as modal particles and structural particles (for example 'yes'), are removed. Removing these words can improve the accuracy of the relevance score.
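
The frequency-based filtering can be sketched in Python as follows. The sketch assumes that "frequency" means the raw number of occurrences across all content blocks of the document; the function name, the thresholds and the sample tokens are hypothetical.

```python
from collections import Counter

def filter_by_frequency(blocks, min_count=None, max_count=None):
    """Drop tokens whose document-wide count is below min_count or above max_count.

    blocks: list of token lists, one list per content block.
    """
    counts = Counter(tok for block in blocks for tok in block)

    def keep(tok):
        c = counts[tok]
        if min_count is not None and c < min_count:
            return False
        if max_count is not None and c > max_count:
            return False
        return True

    return [[tok for tok in block if keep(tok)] for block in blocks]

# "the" occurs 3 times, "profit" twice, all other tokens once.
blocks = [["net", "profit", "the"], ["the", "profit", "xq"], ["the", "income"]]
filtered = filter_by_frequency(blocks, min_count=2, max_count=2)
print(filtered)
```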

Secondly, regarding the word feature of the content block: the word feature may be calculated based on the term frequency-inverse document frequency (tfidf) algorithm, in which case the word feature of the content block may be a word vector sparse matrix. Specifically, for each term of each content block, the tfidf algorithm may be used to extract a first word feature characterizing the term in tables and a second word feature characterizing the term in non-table content. Illustratively, the word feature may be a tfidf value. In addition, the tfidf algorithm may be applied to the text content of the preprocessed content block to obtain the word vector sparse matrix of the content block. It should be noted that, if the text in the document to be retrieved is not Chinese, the content block may also be processed directly with the tfidf algorithm, without preprocessing, to obtain its word vector sparse matrix. Other algorithms may also be used to obtain the word features of the content block, which is not limited herein.
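
As an illustration of the tfidf computation, the following Python sketch treats each content block as one "document" and stores each row of the sparse matrix as a dictionary of non-zero-capable entries. The specific variant used here (term frequency normalized by block length, inverse document frequency as the natural logarithm of N divided by the block count of the term) is an assumption; the embodiment does not state which tfidf variant is used.

```python
import math
from collections import Counter

def tfidf_vectors(blocks):
    """Return one sparse tfidf vector (a dict of term -> value) per content block.

    blocks: list of token lists, one list per preprocessed content block.
    """
    n = len(blocks)
    # Number of content blocks in which each term occurs at least once.
    df = Counter(tok for block in blocks for tok in set(block))
    vectors = []
    for block in blocks:
        tf = Counter(block)
        vectors.append({
            tok: (count / len(block)) * math.log(n / df[tok])
            for tok, count in tf.items()
        })
    return vectors

# Two toy content blocks (hypothetical tokens).
vectors = tfidf_vectors([["net", "profit"], ["net", "income"]])
print(vectors)
```

In this variant a term that occurs in every content block receives a value of 0, which reflects that such a term does not help to tell content blocks apart.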

In another example, if the feature of the content block includes a context word feature of the content block, the context word feature of the content block is obtained based on the content block and an adjacent content block of the content block. The word features of the adjacent content blocks can be spliced to obtain the context word features of the content blocks. In addition, other algorithms may be used to obtain the context word features of the content block, which is not limited to this.
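
The splicing of adjacent word features can be sketched as a concatenation of the neighbors' feature vectors. In the following Python sketch, the window of one block on each side, the zero padding at the start and end of the document, and the function name are assumptions made for illustration.

```python
def context_features(block_vectors, index, window=1):
    """Concatenate the word feature vectors of the blocks adjacent to block `index`.

    block_vectors: list of equal-length dense feature vectors, in document order.
    Missing neighbors at the document boundaries are padded with zeros.
    """
    dim = len(block_vectors[0])
    parts = []
    for j in range(index - window, index + window + 1):
        if j == index:
            continue
        if 0 <= j < len(block_vectors):
            parts.extend(block_vectors[j])
        else:
            parts.extend([0.0] * dim)
    return parts

# Three content blocks with two-dimensional word features (hypothetical values).
vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
middle_context = context_features(vectors, 1)
print(middle_context)
```

Padding keeps the context feature at a fixed length for every content block, which a fixed-width feature matrix requires.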

In yet another example, if the features of the content block include the word feature of the upper-level title of the content block, the word feature of the upper-level title is obtained based on the text content of the upper-level title of the content block. If the content block is not a title, the upper-level titles of the content block comprise the title corresponding to the content block and the upper-level titles of that title. When calculating the word feature of the upper-level title of the content block, the word feature may be calculated based on the text content of the title corresponding to the content block and the text content of the upper-level titles of that title. For example, if the content block corresponds to the third-level title "7. Notes to consolidated financial statement items", the word feature of the upper-level title may be calculated from the text content of the content block's third-level title "7. Notes to consolidated financial statement items", the second-level title "Section 11. Financial report", and the first-level title "Chapter 2. Fixed assets". The method for calculating the word feature of the upper-level title of the content block is similar to the method for calculating the word feature of the preprocessed content block in the first example, and is not described again.

In a specific example, in order to accurately extract the word feature of the upper-level title of the content block, the upper-level title of the content block needs to be acquired accurately. The manner of obtaining the upper-level title of the content block may include: determining the upper-level title of the content block according to the title logical tree.

First, regarding the specific structure of the title logical tree, the structure may be as shown in FIG. 2. The title logical tree is composed of a root node R₀, a first subtree composed of child nodes A₁-A₇, a second subtree composed of child nodes A₈-A₁₃, and a third subtree composed of child nodes A₁₄-A₁₉, where A₁, A₈ and A₁₄ are the three child nodes directly connected to R₀. The three subtrees have no direct connection with each other.

In the logical tree of headings shown in FIG. 2, the root node R₀May be the topic name of the document or the topic of the document. Still alternatively, the root node R shown in FIG. 2₀Or may be left vacant, i.e. the root node R₀And is not used to represent the hierarchical structure of a directory. All the child nodes constituting the three subtrees are titles. For any child node in the subtree, the parent node is the upper-level title, and the child node is the lower-level title. For example, child node A₁Is the first level header, child node A₂Is the second level title under the first level title.

Correspondingly, for any child node in the title logical tree, determining the upper-level title of the content block according to the title logical tree specifically includes: taking all nodes on the path between the root node of the title logical tree and the child node as upper-level nodes of that child node. For example, for child node A₆, the upper-level nodes include A₁, A₃, and A₅.
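The upper-level lookup described above can be sketched as a walk from a node up toward the root. This is a minimal Python sketch, not the implementation of the embodiment: the `TitleNode` class, the `upper_titles` function, and the tree fragment used in the example (R0, A1, A3, A5, A6 in a single chain) are assumptions chosen only to reproduce the A₆ example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class TitleNode:
    """Hypothetical node of the title logical tree."""
    name: str
    parent: Optional["TitleNode"] = None
    children: List["TitleNode"] = field(default_factory=list)

    def add_child(self, child: "TitleNode") -> "TitleNode":
        child.parent = self
        self.children.append(child)
        return child


def upper_titles(node: TitleNode) -> List[str]:
    """Return the names of all upper-level titles of `node`, top-most first.

    Assumption: the root is the only node whose parent is None, and it is
    skipped because it only anchors the tree.
    """
    path = []
    current = node.parent
    while current is not None and current.parent is not None:
        path.append(current.name)
        current = current.parent
    return list(reversed(path))


# Assumed fragment of the tree in FIG. 2: R0 -> A1 -> A3 -> A5 -> A6.
root = TitleNode("R0")
a1 = root.add_child(TitleNode("A1"))
a3 = a1.add_child(TitleNode("A3"))
a5 = a3.add_child(TitleNode("A5"))
a6 = a5.add_child(TitleNode("A6"))

print(upper_titles(a6))  # ['A1', 'A3', 'A5']
```

Storing a parent reference on each node keeps the lookup proportional to the depth of the title, which is usually small for document outlines.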

Second, the method of constructing the title logical tree may include a first sub-step and a second sub-step, as follows.

The method comprises the following specific steps:

In the first sub-step, the title ordered sequence of the document to be retrieved is obtained. The order of the titles in the title ordered sequence is the same as the order in which the titles appear in the document to be retrieved. Illustratively, according to the order in which the titles appear in the document to be retrieved, the titles are title A₁, title A₂, ..., title A_m in sequence, where the subscript of each title indicates the order in which the title appears in the document. The title ordered sequence is {title A₁, title A₂, ..., title A_m}, where m is a positive integer.

In the second sub-step, the titles in the title ordered sequence are taken in turn as the first title, and the following operations are performed for each first title.

First, if the subtree of the title logical tree in which the previous title of the first title is located contains a second title at the same level as the first title, the first title is taken as a peer node of the second title, and the parent node of this peer node is the same as the parent node of the second title. Whether the first title and the second title are at the same level may be judged by a title discrimination model. For example, the title discrimination model may include a Feedforward Neural Network (FNN) model and a second Softmax classifier.

To aid understanding of the second sub-step, it is described below with reference to FIG. 2. Continuing with FIG. 2, if title A₁₄ is taken as the first title, the subtree in which its previous title A₁₃ is located is the subtree composed of child node A₈ and all child nodes A₉ to A₁₃ directly or indirectly connected to child node A₈. It is then necessary to determine whether a same-level title of A₁₄ exists among child nodes A₈ to A₁₃. If A₁₁ and A₁₄ are at the same level, A₁₄ is determined to be a child node of A₁₀, that is, A₁₄ is connected below A₁₀. If child node A₈ is a same-level title of A₁₄, A₁₄ is connected to the root node R₀. In this case, A₁₄ serves as the starting node of the third subtree in the title logical tree.

Second, if the subtree in which the previous title is located does not contain a second title, the first title is taken as a child node of the previous title. Continuing with the previous example, if none of child nodes A₈ to A₁₃ is a same-level title of A₁₄, A₁₄ is determined to be a child node of A₁₃, that is, A₁₄ is connected below A₁₃.
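The two sub-steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the embodiment itself: `TitleNode`, `attach`, and `build_title_tree` are hypothetical names; the `same_level` callback stands in for the title discrimination model (here a simple comparison of an assumed integer `level` field, not an FNN with a Softmax classifier); the very first title is assumed to attach to the root; and when several same-level titles exist in the subtree, the most recently placed one is assumed to be the second title.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class TitleNode:
    text: str
    level: int = 0  # assumed stand-in for what the discrimination model would infer
    parent: Optional["TitleNode"] = None
    children: List["TitleNode"] = field(default_factory=list)


def attach(parent: TitleNode, child: TitleNode) -> None:
    child.parent = parent
    parent.children.append(child)


def build_title_tree(
    titles: List[TitleNode],
    same_level: Callable[[TitleNode, TitleNode], bool],
) -> TitleNode:
    root = TitleNode("R0")
    placed: List[TitleNode] = []  # titles already in the tree, in document order
    for first in titles:
        if not placed:
            attach(root, first)  # assumption: the first title hangs off the root
            placed.append(first)
            continue
        previous = placed[-1]
        # Find the top-level node of the subtree that contains the previous title.
        subtree_root = previous
        while subtree_root.parent is not root:
            subtree_root = subtree_root.parent
        # Search that subtree, most recent title first, for a same-level second title.
        second = None
        for candidate in reversed(placed):
            if same_level(candidate, first):
                second = candidate
                break
            if candidate is subtree_root:
                break
        if second is not None:
            attach(second.parent, first)  # peer node: shares the second title's parent
        else:
            attach(previous, first)       # no same-level title: child of the previous title
        placed.append(first)
    return root


titles = [
    TitleNode("Chapter 1", 1),
    TitleNode("Section 1.1", 2),
    TitleNode("Item 1.1.1", 3),
    TitleNode("Section 1.2", 2),
    TitleNode("Chapter 2", 1),
    TitleNode("Section 2.1", 2),
]
tree = build_title_tree(titles, lambda a, b: a.level == b.level)
print([c.text for c in tree.children])              # ['Chapter 1', 'Chapter 2']
print([c.text for c in tree.children[0].children])  # ['Section 1.1', 'Section 1.2']
```

Because each new title is attached either inside the previous title's subtree or directly under the root, the titles of one subtree are contiguous in `placed`, so the backward scan can stop as soon as it reaches the subtree's top-level node.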

After the features of the content block are extracted in the first step, S110 may further include a second step of inputting the features of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence. For the description of the relevance score and the relevance score model, reference may be made to the related contents in the above embodiments of the present invention, and details are not described herein again.

And S120, determining a target content block related to the key sentence from the content blocks based on the relevance scores of the content blocks and the key sentence.

In some embodiments, if the document to be retrieved includes M content blocks, the N content blocks with the highest relevance scores among the M content blocks may be determined as the target content blocks, where N is a positive integer not greater than M. N may be set according to the specific work scenario and work requirements, for example N = 100, which is not limited herein.
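The top-N selection can be sketched with the Python standard library. The function name `select_target_blocks`, the `(block_id, score)` tuple layout, and the sample scores are assumptions made for illustration only.

```python
import heapq
from typing import List, Tuple


def select_target_blocks(
    scored_blocks: List[Tuple[str, float]], n: int
) -> List[Tuple[str, float]]:
    """Return the n blocks with the highest relevance scores, best first."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    n = min(n, len(scored_blocks))  # N is not greater than M
    return heapq.nlargest(n, scored_blocks, key=lambda item: item[1])


# Hypothetical (block id, relevance score) pairs for M = 4 content blocks.
blocks = [("p1", 0.12), ("table3", 0.91), ("h2", 0.47), ("p9", 0.78)]
print(select_target_blocks(blocks, 2))  # [('table3', 0.91), ('p9', 0.78)]
```

`heapq.nlargest` returns its result in descending order of the key, so the selected blocks are already ordered from the highest to the lowest relevance score.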

S130, the target content block is used as a content block retrieval result of the key sentence in the document to be retrieved.

According to the key sentence based content block retrieval method, the relevance score between a content block in the document to be retrieved and the key sentence can be calculated by using the relevance score model of the key sentence, and a target content block related to the key sentence is selected from the content blocks based on the relevance score. Because the relevance score can accurately represent the degree of relevance between the content block and the key sentence, the method can improve the retrieval accuracy of content blocks compared with searching for a keyword in the document to be retrieved.

In some embodiments, the target content blocks may be displayed on the display interface in order of high to low relevancy scores of the target content blocks to the key sentences.

In some embodiments, the position feature of the target content block may be extracted so that the target content block can be located quickly when the user needs it. The position feature of the target content block may be the ratio of the page number on which the content block is located in the document to be retrieved to the total number of pages of the document to be retrieved. For example, if the target content block is on page 7 of the document to be retrieved, and the document to be retrieved has a total of 12 pages, the value of the position feature of the target content block is 7/12.
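The position feature is a simple ratio, which can be sketched as follows. The function name `position_feature` and the use of `fractions.Fraction` to keep the exact value are illustrative choices, not part of the embodiment.

```python
from fractions import Fraction


def position_feature(page_number: int, total_pages: int) -> Fraction:
    """Ratio of the block's page number to the document's total page count."""
    if not 1 <= page_number <= total_pages:
        raise ValueError("page_number must lie between 1 and total_pages")
    return Fraction(page_number, total_pages)


feature = position_feature(7, 12)
print(feature)  # 7/12
```

Calling `float(feature)` yields the decimal form if a numeric model input is preferred.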

An apparatus according to an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.

Based on the same inventive concept, an embodiment of the present invention provides a key sentence based content block retrieval apparatus. Fig. 3 is a schematic structural diagram of a key sentence based content block retrieval apparatus according to an embodiment of the present invention. As shown in fig. 3, the key sentence based content block retrieval apparatus 300 includes:

a calculating module 310, configured to obtain a relevance score between a content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;

a determining module 320, configured to determine a target content block related to the key sentence from the content blocks based on the relevance scores between the content blocks and the key sentence; and

a result processing module 330, configured to use the target content block as a content block retrieval result of the key sentence in the document to be retrieved.

In some embodiments of the present invention, the calculation module 310 includes an extraction unit and a scoring unit.

The extraction unit is used for extracting the characteristics of the content blocks of the document to be retrieved.

And the scoring unit is used for inputting the characteristics of the content block into the relevance scoring model to obtain the relevance score of the content block and the key sentence.

In some embodiments of the invention, the characteristics of the content block include at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the upper-level title of the title corresponding to the content block.

In some embodiments, the extraction unit is specifically configured to: if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation; if the characteristics comprise the contextual word characteristics of the content block, obtaining the contextual word characteristics of the content block based on the content block and the adjacent content blocks of the content block; if the characteristics include word characteristics of the upper level title of the content block, the word characteristics of the upper level title of the content block are obtained based on the upper level title of the content block.

In some embodiments, the extraction unit is specifically configured to: determining an upper level title of the content block based on the title logical tree; the word feature of the upper level title of the content block is obtained based on the upper level title of the content block.

In some embodiments, the key sentence based content block retrieval apparatus 300 further includes an obtaining module and a logical tree generating module.

The acquisition module is used for acquiring the ordered sequence of the titles of the documents to be retrieved.

The logical tree generating module is configured to take the titles in the title ordered sequence in turn as the first title and, for each first title, perform the following operations: if the subtree of the title logical tree in which the previous title of the first title is located contains a second title at the same level as the first title, taking the first title as a peer node of the second title, wherein the parent node of the peer node is the same as the parent node of the second title;

and if the subtree in which the previous title is located does not contain a second title, taking the first title as a child node of the previous title.

In some embodiments of the present invention, the content block retrieval apparatus 300 based on key sentences further comprises a model training module.

The model training module is configured to perform the following operations for the relevance score model of the key sentence: marking content block samples related to the key sentence as positive samples, and marking content block samples unrelated to the key sentence as negative samples; and training the relevance score model of the key sentence by using the positive samples and the negative samples.
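Training on positive and negative samples can be illustrated with a small binary classifier. The sketch below uses plain logistic regression trained by stochastic gradient descent as a stand-in; this section does not state the form of the relevance score model, so the model type, the function names `train_relevance_model` and `relevance_score`, the two-dimensional toy features, and the hyperparameters are all assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def relevance_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Score in (0, 1); the last weight is the bias term."""
    z = sum(w * v for w, v in zip(weights, features)) + weights[-1]
    return sigmoid(z)


def train_relevance_model(
    samples: List[Sequence[float]],
    labels: List[int],  # 1 = positive sample, 0 = negative sample
    epochs: int = 200,
    learning_rate: float = 0.5,
) -> List[float]:
    weights = [0.0] * (len(samples[0]) + 1)  # feature weights plus bias
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = relevance_score(weights, x) - y  # gradient of log loss w.r.t. z
            for i, value in enumerate(x):
                weights[i] -= learning_rate * error * value
            weights[-1] -= learning_rate * error
    return weights


# Toy feature vectors (assumed): [block word feature, upper-level title word feature].
positive = [[0.9, 0.8], [0.7, 0.9]]  # content blocks related to the key sentence
negative = [[0.1, 0.2], [0.2, 0.0]]  # content blocks unrelated to the key sentence
model = train_relevance_model(positive + negative, [1, 1, 0, 0])

related = relevance_score(model, [0.8, 0.9])
unrelated = relevance_score(model, [0.1, 0.1])
print(related > unrelated)  # True
```

One model is trained per key sentence, so the same routine would be repeated with a separate labeled sample set for each key sentence of interest.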

In some embodiments of the present invention, the determining module 320 is specifically configured to: and determining the first N content blocks with the highest relevancy scores as target content blocks.

Other details of the content block retrieval device based on the key sentence according to the embodiment of the present invention are similar to those of the content block retrieval method based on the key sentence according to the embodiment of the present invention described above with reference to fig. 1 to 2, and can achieve the corresponding technical effects, which are not described herein again.

As shown in fig. 4, the key sentence based content block retrieval device 400 includes an input device 401, an input interface 402, a central processor 403, a memory 404, an output interface 405, and an output device 406. The input interface 402, the central processing unit 403, the memory 404, and the output interface 405 are connected to each other through a bus 410, and the input device 401 and the output device 406 are connected to the bus 410 through the input interface 402 and the output interface 405, respectively, and further connected to other components of the content block retrieval device 400 based on the key sentence.

Specifically, the input device 401 receives input information from the outside and transmits the input information to the central processor 403 through the input interface 402; the central processor 403 processes the input information based on computer-executable instructions stored in the memory 404 to generate output information, stores the output information temporarily or permanently in the memory 404, and then transmits the output information to the output device 406 through the output interface 405; the output device 406 outputs the output information to the outside of the content block retrieval device 400 based on the key sentence for use by the user.

That is, the key sentence based content block retrieval device shown in fig. 4 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement the key sentence based content block retrieval method and apparatus described in conjunction with fig. 1-2.

In one embodiment, the key sentence based content block retrieval device 400 shown in fig. 4 may be implemented as a device that may include: a memory for storing a program; and a processor for running the program stored in the memory to execute the key sentence based content block retrieval method of the embodiment of the invention.

The embodiment of the invention also provides a computer storage medium, wherein computer program instructions are stored on the computer storage medium, and when executed by a processor, the computer program instructions implement the key sentence based content block retrieval method.

It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.

The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.

As will be apparent to those skilled in the art, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Claims

1. A content block retrieval method based on key sentences is characterized by comprising the following steps:

obtaining a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;

determining a target content block related to the key sentence from the content blocks based on the relevancy scores of the content blocks and the key sentences;

and taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.

2. The method of claim 1,

the obtaining of the relevance score between the content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence comprises:

extracting the characteristics of the content blocks of the document to be retrieved;

and inputting the characteristics of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence.

3. The method of claim 2, wherein the characteristics of the content block comprise at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the superior title of the title corresponding to the content block.

4. The method of claim 2,

the extracting the characteristics of the content block of the document to be retrieved comprises the following steps:

if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation;

if the characteristics comprise the context word characteristics of the content block, obtaining the context word characteristics of the content block based on the content block and the adjacent content block of the content block;

and if the characteristics comprise the word characteristics of the superior titles of the content blocks, obtaining the word characteristics of the superior titles of the content blocks based on the superior titles of the content blocks.

5. The method of claim 4, wherein obtaining the word feature of the superior title of the content block based on the superior title of the content block comprises:

determining an upper level title of the content block based on a title logical tree;

and obtaining the word characteristics of the superior titles of the content blocks based on the superior titles of the content blocks.

6. The method of claim 5, further comprising:

acquiring a title ordered sequence of the document to be retrieved;

sequentially taking the titles in the title ordered sequence as first titles;

for each first title, performing the following operations:

if a subtree where a previous title of the first title is located in the title logical tree has a second title at the same level as the first title, taking the first title as a peer node of the second title, wherein a father node of the peer node of the second title is the same as a father node of the second title;

7. The method of claim 1, further comprising:

aiming at the relevance scoring model of the key sentences, the following operations are carried out:

marking the content block sample related to the key sentence as a positive sample, and marking the content block sample unrelated to the key sentence as a negative sample;

and training a relevancy scoring model of the key sentences by using the positive samples and the negative samples.

8. The method of claim 1, wherein determining a target content block related to a key sentence from the content blocks based on the relevance scores of the content blocks and the key sentence comprises:

and determining the first N content blocks with the highest relevancy scores as the target content blocks.

9. An apparatus for retrieving a content block based on a key sentence, the apparatus comprising:

a calculation module, configured to obtain a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;

a determining module, configured to determine, from the content blocks, target content blocks related to the key sentences based on the relevancy scores of the content blocks and the key sentences;

and the result processing module is used for taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.

10. A content block retrieval device based on a key sentence, characterized by comprising:

a memory for storing a program;

a processor for executing the program stored in the memory to execute the key sentence based content block retrieval method of any one of claims 1 to 8.

11. A computer storage medium having computer program instructions stored thereon, which when executed by a processor, implement the key sentence-based content block retrieval method of any one of claims 1-8.