CN113641783A - Key sentence based content block retrieval method, device, equipment and medium - Google Patents

Key sentence based content block retrieval method, device, equipment and medium

Info

Publication number
CN113641783A
Authority
CN
China
Prior art keywords
content block
title
content
key
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010345947.1A
Other languages
Chinese (zh)
Inventor
林得苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pai Tech Co ltd
Original Assignee
Pai Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pai Tech Co ltd
Priority to CN202010345947.1A
Publication of CN113641783A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a content block retrieval method, a content block retrieval device, content block retrieval equipment and a content block retrieval medium based on key sentences. The method comprises the following steps: obtaining a relevance score of a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; determining a target content block related to the key sentences from the content blocks based on the relevancy scores of the content blocks and the key sentences; and taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved. According to the content block retrieval method, the content block retrieval device, the content block retrieval equipment and the content block retrieval medium based on the key sentences, provided by the embodiment of the invention, the retrieval accuracy of the document can be improved.

Description

Key sentence based content block retrieval method, device, equipment and medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for retrieving a content block based on a key sentence.
Background
To obtain the content a user needs from a document to be retrieved, the user must manually enter a keyword in the document tool, which then locates the positions related to the keyword in the document. Taking a WORD document as an example, the built-in "search" function can determine the search results for a keyword in the document, such as the sentence in which the keyword is located. The retrieval accuracy of this approach is low.
Disclosure of Invention
The content block retrieval method, the content block retrieval device, the content block retrieval equipment and the content block retrieval medium based on the key sentences can improve the content block retrieval accuracy of the document.
In a first aspect, a method for retrieving a content block based on a key sentence is provided, which includes: obtaining a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; determining a target content block related to the key sentence from the content blocks based on the relevance scores between the content blocks and the key sentence; and taking the target content block as the content block retrieval result of the key sentence in the document to be retrieved.
According to the content block retrieval method based on key sentences in the embodiment of the invention, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated by using the relevance score model of the key sentence, and a target content block related to the key sentence can then be selected from the document to be retrieved based on the relevance scores. Because the relevance score can accurately represent the relevance between the content block and the key sentence, the method can improve the retrieval accuracy of the content block compared with searching for a keyword in the document to be retrieved.
In an optional implementation manner, obtaining a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence includes: extracting the characteristics of the content blocks of the document to be retrieved; and inputting the characteristics of the content block into the relevance grade model to obtain the relevance grade of the content block and the key sentence.
In this example, by extracting the features of the content block and then calculating the relevancy score between the content block and the key sentence using the features of the content block, the calculation accuracy of the relevancy score can be ensured and the calculation speed can be increased.
In an alternative embodiment, the characteristics of the content block include at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the superior title of the title corresponding to the content block.
In the example, by calculating the word features of the content blocks, the text features of the content blocks can be accurately identified, and the calculation accuracy of the relevancy scores is improved. By using the context word features of the content block, the calculation accuracy of the relevancy score can be improved according to the word features of the surrounding content blocks of the content block. Through the word characteristics of the superior titles of the titles corresponding to the content blocks, the relevancy score can be calculated according to the relevancy between the content blocks and the superior titles, and the calculation accuracy of the relevancy score is improved.
In an alternative embodiment, the extracting the feature of the content block of the document to be retrieved includes: if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation; if the characteristics comprise the contextual word characteristics of the content block, obtaining the contextual word characteristics of the content block based on the content block and the adjacent content blocks of the content block; if the characteristics include word characteristics of the upper level title of the content block, the word characteristics of the upper level title of the content block are obtained based on the upper level title of the content block.
In this example, the features of the content block can be accurately obtained, thereby ensuring the accuracy of the correlation calculation.
In an alternative embodiment, obtaining the word feature of the upper level title of the content block based on the upper level title of the content block includes: determining an upper level title of the content block based on the title logical tree; the word feature of the upper level title of the content block is obtained based on the upper level title of the content block.
In this example, since the title corresponding to any node in the title logical tree is the upper-level title of the titles corresponding to the child nodes of that node, the hierarchical relationship between the title components can be determined by establishing the title logical tree, so that the upper-level title of the content block can be accurately obtained and the calculation accuracy of the relevance score can be improved.
In an optional embodiment, the method further comprises: acquiring an ordered title sequence of the document to be retrieved; taking each title in the ordered title sequence in turn as a first title; and, for each first title, performing the following operations: if the subtree in which the previous title of the first title is located in the title logical tree contains a second title at the same level as the first title, taking the first title as a peer node of the second title, wherein the parent node of the peer node is the same as the parent node of the second title; and if the subtree in which the previous title is located does not contain such a second title, taking the first title as a child node of the previous title.
In the prior-art method of generating a directory structure from a title template, the number of levels in the generated structure is limited to the number set in the title template. For example, if only 3 title levels are set in the template, at most 3 title levels can be generated. With the method in the embodiment of the invention, each title component is compared with the title components already added to the title logical tree to decide whether it is at the same level as one of them; if no same-level title component is found, it is used as a child node of the previous title component. Therefore, even if the document to be processed has more title levels, for example 8 or 9 levels, the corresponding title levels can be generated. Compared with generating the directory structure from a title template, the method can improve the flexibility, accuracy and depth of the generated directory structure.
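The tree-building rule described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the source does not say how two titles are judged to be at the same level, so the sketch assumes that titles sharing a numbering style are peers. The names `STYLE_PATTERNS`, `style_of`, `TitleNode` and `build_title_tree` are hypothetical, and "the subtree in which the previous title is located" is interpreted as the chain from the previous title up to the root.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical numbering styles; a real document would need a richer set.
STYLE_PATTERNS = [
    ("chapter", re.compile(r"^Chapter \d+")),
    ("section", re.compile(r"^Section \d+")),
    ("decimal", re.compile(r"^\d+\.")),
    ("paren", re.compile(r"^\(\d+\)")),
]


def style_of(title: str) -> str:
    """Return the name of the first numbering style that matches the title."""
    for name, pattern in STYLE_PATTERNS:
        if pattern.match(title):
            return name
    return "other"


@dataclass(eq=False)
class TitleNode:
    text: str
    style: str
    parent: Optional["TitleNode"] = field(default=None, repr=False)
    children: List["TitleNode"] = field(default_factory=list)


def build_title_tree(titles: List[str]) -> TitleNode:
    """Build a title logical tree from titles listed in document order."""
    root = TitleNode(text="<root>", style="root")
    previous = root
    for text in titles:
        node = TitleNode(text=text, style=style_of(text))
        # Walk from the previous title towards the root, looking for a
        # title with the same style (assumed to mean "same level").
        peer = previous
        while peer is not root and peer.style != node.style:
            peer = peer.parent
        # Peer found: share its parent. No peer: child of the previous title.
        parent = peer.parent if peer is not root else previous
        node.parent = parent
        parent.children.append(node)
        previous = node
    return root


tree = build_title_tree(
    ["Chapter 1", "Section 1", "1. A", "(1) x", "(2) y", "2. B", "Section 2", "Chapter 2"]
)
```

Because the depth of the tree depends only on the titles encountered, no maximum number of levels has to be fixed in advance, which is the property the paragraph above contrasts with template-based generation.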
In an optional embodiment, the method further comprises: aiming at the relevance scoring model of the key sentences, the following operations are carried out: marking the content block sample related to the key statement as a positive sample, and marking the content block sample unrelated to the key statement as a negative sample; and training a relevance scoring model of the key sentences by using the positive samples and the negative samples.
The relevance scoring model is trained by a method of marking the content blocks as positive and negative samples, so that the accuracy of the model can be improved.
In an alternative embodiment, determining a target content block related to a key sentence from the content blocks based on the relevancy scores of the content blocks and the key sentences includes: and determining the first N content blocks with the highest relevancy scores as target content blocks.
In this embodiment, since the relevance score can accurately represent the relevance between the content block and the sentence to be retrieved, the retrieval accuracy can be improved by calculating the relevance score. In addition, content blocks with low relevance to the sentence to be retrieved are screened out, so that the content block retrieval efficiency can be improved and the content block retrieval result expected by the user can be retrieved.
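Selecting the first N content blocks by relevance score can be illustrated with a short sketch. The function name `top_n_blocks` and the representation of scores as a plain list indexed by content block position are assumptions made for illustration.

```python
import heapq
from typing import List, Tuple


def top_n_blocks(scores: List[float], n: int) -> List[Tuple[int, float]]:
    """Return (block_index, score) pairs for the n highest scores, best first."""
    return heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])


# Hypothetical relevance scores for four content blocks.
target_blocks = top_n_blocks([0.1, 0.9, 0.4, 0.7], n=2)
```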
In a second aspect, a content block retrieval apparatus based on a key sentence is provided, including: the calculation module is used for obtaining a relevance score of a content block of a document to be retrieved and the key sentences based on the relevance score model of the key sentences, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; the determining module is used for determining a target content block related to the key sentences from the content blocks based on the relevancy scores of the content blocks and the key sentences; and the result processing module is used for taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.
In a third aspect, a content block retrieval device based on a key sentence is provided, including: a memory for storing a program; a processor, configured to execute a program stored in the memory to execute the method for retrieving a content block based on a key statement provided in the first aspect or any optional implementation manner of the first aspect.
In a fourth aspect, a computer storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the method for retrieving a content block based on a key sentence provided in the first aspect or any optional implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram illustrating a key sentence based content block retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an exemplary logical tree of elements, in accordance with an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating a content block retrieval apparatus based on a key sentence according to an embodiment of the present invention;
FIG. 4 is a block diagram of an exemplary hardware architecture of a content block retrieval device based on a key sentence in the embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides a content block retrieval scheme based on key sentences, which is suitable for scenarios in which a sentence to be retrieved is input for a document in order to retrieve the key content blocks of the document content. The scheme is particularly suitable for retrieving documents with a specific text structure, such as complex financial information texts including prospectuses, bond prospectuses, annual reports, financial reports, merger and reorganization reports, rating reports, research reports, legal contract documents, and public opinion news. After the key sentence is obtained, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated, the target content blocks related to the key sentence are determined from the document to be retrieved according to the relevance scores, and the target content blocks are used as the content block retrieval result of the key sentence in the document to be retrieved.
In the embodiment of the invention, the document to be retrieved refers to an electronic document capable of acquiring the text and diagram information of the document. Specifically, it may be an electronic document in a WORD format, a PDF format, TXT, or the like. In addition, the document to be retrieved can be regarded as being composed of a plurality of paragraphs, wherein a table, a picture, a chart, a title and the like can be regarded as one paragraph respectively. Therefore, the document to be retrieved can be divided into a plurality of content blocks independent of each other in units of paragraphs. That is, the content block of the document to be retrieved includes at least one of a text content paragraph, a title, a table, a chart, and a picture.
A document to be retrieved often contains multi-level titles. In order of hierarchy from high to low, these are the first-level title, the second-level title, the third-level title, and so on. A high-level title often has multiple low-level titles under it, all of which are subordinate to the high-level title. An L-th level title is subordinate to the L-1 levels of titles above it, and all of those L-1 titles are upper-level titles of the L-th level title. Illustratively, if the document to be retrieved contains the five-level title "(1), fixed asset case", this title has the following upper-level titles in order of hierarchy from low to high: four-level title "19, fixed asset", three-level title "seven, merge financial statement item comments", secondary title "eleventh section, financial report", primary title "chapter two, fixed asset". For ease of understanding, the following embodiments of the present invention will continue to use the above five-level title as an example.
Since titles tend to be a high-level summary of the content of one or more successive text content paragraphs, each title tends to be followed immediately by one or more successive content blocks, such as text content paragraphs, pictures, charts, tables, and the like. In the embodiment of the present invention, a content block that follows a certain title may be considered to correspond to that title. Illustratively, suppose the content blocks appear in the document to be retrieved in the following sequence: three-level title A31, text content paragraph B2, table C1, chart D1, four-level title A41, text content paragraph B3, text content paragraph B4, three-level title A32. Then text content paragraph B2, table C1 and chart D1 correspond to three-level title A31, and text content paragraphs B3 and B4 correspond to four-level title A41.
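The correspondence between content blocks and titles in the example above can be reproduced with a short sketch. It assumes each content block is given as a (kind, identifier) pair in document order; the function name `map_blocks_to_titles` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def map_blocks_to_titles(blocks: List[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Map each non-title block to the nearest title that precedes it."""
    mapping: Dict[str, Optional[str]] = {}
    current_title: Optional[str] = None
    for kind, ident in blocks:
        if kind == "title":
            current_title = ident
        else:
            # Blocks before the first title map to None.
            mapping[ident] = current_title
    return mapping


# The sequence from the example above.
sequence = [
    ("title", "A31"), ("paragraph", "B2"), ("table", "C1"), ("chart", "D1"),
    ("title", "A41"), ("paragraph", "B3"), ("paragraph", "B4"), ("title", "A32"),
]
block_titles = map_blocks_to_titles(sequence)
```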
In order to better understand the technical solution of the embodiment of the present invention, a method, an apparatus, a device, and a medium for retrieving a content block based on a key sentence according to the embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart illustrating a key sentence-based content block retrieval method according to an embodiment of the present invention. As shown in fig. 1, the content block retrieval method 100 based on key sentences in the present embodiment may include the following steps:
S110: obtaining the relevance score between each content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence.
First, regarding the key sentence: before S110, a plurality of key sentences may be preset, and one of them may then be selected as the key sentence. The key sentence characterizes what the user desires to retrieve from the document to be retrieved. For example, for a prospectus, the key sentence may be "net profit", "business income of main business", or the like, as desired by the user. The key sentence can be set according to the actual requirements and the application scenario of the document to be retrieved, which is not limited here. In addition, the key sentence may be composed of at least one complete sentence or at least one word, which is not particularly limited.
Next, regarding the relevance score between the content block and the key sentence: the relevance score indicates the degree of relevance between the content block and the key sentence. The higher the relevance score, the more relevant the content block is to the key sentence. For example, if the document to be retrieved includes M content blocks, the relevance score between each of the M content blocks and the key sentence may be calculated. Optionally, the relevance score takes values in the range [0, 1]. In some embodiments, the relevance score may be determined based on a pre-trained relevance score model of the key sentence.
Then, regarding the relevance score model of the key sentence: a Gradient Boosting Decision Tree (GBDT) regression model or a logistic regression model may be selected. Preferably, in order to balance the speed and accuracy of score calculation, the relevance score model is a logistic regression model.
In the process of training the relevance score model, L content block samples may first be selected in advance; the content block samples related to the key sentence among the L content block samples are marked as positive samples, and the content block samples unrelated to the key sentence are marked as negative samples. The relevance score model of the key sentence is then trained using the positive samples and the negative samples. To obtain the L content block samples, K training documents may be selected, and all content blocks in the K training documents are used as content block samples, so that the total number of content blocks in the K training documents is L. The label of a positive sample may be set to 1 and the label of a negative sample may be set to 0. That is, the expected predicted value of the relevance score is 1 for positive samples and 0 for negative samples. Optionally, if the relevance score is calculated using the features of the content block, and the features include R sub-features, the training data for the relevance score model may be implemented as a two-dimensional data matrix with L rows and R columns.
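As a minimal sketch of the training step, the following fits a logistic regression model on an L x R feature matrix with labels 1 (positive) and 0 (negative). It uses plain batch gradient descent written in standard-library Python so that it is self-contained; the source does not specify an optimizer, learning rate or library, so `train_logistic_regression`, `relevance_score`, `learning_rate` and `epochs` are all assumptions, and the toy data is invented for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(
    X: List[List[float]], y: List[int], learning_rate: float = 0.5, epochs: int = 2000
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, row)) + bias
            error = sigmoid(z) - label  # derivative of the log loss w.r.t. z
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def relevance_score(features: List[float], weights: List[float], bias: float) -> float:
    """Relevance score in [0, 1] for one content block's feature vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


# Toy training matrix with L = 4 rows (samples) and R = 2 columns (sub-features).
X_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y_train = [1, 1, 0, 0]
model_weights, model_bias = train_logistic_regression(X_train, y_train)
```

The sigmoid output naturally falls in the [0, 1] range mentioned for the relevance score, which is one reason logistic regression fits this labelling scheme.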
In addition, as noted above, several key sentences may be preset before S110, one of which is selected as the key sentence. To ensure the calculation accuracy and calculation speed of the relevance score, a separate relevance score model can be established for each of the key sentences. After the key sentence is determined, the relevance score model of that key sentence can be selected from the trained relevance score models. The training method for the relevance score model of each key sentence is similar to the training method described above, and is not described herein again.
In one example, if financial information texts are to be retrieved using the content block retrieval scheme provided by the embodiment of the invention, the data texts disclosed by the major stock exchanges can be used as training documents. To verify the accuracy of the relevance score models trained in the embodiment of the invention, the relevance score model of each key sentence can be evaluated using, as the evaluation index, the recall rate of the 5 content blocks most relevant to the key sentence in the training text. If there are P key sentences in total, the average of the evaluation indexes of the P key sentences may be used as the total evaluation index. When the data texts of the Hong Kong Stock Exchange, the Shanghai Stock Exchange and the Shenzhen Stock Exchange are used in turn as training samples, the total evaluation index of the relevance score model reaches 88%, 97% and 93%, respectively, which indicates that the relevance score model computes scores with high accuracy.
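The evaluation index described above can be sketched as follows. The source does not give a formula, so this sketch assumes the common definition of recall at k: the fraction of the truly relevant content blocks that appear among the k highest-scored blocks. The function names `recall_at_k` and `mean_recall_at_k` and the sample numbers are hypothetical.

```python
from typing import List, Tuple


def recall_at_k(scores: List[float], labels: List[int], k: int = 5) -> float:
    """Fraction of relevant blocks (label 1) found among the k top-scored blocks."""
    relevant = sum(labels)
    if relevant == 0:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / relevant


def mean_recall_at_k(per_sentence: List[Tuple[List[float], List[int]]], k: int = 5) -> float:
    """Average recall at k over the P key sentences (total evaluation index)."""
    return sum(recall_at_k(s, l, k) for s, l in per_sentence) / len(per_sentence)


# Hypothetical scores and labels for one key sentence over seven content blocks.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
example_labels = [1, 0, 0, 0, 0, 1, 0]
example_recall = recall_at_k(example_scores, example_labels, k=5)
```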
Finally, in order to ensure the calculation accuracy of the correlation score, the features of each content block may be extracted first, and then the correlation score between the content block and the key sentence may be calculated using the features. Accordingly, a specific embodiment of S110 includes:
In the first step, the features of the content blocks of the document to be retrieved are extracted. The features of a content block include at least one of: the word features of the content block, the context word features of the content block, and the word features of the upper-level title of the title corresponding to the content block. Optionally, the features of the content block may also include the position of the content block in the document. Depending on which features of the content block are used, the specific implementation of the first step differs. The following part of the present invention is divided into three examples, which describe the specific implementation of the first step for each kind of feature.
In one example, the specific implementation of extracting the features of the content blocks of the document to be retrieved includes: and if the characteristics of the content block comprise word characteristics of the content block, preprocessing the content block. And then acquiring word characteristics of the preprocessed content blocks.
First, regarding the preprocessing operation: if the document to be retrieved is a Chinese document, the preprocessing may include a word segmentation operation. Word segmentation may be performed using the jieba word segmentation technique, or another suitable word segmentation method may be selected according to the specific working scenario and requirements, which is not limited herein. During word segmentation, the text of all cells in a table can be collected as the text content of the table, and word segmentation is then performed on that text content. It should be noted that, if the document to be retrieved is a Chinese document, the word segmentation operation may also be performed with an n-gram model. For example, if the content block includes the text content "content block search model" and n in the n-gram model is 2, the word segmentation operation divides the text into a sequence of overlapping two-character segments.
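The n-gram segmentation with n = 2 can be illustrated with a character-level sketch. The function name `char_ngrams` and the Chinese sample string are assumptions chosen for illustration; the sample is not taken from the source.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Split text into overlapping segments of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical six-character sample text.
segments = char_ngrams("信息抽取模型", n=2)
# -> ['信息', '息抽', '抽取', '取模', '模型']
```

Unlike dictionary-based segmentation, this approach needs no vocabulary, at the cost of producing some segments that are not real words.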
The preprocessing may also include operations to remove redundant text. Specifically, the lowest word frequency and/or the highest word frequency may be set. By setting the lowest word frequency, if the frequency of a certain word appearing in the document to be retrieved is lower than the lowest word frequency, the low-frequency word can be deleted from the content block, so that the unusual words, stop words and misspelled words in the document can be removed. By setting the highest word frequency, if the frequency of a certain word appearing in the document to be retrieved is higher than the highest word frequency, the high-frequency word can be deleted from the content block, so that words without actual meanings such as mood auxiliary words, structure auxiliary words and the like in the document can be removed, such as 'yes' and 'yes'. By removing the words, the accuracy of relevancy scoring can be improved.
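A minimal sketch of the lowest and highest word frequency thresholds is shown below. It assumes that the content blocks have already been segmented into token lists and that frequency means the raw count of a token across the whole document; the function name `filter_by_frequency` and its parameters are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def filter_by_frequency(
    blocks: List[List[str]],
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> List[List[str]]:
    """Drop tokens whose document-wide count is below min_count or above max_count."""
    counts = Counter(token for block in blocks for token in block)

    def keep(token: str) -> bool:
        count = counts[token]
        if min_count is not None and count < min_count:
            return False
        if max_count is not None and count > max_count:
            return False
        return True

    return [[token for token in block if keep(token)] for block in blocks]


# Hypothetical segmented content blocks.
sample_blocks = [["the", "net", "profit"], ["the", "net", "income"], ["the", "typo"]]
filtered_blocks = filter_by_frequency(sample_blocks, min_count=2, max_count=2)
```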
Secondly, the word features of the content block may be calculated based on the term frequency-inverse document frequency (TF-IDF) algorithm, in which case the word features of the content block may be a sparse word-vector matrix. Specifically, for each term of each content block, the TF-IDF algorithm may be used to extract a first word feature, which characterizes the term within tables, and a second word feature, which characterizes the term outside tables. Illustratively, a word feature may be a TF-IDF value. In addition, the TF-IDF algorithm can be applied to the text content of the preprocessed content block to obtain the sparse word-vector matrix of the content block. It should be noted that, if the text in the target document is not Chinese, the content block may also be processed directly with the TF-IDF algorithm, without preprocessing, to obtain the sparse word-vector matrix of the content block. Other algorithms may also be used to obtain the word features of the content block, which is not limited here.
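The TF-IDF computation can be sketched in a few lines. The source does not state which TF-IDF variant is used, so this sketch assumes one common form: term frequency as the count divided by the block length, and inverse document frequency as the natural logarithm of the number of blocks divided by the number of blocks containing the term. Each content block is treated as one "document", and the sparse representation as a dictionary per block is an illustrative choice; `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(blocks: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse {term: tf-idf} mapping per segmented content block."""
    n_blocks = len(blocks)
    # Number of blocks in which each term appears at least once.
    doc_freq = Counter(term for block in blocks for term in set(block))
    vectors = []
    for block in blocks:
        counts = Counter(block)
        total = len(block)
        vectors.append(
            {
                term: (count / total) * math.log(n_blocks / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


# Hypothetical segmented content blocks.
sample_vectors = tfidf_vectors([["net", "profit"], ["net", "income"]])
```

In this variant a term that occurs in every block, such as "net" in the sample, receives a weight of zero, so it contributes nothing to distinguishing one block from another.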
In another example, if the features of the content block include a context word feature of the content block, the context word feature is obtained based on the content block and its adjacent content blocks. For example, the word features of the adjacent content blocks may be spliced (concatenated) to obtain the context word feature of the content block. Other algorithms may also be used to obtain the context word features of the content block; this is not limited here.
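One possible reading of the splicing step is sketched below. It assumes dense, equal-length word feature vectors, a window of one content block on each side, and zero padding at the start and end of the document; none of these choices is specified in the source, and the name `context_features` is hypothetical.

```python
from typing import List


def context_features(features: List[List[float]]) -> List[List[float]]:
    """For each content block, concatenate the word features of the
    previous and the next content block (zeros at the document edges)."""
    if not features:
        return []
    zeros = [0.0] * len(features[0])
    result: List[List[float]] = []
    for i in range(len(features)):
        prev_vec = features[i - 1] if i > 0 else zeros
        next_vec = features[i + 1] if i + 1 < len(features) else zeros
        result.append(prev_vec + next_vec)  # list concatenation
    return result


ctx = context_features([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
print(ctx)
```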
In yet another example, if the features of the content block include a word feature of the upper-level titles of the content block, that word feature is obtained based on the text content of those upper-level titles. If the content block is not a title, its upper-level titles comprise the title corresponding to the content block and the upper-level titles of that title. The word feature of the upper-level titles may then be calculated from the text content of the title corresponding to the content block together with the text content of that title's upper-level titles. For example, if the content block corresponds to the third-level title "VII. Notes to Consolidated Financial Statement Items", the word feature of the upper-level titles may be calculated from the text content of the third-level title "VII. Notes to Consolidated Financial Statement Items", the second-level title "Section 11, Financial Report" and the first-level title "Chapter 2, Fixed Assets". The calculation method is similar to that used for the word features of the preprocessed content block in the first example and is not described again.
In a specific example, to extract the word feature of the upper-level titles of the content block accurately, the upper-level titles themselves must be obtained accurately. One way to obtain them is to determine the upper-level titles of the content block according to the title logical tree.
First, regarding the specific structure of the title logical tree, the structure may be as shown in FIG. 2. The title logical tree consists of a root node R0, a first subtree composed of child nodes A1 to A7, a second subtree composed of child nodes A8 to A13, and a third subtree composed of child nodes A14 to A19, where A1, A8 and A14 are the three child nodes directly connected to R0. The three subtrees have no direct connection with one another.
In the title logical tree shown in FIG. 2, the root node R0 may be the name of the document or the topic of the document. Alternatively, the root node R0 shown in FIG. 2 may be left vacant, i.e., the root node R0 is not used to represent the hierarchical structure of the directory. All child nodes constituting the three subtrees are titles. For any child node in a subtree, its parent node is its upper-level title and its child nodes are its lower-level titles. For example, child node A1 is a first-level title, and child node A2 is a second-level title under that first-level title.
Correspondingly, for any child node in the title logical tree, determining the upper-level titles according to the title logical tree specifically includes: the nodes on the path connecting the root node of the title logical tree and that child node are all upper-level nodes of the child node. For example, for child node A6, the upper-level nodes include A1, A3 and A5.
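The lookup of upper-level nodes can be sketched as a walk along parent links. The `parent` map below is hypothetical: it is chosen only to be consistent with the A6 example above and does not reproduce the full layout of FIG. 2. The function name `upper_level_titles` is likewise an assumption.

```python
from typing import Dict, List


def upper_level_titles(node: str, parent: Dict[str, str],
                       root: str = "R0") -> List[str]:
    """Return the titles on the path from the root down to `node`,
    excluding the root and `node` itself, outermost title first."""
    path: List[str] = []
    current = parent[node]
    while current != root:
        path.append(current)
        current = parent[current]
    path.reverse()
    return path


# Hypothetical parent links (child -> parent) for part of the first subtree.
parent = {"A1": "R0", "A2": "A1", "A3": "A1",
          "A4": "A3", "A5": "A3", "A6": "A5"}
print(upper_level_titles("A6", parent))  # ['A1', 'A3', 'A5']
```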
Second, the method of constructing the title logical tree may include a first sub-step and a second sub-step, as follows.
The method comprises the following specific steps:
In the first sub-step, a title ordered sequence of the document to be retrieved is obtained. The order of the titles in the title ordered sequence is the same as the order in which the titles appear in the document to be retrieved. Illustratively, in order of appearance in the document to be retrieved, the titles are title A1, title A2, ..., title Am, where the subscript of each title indicates the order in which the title appears in the document. The title ordered sequence is {title A1, title A2, ..., title Am}, where m is a positive integer.
In the second sub-step, the titles in the title ordered sequence are taken in turn as the first title, and the following operations are performed for each first title.
First, if the subtree of the title logical tree in which the previous title of the first title is located contains a second title at the same level as the first title, the first title is taken as a sibling node of the second title, so that the parent node of the first title is the same as the parent node of the second title. Whether the first title and the second title are at the same level may be judged by a title discrimination model. For example, the title discrimination model may include a feedforward neural network (FNN) model and a second Softmax classifier.
To make this sub-step easier to follow, it is described below with reference to FIG. 2. Continuing with FIG. 2, if title A14 is taken as the first title, the subtree in which its previous title A13 is located is the subtree composed of child node A8 and all child nodes A9 to A13 connected directly or indirectly to A8. It is then necessary to determine whether a same-level title of A14 exists among child nodes A8 to A13. If A11 and A14 are at the same level, A14 is determined to be a child node of A10, i.e., A14 is connected below A10. If child node A8 is the same-level title of A14, A14 is connected to the root node R0. In this case, A14 serves as the starting node of the third subtree in the title logical tree.
Second, if the subtree in which the previous title is located contains no such second title, the first title is taken as a child node of the previous title. Continuing the previous example, if none of child nodes A8 to A13 is a same-level title of A14, A14 is determined to be a child node of A13, i.e., A14 is connected below A13.
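The two sub-steps can be sketched as below. This is an illustration under several stated assumptions, not the patented implementation. The title discrimination model is replaced by a hypothetical callable `same_level`. Titles are assumed to be unique strings. The very first title is attached to the root. When several same-level candidates exist, the one that appeared most recently is used.

```python
from typing import Callable, Dict, List

ROOT = "R0"


def build_title_tree(titles: List[str],
                     same_level: Callable[[str, str], bool]) -> Dict[str, str]:
    """Build the title logical tree as a child -> parent map."""
    parent: Dict[str, str] = {}

    def in_subtree(title: str, top: str) -> bool:
        # True if `title` is `top` or a descendant of `top`.
        while title != ROOT:
            if title == top:
                return True
            title = parent[title]
        return False

    for i, first in enumerate(titles):
        if i == 0:
            parent[first] = ROOT  # assumption: the first title starts a subtree
            continue
        previous = titles[i - 1]
        # Find the root-level node of the subtree containing the previous title.
        top = previous
        while parent[top] != ROOT:
            top = parent[top]
        # Search that subtree for a same-level title, most recent first.
        second = next((t for t in reversed(titles[:i])
                       if in_subtree(t, top) and same_level(t, first)), None)
        if second is not None:
            parent[first] = parent[second]  # sibling of the second title
        else:
            parent[first] = previous        # child of the previous title
    return parent


def by_prefix(a: str, b: str) -> bool:
    # Stand-in for the title discrimination model: compare the first word.
    return a.split()[0] == b.split()[0]


titles = ["Chapter 1", "Section 1.1", "Item 1.1.1",
          "Section 1.2", "Chapter 2", "Section 2.1"]
tree = build_title_tree(titles, by_prefix)
print(tree)
```

Here "Chapter 2" finds "Chapter 1" as its same-level title and is therefore attached to the root, which mirrors the case in which A14 starts the third subtree.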
After the features of the content block are extracted in the first step, S110 may further include a second step: inputting the features of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence. For a description of the relevance score and the relevance score model, refer to the above embodiments of the present invention; details are not repeated here.
And S120, determining a target content block related to the key sentence from the content blocks based on the relevance scores of the content blocks and the key sentence.
In some embodiments, if the document to be retrieved includes M content blocks, the N content blocks with the highest relevance scores among the M content blocks may be determined as the target content blocks, where N is a positive integer not greater than M. N may be set according to the specific work scenario and work requirements, for example N = 100; this is not limited here.
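Selecting the N highest-scoring content blocks can be sketched with the standard library. The function name `select_target_blocks` and the representation of each scored block as a `(block_id, score)` pair are assumptions made for this example.

```python
import heapq
from typing import List, Tuple


def select_target_blocks(scored_blocks: List[Tuple[str, float]],
                         n: int) -> List[Tuple[str, float]]:
    """Return the n blocks with the highest relevance scores,
    ordered from highest to lowest score."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return heapq.nlargest(n, scored_blocks, key=lambda item: item[1])


scored = [("block-1", 0.12), ("block-2", 0.91),
          ("block-3", 0.47), ("block-4", 0.78)]
print(select_target_blocks(scored, 2))
```

Because the result is already ordered from the highest score to the lowest, the same list can be used when results are shown in descending order of relevance.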
S130, the target content block is used as a content block retrieval result of the key sentence in the document to be retrieved.
According to the key sentence-based content block retrieval method of the embodiment of the invention, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated using the relevance score model of the key sentence, and the target content blocks related to the key sentence are selected from the content blocks based on the relevance scores. Because the relevance score can accurately represent how relevant a content block is to the key sentence, the method can improve the accuracy of content block retrieval compared with searching the document to be retrieved for keywords.
In some embodiments, the target content blocks may be displayed on the display interface in order of high to low relevancy scores of the target content blocks to the key sentences.
In some embodiments, the position feature of the target content block may be extracted so that the target content block can be located quickly when the user needs it. The position feature of the target content block may be the ratio of the page number on which the content block is located in the document to be retrieved to the total number of pages of the document to be retrieved. For example, if the target content block is on page 7 of the document to be retrieved and the document has 12 pages in total, the value of the position feature of the target content block is 7/12.
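The position feature is a simple ratio, sketched below. The function name `position_feature` and the input validation are assumptions added for this example.

```python
def position_feature(page_number: int, total_pages: int) -> float:
    """Ratio of the block's page number to the document's total page count."""
    if total_pages <= 0 or not 1 <= page_number <= total_pages:
        raise ValueError("page_number must lie between 1 and total_pages")
    return page_number / total_pages


# The example from the text: page 7 of a 12-page document.
print(position_feature(7, 12))
```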
An apparatus according to an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Based on the same inventive concept, an embodiment of the invention provides a key sentence-based content block retrieval apparatus. FIG. 3 is a schematic structural diagram of a key sentence-based content block retrieval apparatus according to an embodiment of the present invention. As shown in FIG. 3, the key sentence-based content block retrieval apparatus 300 includes:
the calculating module 310 is configured to obtain a relevance score between a content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence. Wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture.
And a determining module 320, configured to determine a target content block related to the key sentence from the content blocks based on the relevancy scores of the content blocks and the key sentences.
And the result processing module 330 is configured to use the target content block as a content block retrieval result of the key sentence in the document to be retrieved.
In some embodiments of the present invention, the calculation module 310 includes an extraction unit and a scoring unit.
The extraction unit is used for extracting the characteristics of the content blocks of the document to be retrieved.
And the scoring unit is used for inputting the characteristics of the content block into the relevance scoring model to obtain the relevance score of the content block and the key sentence.
In some embodiments of the invention, the characteristics of the content chunk include at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the superior title of the title corresponding to the content block.
In some embodiments, the extraction unit is specifically configured to: if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation; if the characteristics comprise the contextual word characteristics of the content block, obtaining the contextual word characteristics of the content block based on the content block and the adjacent content blocks of the content block; if the characteristics include word characteristics of the upper level title of the content block, the word characteristics of the upper level title of the content block are obtained based on the upper level title of the content block.
In some embodiments, the extraction unit is specifically configured to: determining an upper level title of the content block based on the title logical tree; the word feature of the upper level title of the content block is obtained based on the upper level title of the content block.
In some embodiments, the key sentence based content block retrieval apparatus 300 further includes an obtaining module and a logical tree generating module.
The acquisition module is used for acquiring the ordered sequence of the titles of the documents to be retrieved.
The logical tree generating module is used for taking the titles in the title ordered sequence in turn as the first title and, for each first title, performing the following operations: if the subtree of the title logical tree in which the previous title of the first title is located contains a second title at the same level as the first title, taking the first title as a sibling node of the second title, so that the parent node of the first title is the same as the parent node of the second title;
and if the subtree in which the previous title is located contains no such second title, taking the first title as a child node of the previous title.
In some embodiments of the present invention, the content block retrieval apparatus 300 based on key sentences further comprises a model training module.
The model training module is used for performing the following operations for the relevance score model of the key sentence: marking content block samples related to the key sentence as positive samples, and marking content block samples unrelated to the key sentence as negative samples; and training the relevance score model of the key sentence using the positive samples and the negative samples.
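The labelling and training procedure can be illustrated with a very small stand-in model. The sketch below trains a logistic regression on bag-of-words counts; this model type, the function names `featurize` and `train_relevance_model`, the hyperparameters and the sample texts are all assumptions made for illustration, and this passage does not state which model the embodiment actually uses.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def featurize(text: str, vocab: List[str]) -> List[float]:
    """Bag-of-words counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def train_relevance_model(samples: List[Tuple[str, int]],
                          epochs: int = 200,
                          lr: float = 0.5) -> Callable[[str], float]:
    """Train on (text, label) pairs, label 1 = positive, 0 = negative.
    Returns a function mapping a content block's text to a score in (0, 1)."""
    vocab = sorted({w for text, _ in samples for w in text.lower().split()})
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text, vocab)
            z = bias + sum(w * v for w, v in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            # Stochastic gradient step on the logistic loss.
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error

    def score(text: str) -> float:
        x = featurize(text, vocab)
        z = bias + sum(w * v for w, v in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    return score


# Content block samples for a hypothetical key sentence about fixed assets.
samples = [
    ("fixed assets depreciation schedule", 1),     # positive sample
    ("net value of fixed assets", 1),              # positive sample
    ("board of directors meeting notice", 0),      # negative sample
    ("employee headcount by region", 0),           # negative sample
]
score = train_relevance_model(samples)
print(score("fixed assets table"), score("directors meeting"))
```

A separate model would be trained for each key sentence, since the positive and negative labels are defined relative to one key sentence.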
In some embodiments of the present invention, the determining module 320 is specifically configured to: and determining the first N content blocks with the highest relevancy scores as target content blocks.
Other details of the content block retrieval device based on the key sentence according to the embodiment of the present invention are similar to those of the content block retrieval method based on the key sentence according to the embodiment of the present invention described above with reference to fig. 1 to 2, and can achieve the corresponding technical effects, which are not described herein again.
Fig. 4 is a block diagram of an exemplary hardware architecture of a content block retrieval device based on a key sentence in the embodiment of the present invention.
As shown in fig. 4, the key sentence based content block retrieval device 400 includes an input device 401, an input interface 402, a central processor 403, a memory 404, an output interface 405, and an output device 406. The input interface 402, the central processing unit 403, the memory 404, and the output interface 405 are connected to each other through a bus 410, and the input device 401 and the output device 406 are connected to the bus 410 through the input interface 402 and the output interface 405, respectively, and further connected to other components of the content block retrieval device 400 based on the key sentence.
Specifically, the input device 401 receives input information from the outside and transmits the input information to the central processor 403 through the input interface 402; the central processor 403 processes the input information based on computer-executable instructions stored in the memory 404 to generate output information, stores the output information temporarily or permanently in the memory 404, and then transmits the output information to the output device 406 through the output interface 405; the output device 406 outputs the output information to the outside of the content block retrieval device 400 based on the key sentence for use by the user.
That is, the key sentence-based content block retrieval device shown in FIG. 4 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement the key sentence-based content block retrieval method and apparatus described in conjunction with FIGS. 1-2.
In one embodiment, the key sentence-based content block retrieval device 400 shown in FIG. 4 may be implemented as a device that includes: a memory for storing a program; and a processor for running the program stored in the memory to execute the key sentence-based content block retrieval method of the embodiment of the invention.
An embodiment of the invention also provides a computer storage medium on which computer program instructions are stored; when executed by a processor, the computer program instructions implement the key sentence-based content block retrieval method.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
As will be apparent to those skilled in the art, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Claims (11)

1. A key sentence-based content block retrieval method, characterized by comprising:
obtaining a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;
determining a target content block related to the key sentence from the content blocks based on the relevance scores between the content blocks and the key sentence;
and taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.
2. The method of claim 1,
the obtaining of the relevance score between the content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence comprises:
extracting the characteristics of the content blocks of the document to be retrieved;
and inputting the characteristics of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence.
3. The method of claim 2, wherein the characteristics of the content block comprise at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the superior title of the title corresponding to the content block.
4. The method of claim 2,
the extracting the characteristics of the content block of the document to be retrieved comprises the following steps:
if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation;
if the characteristics comprise the context word characteristics of the content block, obtaining the context word characteristics of the content block based on the content block and the adjacent content block of the content block;
and if the characteristics comprise the word characteristics of the superior titles of the content blocks, obtaining the word characteristics of the superior titles of the content blocks based on the superior titles of the content blocks.
5. The method of claim 4, wherein obtaining the word feature of the superior title of the content block based on the superior title of the content block comprises:
determining an upper level title of the content block based on a title logical tree;
and obtaining the word characteristics of the superior titles of the content blocks based on the superior titles of the content blocks.
6. The method of claim 5, further comprising:
acquiring a title ordered sequence of the document to be retrieved;
sequentially taking the titles in the title ordered sequence as first titles;
for each first title, performing the following operations:
if a subtree where a previous title of the first title is located in the title logical tree has a second title at the same level as the first title, taking the first title as a peer node of the second title, wherein a father node of the peer node of the second title is the same as a father node of the second title;
and if the subtree where the previous title is located does not have the second title, taking the first title as a child node of the previous title.
7. The method of claim 1, further comprising:
aiming at the relevance scoring model of the key sentences, the following operations are carried out:
marking the content block sample related to the key statement as a positive sample, and marking the content block sample unrelated to the key statement as a negative sample;
and training a relevancy scoring model of the key sentences by using the positive samples and the negative samples.
8. The method of claim 1, wherein determining a target content block related to a key sentence from the content blocks based on the relevance scores of the content blocks and the key sentence comprises:
and determining the first N content blocks with the highest relevancy scores as the target content blocks.
9. An apparatus for retrieving a content block based on a key sentence, the apparatus comprising:
a calculation module, configured to obtain a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;
a determining module, configured to determine, from the content blocks, target content blocks related to the key sentences based on the relevancy scores of the content blocks and the key sentences;
and the result processing module is used for taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.
10. A content block retrieval device based on a key sentence, characterized by comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to execute the key sentence based content block retrieval method of any one of claims 1 to 8.
11. A computer storage medium having computer program instructions stored thereon, which when executed by a processor, implement the key sentence-based content block retrieval method of any one of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010345947.1A CN113641783A (en) 2020-04-27 2020-04-27 Key sentence based content block retrieval method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113641783A true CN113641783A (en) 2021-11-12

Family

ID=78415214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010345947.1A Pending CN113641783A (en) 2020-04-27 2020-04-27 Key sentence based content block retrieval method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113641783A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050108266A1 (en) * 2003-11-14 2005-05-19 Microsoft Corporation Method and apparatus for browsing document content
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN102063474A (en) * 2010-12-16 2011-05-18 西北工业大学 Semantic relevance-based XML (Extensive Makeup Language) keyword top-k inquiring method
CN105005562A (en) * 2014-04-15 2015-10-28 索意互动(北京)信息技术有限公司 Retrieval result display processing method and apparatus
CN105488151A (en) * 2015-11-27 2016-04-13 小米科技有限责任公司 Reference document recommendation method and apparatus
WO2017098341A1 (en) * 2015-12-08 2017-06-15 Kumar Damnish System and method of content tagging and indexing
CN108733766A (en) * 2018-04-17 2018-11-02 腾讯科技(深圳)有限公司 A kind of data query method, apparatus and readable medium
CN109219811A (en) * 2016-05-23 2019-01-15 微软技术许可有限责任公司 Relevant paragraph searching system
CN110134760A (en) * 2019-05-17 2019-08-16 北京思维造物信息科技股份有限公司 A kind of searching method, device, equipment and medium
CN110162764A (en) * 2018-02-12 2019-08-23 北京庖丁科技有限公司 Method for splitting, device, equipment and the medium of electronic document
CN110263345A (en) * 2019-06-26 2019-09-20 北京百度网讯科技有限公司 Keyword extracting method, device and storage medium
CN110532834A (en) * 2018-05-24 2019-12-03 北京庖丁科技有限公司 Table extracting method, device, equipment and medium based on rich text format document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
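付鸿鹄 (Fu Honghu); 张晓林 (Zhang Xiaolin): "基于段落检索和段落内容分析的知识化检索系统设计" [Design of a knowledge-based retrieval system based on paragraph retrieval and paragraph content analysis], 情报理论与实践, no. 05, pages 109 - 113 *
付鸿鹄 (Fu Honghu); 张晓林 (Zhang Xiaolin): "段落检索及其相关算法研究" [Research on paragraph retrieval and related algorithms], 现代图书情报技术, no. 02, pages 44 - 48 *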

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination