CN113641783A - Key sentence based content block retrieval method, device, equipment and medium - Google Patents
Key sentence based content block retrieval method, device, equipment and medium
- Publication number
- CN113641783A (application number CN202010345947.1A)
- Authority
- CN
- China
- Prior art keywords
- content block
- title
- content
- key
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a key sentence based content block retrieval method, device, equipment and medium. The method comprises the following steps: obtaining a relevance score between each content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; determining, from the content blocks, a target content block related to the key sentence based on the relevance scores between the content blocks and the key sentence; and taking the target content block as the content block retrieval result of the key sentence in the document to be retrieved. The key sentence based content block retrieval method, device, equipment and medium provided by the embodiment of the invention can improve the retrieval accuracy of the document.
Description
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for retrieving a content block based on a key sentence.
Background
To obtain the content a user needs from a document to be retrieved, the user must manually enter a keyword in a document tool, which then locates the positions in the document related to that keyword. Taking a Word document as an example, the built-in "search" function can locate the search results for the keyword in the document, such as the sentences in which the keyword appears. The retrieval accuracy of this approach is low.
Disclosure of Invention
The content block retrieval method, the content block retrieval device, the content block retrieval equipment and the content block retrieval medium based on the key sentences can improve the content block retrieval accuracy of the document.
In a first aspect, a key sentence based content block retrieval method is provided, including: obtaining a relevance score between each content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; determining, from the content blocks, a target content block related to the key sentence based on the relevance scores between the content blocks and the key sentence; and taking the target content block as the content block retrieval result of the key sentence in the document to be retrieved.
According to the key sentence based content block retrieval method of the embodiment of the invention, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated using the relevance score model of the key sentence, and a target content block related to the key sentence can be selected from the document to be retrieved based on the relevance scores. Because the relevance score can accurately represent how relevant a content block is to the key sentence, the method can improve the retrieval accuracy of content blocks compared with searching for a keyword in the document to be retrieved.
In an optional implementation manner, obtaining a relevance score between a content block of a document to be retrieved and a key sentence based on a relevance score model of the key sentence includes: extracting the features of the content blocks of the document to be retrieved; and inputting the features of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence.
In this example, by extracting the features of the content block and then calculating the relevancy score between the content block and the key sentence using the features of the content block, the calculation accuracy of the relevancy score can be ensured and the calculation speed can be increased.
In an alternative embodiment, the characteristics of the content block include at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the superior title of the title corresponding to the content block.
In the example, by calculating the word features of the content blocks, the text features of the content blocks can be accurately identified, and the calculation accuracy of the relevancy scores is improved. By using the context word features of the content block, the calculation accuracy of the relevancy score can be improved according to the word features of the surrounding content blocks of the content block. Through the word characteristics of the superior titles of the titles corresponding to the content blocks, the relevancy score can be calculated according to the relevancy between the content blocks and the superior titles, and the calculation accuracy of the relevancy score is improved.
In an alternative embodiment, the extracting the feature of the content block of the document to be retrieved includes: if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation; if the characteristics comprise the contextual word characteristics of the content block, obtaining the contextual word characteristics of the content block based on the content block and the adjacent content blocks of the content block; if the characteristics include word characteristics of the upper level title of the content block, the word characteristics of the upper level title of the content block are obtained based on the upper level title of the content block.
In this example, the features of the content block can be accurately obtained, thereby ensuring the accuracy of the correlation calculation.
In an alternative embodiment, obtaining the word feature of the upper level title of the content block based on the upper level title of the content block includes: determining an upper level title of the content block based on the title logical tree; the word feature of the upper level title of the content block is obtained based on the upper level title of the content block.
In this example, the title corresponding to any node in the title logical tree is the upper-level title of the titles corresponding to that node's child nodes. Establishing the title logical tree therefore determines the hierarchical relationship between titles, so that the upper-level title of a content block can be obtained accurately and the calculation accuracy of the relevance score can be improved.
In an optional embodiment, the method further comprises: acquiring an ordered title sequence of the document to be retrieved; sequentially taking each title in the ordered title sequence as a first title; and, for each first title, performing the following operations: if the subtree of the title logical tree in which the previous title of the first title is located contains a second title at the same level as the first title, taking the first title as a peer node of the second title, wherein the parent node of the peer node is the same as the parent node of the second title; and if the subtree in which the previous title is located contains no such second title, taking the first title as a child node of the previous title.
In the prior-art method of generating a directory structure from a title template, the number of levels in the generated directory structure is the same as the number set in the title template. For example, if only 3 title levels are set in the template, at most 3 title levels can be generated. With the method of the embodiment of the invention, each title is compared with the titles already added to the title logical tree to decide whether it is at the same level as one of them; if it is not, it is added as a child node of the previous title. Corresponding title levels can therefore be generated even if the document to be processed has many title levels, for example 8 or 9 levels. Compared with generating the directory structure from a title template, the method can improve the flexibility, accuracy and depth of the generated directory structure.
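The tree construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each title arrives with a hypothetical `style` label (for example "chapter" or "(n)") that stands in for the same-level test, and it reads "the subtree in which the previous title is located" as the chain of ancestors of the previous title. The names `TitleNode` and `build_title_tree` are invented for this sketch.

```python
class TitleNode:
    """One node of the title logical tree."""

    def __init__(self, text, style, parent=None):
        self.text = text
        self.style = style      # hypothetical same-level label (assumption)
        self.parent = parent
        self.children = []


def build_title_tree(titles):
    """Build a title logical tree from an ordered list of (style, text) pairs."""
    root = TitleNode("<root>", "<root>")
    previous = root
    for style, text in titles:
        # Walk up from the previous title, looking for a "second title"
        # whose level (style) matches the current "first title".
        anchor = previous
        while anchor is not root and anchor.style != style:
            anchor = anchor.parent
        if anchor is root:
            parent = previous        # no match: child of the previous title
        else:
            parent = anchor.parent   # match: peer of the second title
        node = TitleNode(text, style, parent)
        parent.children.append(node)
        previous = node
    return root


example_titles = [
    ("chapter", "Chapter 1"),
    ("section", "Section 1"),
    ("item", "1."),
    ("item", "2."),
    ("section", "Section 2"),
    ("chapter", "Chapter 2"),
]
tree = build_title_tree(example_titles)
chapter_1 = tree.children[0]
section_1 = chapter_1.children[0]
```

Because the depth is never capped, a document with 8 or 9 distinct styles simply yields a tree 8 or 9 levels deep.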
In an optional embodiment, the method further comprises performing the following operations for the relevance score model of the key sentence: marking content block samples related to the key sentence as positive samples, and marking content block samples unrelated to the key sentence as negative samples; and training the relevance score model of the key sentence using the positive samples and the negative samples.
The relevance scoring model is trained by a method of marking the content blocks as positive and negative samples, so that the accuracy of the model can be improved.
In an alternative embodiment, determining a target content block related to a key sentence from the content blocks based on the relevance scores between the content blocks and the key sentence includes: determining the N content blocks with the highest relevance scores as target content blocks.
In this embodiment, since the relevance score can accurately represent the relevance between a content block and the key sentence, calculating the relevance score can improve the retrieval accuracy. In addition, content blocks with low relevance to the key sentence are screened out, which can improve content block retrieval efficiency and return the content block retrieval result the user expects.
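As a small illustration of the top-N selection, the sketch below picks the N highest-scoring content blocks from a list of (block identifier, relevance score) pairs. The function name `select_target_blocks` and the identifiers in the example are hypothetical.

```python
import heapq


def select_target_blocks(scored_blocks, n):
    """Return the n (block_id, score) pairs with the highest relevance scores,
    ordered from the most to the least relevant."""
    return heapq.nlargest(n, scored_blocks, key=lambda item: item[1])


example_scores = [("B2", 0.20), ("C1", 0.90), ("D1", 0.50), ("B3", 0.05)]
top_two = select_target_blocks(example_scores, 2)
```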
In a second aspect, a content block retrieval apparatus based on a key sentence is provided, including: the calculation module is used for obtaining a relevance score of a content block of a document to be retrieved and the key sentences based on the relevance score model of the key sentences, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture; the determining module is used for determining a target content block related to the key sentences from the content blocks based on the relevancy scores of the content blocks and the key sentences; and the result processing module is used for taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.
In a third aspect, a content block retrieval device based on a key sentence is provided, including: a memory for storing a program; a processor, configured to execute a program stored in the memory to execute the method for retrieving a content block based on a key statement provided in the first aspect or any optional implementation manner of the first aspect.
In a fourth aspect, a computer storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the method for retrieving a content block based on a key sentence provided in the first aspect or any optional implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram illustrating a key sentence based content block retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an exemplary logical tree of elements, in accordance with an embodiment of the present invention;
FIG. 3 is a schematic structural diagram illustrating a content block retrieval apparatus based on a key sentence according to an embodiment of the present invention;
FIG. 4 is a block diagram of an exemplary hardware architecture of a content block retrieval device based on a key sentence in the embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides a key sentence based content block retrieval scheme, which is suitable for scenarios in which a sentence to be retrieved is entered in order to retrieve the key content blocks of a document. It is particularly suitable for retrieving documents with a specific text structure, for example complex financial information texts such as prospectuses, bond prospectuses, annual reports, financial reports, merger and acquisition reports, rating reports, research reports, legal contract documents, and public opinion news. After the key sentence is obtained, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated, the target content blocks related to the key sentence are determined from the document to be retrieved according to the relevance scores, and the target content blocks are used as the content block retrieval result of the key sentence in the document to be retrieved.
In the embodiment of the invention, the document to be retrieved refers to an electronic document from which the text and diagram information of the document can be acquired. Specifically, it may be an electronic document in Word, PDF, TXT, or a similar format. In addition, the document to be retrieved can be regarded as being composed of a plurality of paragraphs, wherein a table, a picture, a chart, a title and the like can each be regarded as one paragraph. Therefore, the document to be retrieved can be divided, in units of paragraphs, into a plurality of mutually independent content blocks. That is, the content block of the document to be retrieved includes at least one of a text content paragraph, a title, a table, a chart, and a picture.
A document to be retrieved often contains multi-level titles, arranged from the highest level to the lowest as first-level titles, second-level titles, third-level titles, and so on. A higher-level title often has multiple lower-level titles subordinate to it. An L-th level title is subordinate to L-1 higher-level titles, and all L-1 titles to which the L-th level title belongs are its upper-level titles. Illustratively, if the second chapter of the document to be retrieved contains the five-level title "(1) fixed asset case", that title has the following upper-level titles, in order from the lowest level to the highest: the four-level title "19, fixed asset", the three-level title "seven, merge financial statement item comments", the second-level title "eleventh section, financial report", and the first-level title "chapter two, fixed asset". For ease of understanding, the following embodiments of the present invention will continue to use this five-level title as an example.
Since a title tends to be a high-level summary of one or more successive text content paragraphs, each title tends to be followed immediately by one or more successive content blocks, such as text content paragraphs, pictures, charts, tables, and the like. In the embodiment of the present invention, a content block that follows a certain title, before the next title appears, may be considered to correspond to that title. Illustratively, suppose the content blocks appear in the document to be retrieved in the following order: three-level title A31, text content paragraph B2, table C1, chart D1, four-level title A41, text content paragraph B3, text content paragraph B4, three-level title A32. Then text content paragraph B2, table C1 and chart D1 correspond to three-level title A31, and text content paragraphs B3 and B4 correspond to four-level title A41.
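The correspondence between titles and the content blocks that follow them can be sketched as a single pass over the ordered blocks. This is an illustrative sketch only; the `(kind, name)` representation and the function name `assign_blocks_to_titles` are assumptions, and the block names reuse the example sequence A31, B2, C1, D1, A41, B3, B4, A32.

```python
def assign_blocks_to_titles(blocks):
    """Map each title to the non-title content blocks that follow it,
    up to (but not including) the next title."""
    mapping = {}
    current_title = None
    for kind, name in blocks:
        if kind == "title":
            current_title = name
            mapping[current_title] = []
        elif current_title is not None:
            mapping[current_title].append(name)
    return mapping


example_blocks = [
    ("title", "A31"), ("paragraph", "B2"), ("table", "C1"), ("chart", "D1"),
    ("title", "A41"), ("paragraph", "B3"), ("paragraph", "B4"),
    ("title", "A32"),
]
block_map = assign_blocks_to_titles(example_blocks)
```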
In order to better understand the technical solution of the embodiment of the present invention, a method, an apparatus, a device, and a medium for retrieving a content block based on a key sentence according to the embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart illustrating a key sentence-based content block retrieval method according to an embodiment of the present invention. As shown in fig. 1, the content block retrieval method 100 based on key sentences in the present embodiment may include the following steps:
S110, obtaining the relevance score between each content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence.
First, regarding the key sentence: before S110, a plurality of key sentences may be preset, and one of them is then selected as the key sentence. The key sentence characterizes what the user desires to retrieve from the document to be retrieved. For example, for a prospectus, the key sentence may be "net profit", "business income of main business", or the like. The key sentence can be set according to actual requirements and the application scenario of the document to be retrieved, which is not limited herein. In addition, the key sentence may consist of at least one complete sentence or at least one word, which is not particularly limited.
Next, regarding the relevance score between a content block and the key sentence: the relevance score indicates the degree of relevance between the content block and the key sentence. The higher the relevance score, the more relevant the content block is to the key sentence. For example, if the document to be retrieved includes M content blocks, the relevance score between each of the M content blocks and the key sentence may be calculated. Optionally, the relevance score lies in the range [0, 1]. In some embodiments, the relevance score may be determined based on a pre-trained relevance score model of the key sentence.
Then, regarding the relevance score model of the key sentence: the relevance score model may be a Gradient Boosting Decision Tree (GBDT) regression model or a logistic regression model. Preferably, to balance the speed and accuracy of score calculation, the relevance score model is a logistic regression model.
In the process of training the relevance score model, L content block samples may first be selected; among the L content block samples, those related to the key sentence are marked as positive samples, and those unrelated to the key sentence are marked as negative samples. The relevance score model of the key sentence is then trained using the positive samples and the negative samples. To obtain the L content block samples, K training documents may be selected, and all content blocks in the K training documents are used as content block samples, so that the total number of content blocks in the K training documents is L. The label of a positive sample may be set to 1 and the label of a negative sample may be set to 0; that is, the expected predicted relevance score is 1 for a positive sample and 0 for a negative sample. Optionally, if the relevance score is calculated from the features of the content block and the features include R sub-features, the training data for the relevance score model may be implemented as a two-dimensional data matrix with L rows and R columns.
In addition, as noted above, several key sentences may be preset before S110, with one of them selected as the key sentence. To ensure the calculation accuracy and the calculation speed of the relevance score, a separate relevance score model may be established for each of the preset key sentences. After the key sentence is determined, its relevance score model can be selected from the trained relevance score models. The relevance score model of each key sentence is trained in the manner described above, which is not repeated here.
In one example, if financial information texts are to be retrieved using the key sentence based content block retrieval scheme provided by the embodiment of the invention, the data texts disclosed by the major stock exchanges can be used as training documents. To verify the accuracy of the relevance score model trained by the embodiment of the invention, the relevance score model of each key sentence can be evaluated using, as the evaluation index, the recall rate of the 5 content blocks most relevant to the key sentence in the training text. If there are P key sentences in total, the average of the evaluation indexes of the P key sentences may be used as the total evaluation index. When data texts of the Hong Kong, Shanghai and Shenzhen stock exchanges are used in turn as training samples, the total evaluation index of the relevance score model reaches 88%, 97% and 93% respectively, which indicates that the relevance score model is highly accurate.
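The training setup can be illustrated with a deliberately small logistic regression written in plain Python. This is a sketch under stated assumptions, not the patented training procedure: the gradient-descent settings, the function names, and the toy 4 x 2 matrix (L = 4 samples, R = 2 sub-features) are all invented for illustration. A production system would more likely use an existing machine-learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_relevance_model(X, y, learning_rate=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent.
    X is an L x R feature matrix (list of rows); y holds labels 1 or 0."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, row)) + bias
            error = sigmoid(z) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


def relevance_score(weights, bias, features):
    """Predicted relevance score in the range (0, 1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


# Toy training matrix: positive samples are labelled 1, negative samples 0.
X_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y_train = [1, 1, 0, 0]
model_w, model_b = train_relevance_model(X_train, y_train)
score_related = relevance_score(model_w, model_b, [1.0, 0.0])
score_unrelated = relevance_score(model_w, model_b, [0.0, 1.0])
```

The sigmoid output keeps every predicted score between 0 and 1, which matches the optional score range mentioned earlier.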
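The evaluation index can be sketched as recall over the 5 top-ranked content blocks, averaged over the key sentences. The text does not spell out the exact recall formula, so the definition below (hits among the top k divided by the number of truly relevant blocks) is an assumption, as are the function names and the toy data.

```python
def recall_at_k(ranked_block_ids, relevant_block_ids, k=5):
    """Fraction of the truly relevant blocks that appear among the top k."""
    if not relevant_block_ids:
        return 0.0
    hits = sum(1 for block_id in ranked_block_ids[:k] if block_id in relevant_block_ids)
    return hits / len(relevant_block_ids)


def total_evaluation_index(per_key_sentence, k=5):
    """Average recall over P key sentences.
    per_key_sentence is a list of (ranked_block_ids, relevant_block_ids)."""
    values = [recall_at_k(ranked, relevant, k) for ranked, relevant in per_key_sentence]
    return sum(values) / len(values)


example_runs = [
    (["a", "b", "c", "d", "e", "f"], {"a", "f"}),  # "f" is ranked sixth: recall 0.5
    (["x", "y"], {"x"}),                            # recall 1.0
]
overall = total_evaluation_index(example_runs)
```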
Finally, in order to ensure the calculation accuracy of the correlation score, the features of each content block may be extracted first, and then the correlation score between the content block and the key sentence may be calculated using the features. Accordingly, a specific embodiment of S110 includes:
The first step is to extract the features of the content blocks of the document to be retrieved. The features of a content block include at least one of: the word feature of the content block, the context word feature of the content block, and the word feature of the upper-level title of the title corresponding to the content block. Optionally, the features of the content block may also include the position of the content block in the document. Depending on which features are used, the specific implementation of the first step differs. The following describes the first step in three examples, one for each kind of feature.
In one example, extracting the features of the content blocks of the document to be retrieved includes: if the features of the content block include the word feature of the content block, preprocessing the content block, and then acquiring the word feature of the preprocessed content block.
First, regarding the preprocessing operation: if the document to be retrieved is a Chinese document, the preprocessing may include a word segmentation operation. Word segmentation may be performed using the jieba word segmentation technique, or another suitable word segmentation method may be selected according to the specific working scenario and requirements, which is not limited herein. During word segmentation, the set of characters in all cells of a table can be used as the text content of the table, and word segmentation is performed on that text content. It should be noted that, if the document to be retrieved is a Chinese document, word segmentation of the content block may also be performed in another manner. For example, the word segmentation operation may be performed using an n-gram model. For example, if the content block includes the text content "content block search model" and n in the n-gram model takes a value of 2, the content block can be divided into information, extraction, module extraction, and model after the word segmentation operation is performed.
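For orientation, a generic n-gram split can be sketched as a sliding window over a token sequence. This shows only the common sliding-window reading of "n-gram"; whether the described scheme also keeps single tokens alongside the n-grams is not specified, so the helper `ngrams` and its `separator` parameter should be treated as assumptions.

```python
def ngrams(tokens, n, separator=" "):
    """Return all runs of n consecutive tokens, joined by separator."""
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Word-level bigrams of a whitespace-delimited phrase.
word_bigrams = ngrams("content block search model".split(), 2)

# Character-level bigrams, as might be used for unsegmented text.
char_bigrams = ngrams(list("abcd"), 2, separator="")
```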
The preprocessing may also include operations to remove redundant text. Specifically, the lowest word frequency and/or the highest word frequency may be set. By setting the lowest word frequency, if the frequency of a certain word appearing in the document to be retrieved is lower than the lowest word frequency, the low-frequency word can be deleted from the content block, so that the unusual words, stop words and misspelled words in the document can be removed. By setting the highest word frequency, if the frequency of a certain word appearing in the document to be retrieved is higher than the highest word frequency, the high-frequency word can be deleted from the content block, so that words without actual meanings such as mood auxiliary words, structure auxiliary words and the like in the document can be removed, such as 'yes' and 'yes'. By removing the words, the accuracy of relevancy scoring can be improved.
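The frequency-based removal of redundant words can be sketched as follows. The sketch assumes that "frequency" means the raw count of a word across all content blocks of the document, and that both thresholds are optional; the function name and the toy data are hypothetical.

```python
from collections import Counter


def remove_redundant_words(blocks, min_freq=None, max_freq=None):
    """Drop words whose document-wide count is below min_freq or above max_freq.
    blocks is a list of token lists, one per content block."""
    counts = Counter(word for block in blocks for word in block)

    def keep(word):
        count = counts[word]
        if min_freq is not None and count < min_freq:
            return False
        if max_freq is not None and count > max_freq:
            return False
        return True

    return [[word for word in block if keep(word)] for block in blocks]


example_tokens = [
    ["the", "net", "profit"],
    ["the", "net", "incme"],      # "incme" is a deliberate misspelling
    ["the", "revenue"],
]
filtered = remove_redundant_words(example_tokens, min_freq=2, max_freq=2)
```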
Secondly, for the word feature of the content block, the word feature of the content block may be calculated based on a term frequency-inverse document frequency (tfidf) algorithm, and at this time, the word feature of the content block may be a word vector sparse matrix. Specifically, for each entry of each content block, a tfidf algorithm may be utilized to extract a first word feature characterizing word features of the entry in a table and a second word feature characterizing word features of the entry in a non-table. Illustratively, the word feature may be a tfidf value. In addition, the text content of the preprocessed content block can be calculated by using a tfidf algorithm, so that a word vector sparse matrix of the content block is obtained. It should be noted that, if the text in the target document is not chinese, the text block may also be processed directly by using the tfidf algorithm without preprocessing the content block, so as to obtain the word vector sparse matrix of the content block. In addition, other algorithms may be used to obtain the word features of the content block, which is not limited to this.
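A bare-bones TF-IDF computation over tokenized content blocks is sketched below, with each block's features held as a sparse dictionary. TF-IDF has several variants; this sketch assumes term frequency normalized by block length and an unsmoothed logarithmic inverse document frequency, treating each content block as one "document". The function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_features(blocks):
    """Return one sparse {term: tf-idf value} dictionary per content block.
    blocks is a list of token lists."""
    n_blocks = len(blocks)
    # Number of blocks in which each term occurs at least once.
    block_freq = Counter(term for block in blocks for term in set(block))
    features = []
    for block in blocks:
        term_counts = Counter(block)
        features.append({
            term: (count / len(block)) * math.log(n_blocks / block_freq[term])
            for term, count in term_counts.items()
        })
    return features


example_block_tokens = [["net", "profit"], ["net", "income"]]
block_features = tfidf_features(example_block_tokens)
```

A term that occurs in every block, such as "net" here, receives a value of zero, which is why very common words contribute little to the relevance score.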
In another example, if the features of the content block include a context word feature of the content block, the context word feature is obtained based on the content block and the content blocks adjacent to it. The word features of the adjacent content blocks can be spliced (concatenated) to obtain the context word feature of the content block. Other algorithms may also be used to obtain the context word feature of the content block, and this is not limited here.
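One possible reading of "splicing" the adjacent blocks' word features is plain vector concatenation, sketched below. The dense list representation, the `window` parameter, and the zero padding at the start and end of the document are assumptions introduced for the example.

```python
def context_features(block_vectors, window=1):
    """For each block, concatenate the feature vectors of its neighbours
    (the `window` blocks before and after it). Missing neighbours at the
    document boundaries are padded with zero vectors (an assumption)."""
    if not block_vectors:
        return []
    dim = len(block_vectors[0])
    zero = [0.0] * dim
    result = []
    for i in range(len(block_vectors)):
        spliced = []
        for j in range(i - window, i + window + 1):
            if j == i:
                continue  # only the adjacent blocks are spliced
            inside = 0 <= j < len(block_vectors)
            spliced.extend(block_vectors[j] if inside else zero)
        result.append(spliced)
    return result


vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
ctx = context_features(vectors)
```

With `window=1`, the middle block receives the features of the first and third blocks, and the boundary blocks receive a zero vector in place of the missing neighbour.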
In yet another example, if the features of the content block include a word feature of the upper-level titles of the content block, that word feature is obtained based on the text content of the upper-level titles of the content block. If the content block is not a title, the upper-level titles of the content block comprise the title corresponding to the content block and the upper-level titles of that title. When calculating the word feature of the upper-level titles of the content block, the feature may be calculated based on the text content of the title corresponding to the content block and the text content of the upper-level titles of that title. For example, if the content block corresponds to the third-level title "7, Notes to Consolidated Financial Statement Items", the word feature of the upper-level titles may be calculated from the text content of the third-level title "7, Notes to Consolidated Financial Statement Items", the second-level title "Section 11, Financial Report", and the first-level title "Chapter 2, Fixed Assets". The method for calculating the word feature of the upper-level titles of the content block is similar to the method for calculating the word feature of the preprocessed content block in the first example, and is not described again.
In a specific example, in order to accurately extract the word feature of the upper-level titles of the content block, the upper-level titles of the content block need to be acquired accurately. The manner of obtaining the upper-level titles of the content block may include: determining the upper-level titles of the content block according to the title logical tree.
First, regarding the specific structure of the title logical tree, the structure may be as shown in Fig. 2. The title logical tree consists of a root node R0, a first subtree formed by child nodes A1 to A7, a second subtree formed by child nodes A8 to A13, and a third subtree formed by child nodes A14 to A19, where A1, A8, and A14 are the three child nodes directly connected to R0. The three subtrees are not directly connected to one another.
In the title logical tree shown in Fig. 2, the root node R0 may be the topic name or the topic of the document. Alternatively, the root node R0 shown in Fig. 2 may be left empty, i.e., the root node R0 is not used to represent the hierarchical structure of the table of contents. All the child nodes constituting the three subtrees are titles. For any child node in a subtree, its parent node is its upper-level title, and its child nodes are its lower-level titles. For example, child node A1 is a first-level title, and child node A2 is a second-level title under that first-level title.
Correspondingly, for any child node in the title logical tree, determining the upper-level titles of the content block according to the title logical tree specifically includes: all nodes on the path connecting the root node of the title logical tree and that child node are upper-level nodes of the child node. For example, for child node A6, the upper-level nodes include A1, A3, and A5.
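The lookup of upper-level titles can be sketched as a walk from a node up to the root. The parent map below is a hypothetical encoding of part of Fig. 2: only the fact that A1, A3, and A5 are upper-level nodes of A6 comes from the text, while the exact edges and the function name `upper_titles` are assumptions.

```python
def upper_titles(node, parent, root="R0"):
    """Return the upper-level titles of `node`, outermost first, by
    following parent links until the root node is reached."""
    chain = []
    current = parent.get(node)
    while current is not None and current != root:
        chain.append(current)
        current = parent.get(current)
    chain.reverse()
    return chain


# Hypothetical parent links for the path R0 -> A1 -> A3 -> A5 -> A6.
parent = {"A1": "R0", "A3": "A1", "A5": "A3", "A6": "A5"}
ancestors = upper_titles("A6", parent)
```

The root node itself is excluded from the result, since it may be left empty and need not represent a title.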
Second, the method of constructing the title logical tree may include the following first sub-step and second sub-step.
The specific steps are as follows:
In the first sub-step, the title ordered sequence of the document to be retrieved is obtained. The order of the titles in the title ordered sequence is the same as the order in which the titles appear in the document to be retrieved. Illustratively, the titles, in the order in which they appear in the document to be retrieved, are title A1, title A2, ..., title Am, where the subscript of each title indicates its order of appearance in the document. The title ordered sequence is then {title A1, title A2, ..., title Am}, where m is a positive integer.
In the second sub-step, the titles in the title ordered sequence are taken in turn as the first title, and the following operations are performed for each first title.
First, if the subtree of the title logical tree in which the previous title of the first title is located contains a second title at the same level as the first title, the first title is taken as a peer node of the second title; that is, the parent node of the first title is the same as the parent node of the second title. Whether the first title and the second title are at the same level may be judged by a title discrimination model. For example, the title discrimination model may include a feedforward neural network (FNN) model and a second Softmax classifier.
To aid understanding of the second sub-step, it is described below with reference to Fig. 2. Continuing with Fig. 2, if title A14 is taken as the first title, the subtree in which its previous title A13 is located is the subtree formed by child node A8 and all child nodes A9 to A13 directly or indirectly connected to A8. It is then necessary to determine whether a same-level title of A14 exists among child nodes A8 to A13. If A11 and A14 are at the same level, A14 is determined to be a child node of A10, and A14 is connected below A10. If child node A8 is a same-level title of A14, A14 is connected to the root node R0. In this case, A14 serves as the starting node of the third subtree of the title logical tree.
Second, if the subtree in which the previous title is located contains no second title, the first title is taken as a child node of the previous title. Continuing with the previous example, if none of child nodes A8 to A13 is a same-level title of A14, A14 is determined to be a child node of A13, and A14 is connected below A13.
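The construction of the title logical tree can be sketched as below. This is a simplified illustration under stated assumptions: the title discrimination model is replaced by a hypothetical `same_level` callback that compares hand-assigned level numbers, the first title is attached directly to the root, and when several same-level titles exist the most recent one is used as the peer.

```python
def build_title_tree(titles, same_level, root="R0"):
    """Build a {title: parent} map from titles given in document order.

    `same_level(a, b)` stands in for the title discrimination model.
    """
    parent = {}
    subtree = []      # titles of the top-level subtree holding the previous title
    previous = None
    for title in titles:
        if previous is None:
            parent[title] = root          # assumption: first title hangs off the root
        else:
            # Look for a same-level title in the previous title's subtree,
            # most recent first.
            peer = next((t for t in reversed(subtree) if same_level(t, title)), None)
            # Peer found: share its parent. Otherwise: child of the previous title.
            parent[title] = parent[peer] if peer is not None else previous
        if parent[title] == root:
            subtree = [title]             # the title starts a new top-level subtree
        else:
            subtree.append(title)
        previous = title
    return parent


# Hypothetical heading levels standing in for the model's judgement.
levels = {"A1": 1, "A2": 2, "A3": 2, "A4": 3, "A5": 1, "A6": 2}
tree = build_title_tree(list(levels), lambda a, b: levels[a] == levels[b])
```

In this example A3 becomes a peer of A2 under A1, A4 becomes a child of A3 because no same-level title exists yet, and A5 matches the level of A1 and therefore starts a new subtree under the root.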
After the features of the content block are extracted in the first step, S110 may further include a second step: inputting the features of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence. For the description of the relevance score and the relevance score model, reference may be made to the related content in the above embodiments of the present invention, and details are not repeated here.
S120, determining a target content block related to the key sentence from the content blocks based on the relevance scores between the content blocks and the key sentence.
In some embodiments, if the document to be retrieved includes M content blocks, the N content blocks with the highest relevance scores among the M content blocks may be determined as the target content blocks, where N is a positive integer not greater than M. N may be set according to the specific work scenario and work requirements, for example N = 100, and this is not limited here.
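Selecting the N highest-scoring content blocks can be sketched with the standard library as follows. The block identifiers and scores are made-up values for illustration, and `top_n_blocks` is a hypothetical helper name.

```python
import heapq


def top_n_blocks(scores, n):
    """Return the ids of the n content blocks with the highest relevance
    scores, ordered from highest to lowest score.

    `scores` maps a block id to its relevance score for one key sentence.
    """
    return heapq.nlargest(n, scores, key=scores.get)


scores = {"block-1": 0.2, "block-2": 0.9, "block-3": 0.5, "block-4": 0.7}
targets = top_n_blocks(scores, 2)
```

Because the result is already ordered from the highest score to the lowest, the same list can be used directly when the target content blocks are displayed in descending order of relevance.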
S130, the target content block is used as a content block retrieval result of the key sentence in the document to be retrieved.
According to the key sentence-based content block retrieval method of the embodiment of the present invention, the relevance score between each content block in the document to be retrieved and the key sentence can be calculated using the relevance score model of the key sentence, and the target content block related to the key sentence can be selected from the content blocks based on the relevance score. Because the relevance score can accurately characterize the relevance between a content block and the key sentence, the method can improve the accuracy of content block retrieval compared with searching the document to be retrieved for keywords.
In some embodiments, the target content blocks may be displayed on the display interface in order of high to low relevancy scores of the target content blocks to the key sentences.
In some embodiments, a position feature of the target content block may be extracted so that the target content block can be located quickly when the user needs it. The position feature of the target content block may be the ratio of the page number on which the content block is located in the document to be retrieved to the total number of pages of the document to be retrieved. For example, if the target content block is on page 7 of the document to be retrieved and the document has 12 pages in total, the value of the position feature of the target content block is 7/12.
An apparatus according to an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Based on the same inventive concept, an embodiment of the present invention provides a key sentence-based content block retrieval apparatus. Fig. 3 is a schematic structural diagram of a key sentence-based content block retrieval apparatus according to an embodiment of the present invention. As shown in Fig. 3, the key sentence-based content block retrieval apparatus 300 includes:
a calculating module 310, configured to obtain a relevance score between a content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;
a determining module 320, configured to determine a target content block related to the key sentence from the content blocks based on the relevance scores between the content blocks and the key sentence; and
a result processing module 330, configured to use the target content block as the content block retrieval result of the key sentence in the document to be retrieved.
In some embodiments of the present invention, the calculation module 310 includes an extraction unit and a scoring unit.
The extraction unit is used for extracting the characteristics of the content blocks of the document to be retrieved.
The scoring unit is configured to input the features of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence.
In some embodiments of the invention, the features of the content block include at least one of: the word feature of the content block, the context word feature of the content block, and the word feature of the upper-level titles of the title corresponding to the content block.
In some embodiments, the extraction unit is specifically configured to: if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation; if the characteristics comprise the contextual word characteristics of the content block, obtaining the contextual word characteristics of the content block based on the content block and the adjacent content blocks of the content block; if the characteristics include word characteristics of the upper level title of the content block, the word characteristics of the upper level title of the content block are obtained based on the upper level title of the content block.
In some embodiments, the extraction unit is specifically configured to: determining an upper level title of the content block based on the title logical tree; the word feature of the upper level title of the content block is obtained based on the upper level title of the content block.
In some embodiments, the key sentence based content block retrieval apparatus 300 further includes an obtaining module and a logical tree generating module.
The acquisition module is used for acquiring the ordered sequence of the titles of the documents to be retrieved.
The logical tree generating module is configured to take the titles in the title ordered sequence in turn as the first title and, for each first title, perform the following operations: if the subtree of the title logical tree in which the previous title of the first title is located contains a second title at the same level as the first title, take the first title as a peer node of the second title, where the parent node of the first title is the same as the parent node of the second title;
and if the subtree in which the previous title is located contains no second title, take the first title as a child node of the previous title.
In some embodiments of the present invention, the content block retrieval apparatus 300 based on key sentences further comprises a model training module.
The model training module is configured to perform the following operations for the relevance score model of the key sentence: labeling content block samples related to the key sentence as positive samples and content block samples unrelated to the key sentence as negative samples; and training the relevance score model of the key sentence using the positive samples and the negative samples.
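Training on positive and negative samples can be illustrated with a small binary classifier. The passage above does not specify the model type, so the logistic regression below, trained by stochastic gradient descent, is only an assumed stand-in; the two-dimensional feature vectors, the learning rate, and the function names are likewise hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_relevance_model(samples, labels, lr=0.5, epochs=500):
    """Fit weights and a bias with per-sample gradient steps on the
    logistic loss. Labels: 1 = related to the key sentence, 0 = unrelated."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def relevance_score(x, w, b):
    """Score in (0, 1); higher means more related to the key sentence."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical feature vectors of labeled content block samples.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
labels = [1, 1, 0, 0]
w, b = train_relevance_model(samples, labels)
```

One such model would be trained per key sentence, and its output for a new content block would serve as that block's relevance score.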
In some embodiments of the present invention, the determining module 320 is specifically configured to determine the N content blocks with the highest relevance scores as the target content blocks.
Other details of the content block retrieval device based on the key sentence according to the embodiment of the present invention are similar to those of the content block retrieval method based on the key sentence according to the embodiment of the present invention described above with reference to fig. 1 to 2, and can achieve the corresponding technical effects, which are not described herein again.
Fig. 4 is a block diagram of an exemplary hardware architecture of a content block retrieval device based on a key sentence in the embodiment of the present invention.
As shown in fig. 4, the key sentence based content block retrieval device 400 includes an input device 401, an input interface 402, a central processor 403, a memory 404, an output interface 405, and an output device 406. The input interface 402, the central processing unit 403, the memory 404, and the output interface 405 are connected to each other through a bus 410, and the input device 401 and the output device 406 are connected to the bus 410 through the input interface 402 and the output interface 405, respectively, and further connected to other components of the content block retrieval device 400 based on the key sentence.
Specifically, the input device 401 receives input information from the outside and transmits the input information to the central processor 403 through the input interface 402; the central processor 403 processes the input information based on computer-executable instructions stored in the memory 404 to generate output information, stores the output information temporarily or permanently in the memory 404, and then transmits the output information to the output device 406 through the output interface 405; the output device 406 outputs the output information to the outside of the content block retrieval device 400 based on the key sentence for use by the user.
That is, the key sentence-based content block retrieval device shown in Fig. 4 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, can implement the key sentence-based content block retrieval method and apparatus described in conjunction with Figs. 1-2.
In one embodiment, the key sentence based content block retrieval device 400 shown in fig. 4 may be implemented as a device that may include: a memory for storing a program; and the processor is used for operating the program stored in the memory to execute the content block retrieval method based on the key statement of the embodiment of the invention.
The embodiment of the invention also provides a computer storage medium, wherein computer program instructions are stored on the computer storage medium, and when being executed by a processor, the computer program instructions realize the content block retrieval method based on the key statement.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
As will be apparent to those skilled in the art, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Claims (11)
1. A content block retrieval method based on key sentences is characterized by comprising the following steps:
obtaining, based on a relevance score model of a key sentence, a relevance score between a content block of a document to be retrieved and the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;
determining a target content block related to the key sentence from the content blocks based on the relevancy scores of the content blocks and the key sentences;
and taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.
2. The method of claim 1,
the obtaining of the relevance score between the content block of the document to be retrieved and the key sentence based on the relevance score model of the key sentence comprises:
extracting the characteristics of the content blocks of the document to be retrieved;
and inputting the characteristics of the content block into the relevance score model to obtain the relevance score between the content block and the key sentence.
3. The method of claim 2, wherein the characteristics of the content block comprise at least one of: the word feature of the content block, the context word feature of the content block and the word feature of the superior title of the title corresponding to the content block.
4. The method of claim 2,
the extracting the characteristics of the content block of the document to be retrieved comprises the following steps:
if the characteristics comprise word characteristics of the content block, performing preprocessing operation on the content block, and acquiring the word characteristics of the preprocessed content block, wherein the preprocessing operation comprises word segmentation operation and/or redundant character removal operation;
if the characteristics comprise the context word characteristics of the content block, obtaining the context word characteristics of the content block based on the content block and the adjacent content block of the content block;
and if the characteristics comprise the word characteristics of the superior titles of the content blocks, obtaining the word characteristics of the superior titles of the content blocks based on the superior titles of the content blocks.
5. The method of claim 4, wherein obtaining the word feature of the superior title of the content block based on the superior title of the content block comprises:
determining an upper level title of the content block based on a title logical tree;
and obtaining the word characteristics of the superior titles of the content blocks based on the superior titles of the content blocks.
6. The method of claim 5, further comprising:
acquiring a title ordered sequence of the document to be retrieved;
sequentially taking the titles in the title ordered sequence as first titles;
for each first title, performing the following operations:
if a subtree where a previous title of the first title is located in the title logical tree has a second title at the same level as the first title, taking the first title as a peer node of the second title, wherein a father node of the peer node of the second title is the same as a father node of the second title;
and if the subtree where the previous title is located does not have the second title, taking the first title as a child node of the previous title.
7. The method of claim 1, further comprising:
performing the following operations for the relevance score model of the key sentence:
marking the content block sample related to the key statement as a positive sample, and marking the content block sample unrelated to the key statement as a negative sample;
and training a relevancy scoring model of the key sentences by using the positive samples and the negative samples.
8. The method of claim 1, wherein determining a target content block related to a key sentence from the content blocks based on the relevance scores of the content blocks and the key sentence comprises:
and determining the first N content blocks with the highest relevancy scores as the target content blocks.
9. An apparatus for retrieving a content block based on a key sentence, the apparatus comprising:
a calculation module, configured to obtain, based on a relevance score model of a key sentence, a relevance score between a content block of a document to be retrieved and the key sentence, wherein the content block comprises at least one of a text paragraph, a title, a table, a chart and a picture;
a determining module, configured to determine, from the content blocks, target content blocks related to the key sentences based on the relevancy scores of the content blocks and the key sentences;
and the result processing module is used for taking the target content block as a content block retrieval result of the key sentence in the document to be retrieved.
10. A content block retrieval device based on a key sentence, characterized by comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to execute the key sentence based content block retrieval method of any one of claims 1 to 8.
11. A computer storage medium having computer program instructions stored thereon, which when executed by a processor, implement the key sentence-based content block retrieval method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010345947.1A CN113641783A (en) | 2020-04-27 | 2020-04-27 | Key sentence based content block retrieval method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113641783A (en) | 2021-11-12 |
Family
ID=78415214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010345947.1A Pending CN113641783A (en) | 2020-04-27 | 2020-04-27 | Key sentence based content block retrieval method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113641783A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050108266A1 (en) * | 2003-11-14 | 2005-05-19 | Microsoft Corporation | Method and apparatus for browsing document content |
CN101377777A (en) * | 2007-09-03 | 2009-03-04 | 北京百问百答网络技术有限公司 | Automatic inquiring and answering method and system |
CN102063474A (en) * | 2010-12-16 | 2011-05-18 | 西北工业大学 | Semantic relevance-based XML (Extensive Makeup Language) keyword top-k inquiring method |
CN105005562A (en) * | 2014-04-15 | 2015-10-28 | 索意互动(北京)信息技术有限公司 | Retrieval result display processing method and apparatus |
CN105488151A (en) * | 2015-11-27 | 2016-04-13 | 小米科技有限责任公司 | Reference document recommendation method and apparatus |
WO2017098341A1 (en) * | 2015-12-08 | 2017-06-15 | Kumar Damnish | System and method of content tagging and indexing |
CN108733766A (en) * | 2018-04-17 | 2018-11-02 | 腾讯科技(深圳)有限公司 | A kind of data query method, apparatus and readable medium |
CN109219811A (en) * | 2016-05-23 | 2019-01-15 | 微软技术许可有限责任公司 | Relevant paragraph searching system |
CN110134760A (en) * | 2019-05-17 | 2019-08-16 | 北京思维造物信息科技股份有限公司 | A kind of searching method, device, equipment and medium |
CN110162764A (en) * | 2018-02-12 | 2019-08-23 | 北京庖丁科技有限公司 | Method for splitting, device, equipment and the medium of electronic document |
CN110263345A (en) * | 2019-06-26 | 2019-09-20 | 北京百度网讯科技有限公司 | Keyword extracting method, device and storage medium |
CN110532834A (en) * | 2018-05-24 | 2019-12-03 | 北京庖丁科技有限公司 | Table extracting method, device, equipment and medium based on rich text format document |
Non-Patent Citations (2)
Title |
---|
付鸿鹄;张晓林;: "基于段落检索和段落内容分析的知识化检索系统设计", 情报理论与实践, no. 05, pages 109 - 113 * |
付鸿鹄;张晓林;: "段落检索及其相关算法研究", 现代图书情报技术, no. 02, pages 44 - 48 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107229668B (en) | Text extraction method based on keyword matching | |
CN101404015B (en) | Automatically generating a hierarchy of terms | |
CN111767716B (en) | Method and device for determining enterprise multi-level industry information and computer equipment | |
US7756859B2 (en) | Multi-segment string search | |
CN103678412B (en) | A kind of method and device of file retrieval | |
US20020021838A1 (en) | Adaptively weighted, partitioned context edit distance string matching | |
US20080147642A1 (en) | System for discovering data artifacts in an on-line data object | |
CN103678576A (en) | Full-text retrieval system based on dynamic semantic analysis | |
US20080147588A1 (en) | Method for discovering data artifacts in an on-line data object | |
US20080147641A1 (en) | Method for prioritizing search results retrieved in response to a computerized search query | |
CN101799802A (en) | Method and system for extracting entity relationship by using structural information | |
CN106649557A (en) | Semantic association mining method for defect report and mail list | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
CN113962293A (en) | LightGBM classification and representation learning-based name disambiguation method and system | |
CN112989208A (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN114756733A (en) | Similar document searching method and device, electronic equipment and storage medium | |
CN113642320A (en) | Method, device, equipment and medium for extracting document directory structure | |
CN105404677A (en) | Tree structure based retrieval method | |
WO2016099422A2 (en) | Content sensitive document ranking method by analyzing the citation contexts | |
CN115982390B (en) | Industrial chain construction and iterative expansion development method | |
CN111008285B (en) | Author disambiguation method based on thesis key attribute network | |
CN117235199A (en) | Information intelligent matching retrieval method based on document tree | |
CN111985212A (en) | Text keyword recognition method and device, computer equipment and readable storage medium | |
CN105426490A (en) | Tree structure based indexing method | |
CN113392189B (en) | News text processing method based on automatic word segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||