CN105512335B - Abstract searching method and device
- Publication number
- CN105512335B (application CN201511009856.6A)
- Authority
- CN
- China
- Prior art keywords
- window
- sentences
- sentence
- quasi
- abstract
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention discloses an abstract searching method and device. The method comprises: performing sentence division on a preset document to obtain a plurality of sentences; statically scoring the sentences according to their weights in the preset document to obtain a static score corresponding to each sentence; generating a static abstract of the preset document, wherein the static abstract comprises the plurality of sentences and the static score corresponding to each sentence; receiving a search request sent by a terminal, wherein the search request carries a search word; and outputting a corresponding abstract according to the search word and the static abstract.
Description
Technical Field
The invention relates to the technical field of data search, and in particular to an abstract searching method and device.
Background
Vertical search technology is search specialized for a particular industry or profession; it is an extension and application subdivision of general search.
Vertical search technology mainly involves a vertical search abstract scheme. The existing scheme is as follows: obtain the user's search word, perform word segmentation on the search word, match words in the document against the search word and the word segmentation results, mark the matched words in the document in red, select the sentence with the highest red-mark coverage as the abstract, and output the abstract.
In the course of research on and practice of the prior art, the inventor of the present invention found that the existing vertical search abstract scheme outputs the abstract slowly and that the quality of the output abstract is poor.
Disclosure of Invention
The embodiment of the invention provides an abstract searching method and device, which can solve the technical problems that the existing vertical search abstract scheme outputs the abstract slowly and that the quality of the output abstract is poor.
The embodiment of the invention provides an abstract searching method, which comprises the following steps:
carrying out sentence division on a preset document to obtain a plurality of sentences, and carrying out static scoring on the sentences according to the weight of the sentences in the preset document to obtain a static score corresponding to each sentence;
generating a static abstract of the preset document, wherein the static abstract comprises the sentences and a static score corresponding to each sentence;
receiving a search request sent by a terminal, wherein the search request carries a search word;
and outputting a corresponding abstract according to the search term and the static abstract.
Correspondingly, the embodiment of the present invention further provides an abstract searching device, including:
the sentence processing module is used for carrying out sentence division on a preset document to obtain a plurality of sentences, and statically scoring the sentences according to the weight of the sentences in the preset document to obtain a static score corresponding to each sentence;
the abstract generating module is used for generating a static abstract of the preset document, wherein the static abstract comprises the sentences and a static score corresponding to each sentence;
the receiving module is used for receiving a search request sent by a terminal, and the search request carries a search term;
and the output module is used for outputting the corresponding abstract according to the search term and the static abstract.
In the embodiment of the present invention, sentence division is performed on a preset document to obtain a plurality of sentences, and the sentences are statically scored according to their weights in the preset document to obtain a static score corresponding to each sentence; a static abstract of the preset document is then generated, which comprises the plurality of sentences and the static score corresponding to each sentence; a search request carrying a search word is received from a terminal; and a corresponding abstract is output according to the search word and the static abstract. Because this scheme divides the document into sentences and scores them according to their weights in the document before any search takes place, matching search words against sentences during vertical abstract search is faster, so the abstract is output faster than in the prior art. Moreover, because the sentences are scored in advance, which expresses their importance in the document, the abstract can be output according to sentence importance during vertical abstract search, so the quality of the output abstract can be improved compared with the prior art.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an abstract searching method according to the first embodiment of the present invention;
FIG. 2 is a flowchart of outputting an abstract according to the second embodiment of the present invention;
FIG. 3 is a flowchart of window selection according to the second embodiment of the present invention;
FIG. 4 is a flowchart of window marking according to the second embodiment of the present invention;
FIG. 5a is a schematic structural diagram of an abstract searching device according to the third embodiment of the present invention;
FIG. 5b is a schematic structural diagram of a sentence processing module according to the third embodiment of the present invention;
FIG. 5c is a schematic structural diagram of another abstract searching device according to the third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the present invention, rather than all of them.
The embodiment of the invention provides an abstract searching method and device.
Embodiment One
The present embodiment provides an abstract searching method, which may be implemented by an abstract searching device; the device may be integrated in a server or in other equipment that requires abstract searching.
As shown in fig. 1, the specific process of the abstract search method may be as follows:
101. Perform sentence division on a preset document to obtain a plurality of sentences, and statically score the sentences according to their weights in the preset document to obtain a static score corresponding to each sentence.
Specifically, the preset document can be divided into sentences in various ways; for example, sentences may be divided according to the punctuation in the document. Preferably, after the division, sentences shorter than a first preset length may be merged and sentences longer than a second preset length may be segmented, which ensures that an abstract of suitable length can be output.
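The division, merging and segmentation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_sentences`, the parameters `min_len` and `max_len` (standing in for the first and second preset lengths) and their default values are assumptions, and cutting long sentences at fixed character offsets is a simplification.

```python
import re


def split_sentences(text, min_len=10, max_len=60):
    """Split on sentence-ending punctuation, merge short pieces, cut long ones.

    min_len and max_len are hypothetical stand-ins for the first and second
    preset lengths mentioned in the text.
    """
    # Split after ASCII or full-width sentence-ending punctuation, keeping it.
    pieces = [p.strip() for p in re.split(r"(?<=[.!?。！？])\s*", text) if p.strip()]

    merged = []
    for piece in pieces:
        # While the previous sentence is too short, append the next piece to it.
        if merged and len(merged[-1]) < min_len:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    # A short final sentence is merged back into its predecessor.
    if len(merged) > 1 and len(merged[-1]) < min_len:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail

    result = []
    for sentence in merged:
        # Cut over-long sentences into chunks of at most max_len characters.
        for start in range(0, len(sentence), max_len):
            result.append(sentence[start:start + max_len])
    return result
```

A production system would more likely cut long sentences at clause boundaries rather than at a fixed offset.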
In this embodiment, the preset document may include web page content, such as a certain shopping web page content, which may include text content.
Specifically, the step of "statically scoring sentences according to their weights in the preset document" may include:
calculating the weight of the sentence in the preset document according to the position of the sentence in the preset document, the result of the sentence hitting the preset sentence template and the quality of words contained in the sentence;
and statically scoring the sentences according to the weight of the sentences in the preset document.
For example, the position of the sentence in the preset document may include the position of the paragraph in which the sentence is located (the first paragraph, the last paragraph, the middle of the document, and so on) and the position of the sentence within that paragraph (the head, the tail, the middle of the paragraph, and so on). In practical applications, a sentence in the first paragraph, at the head of a paragraph, or at a similar position is given a larger weight.
In practical applications, templates for sentences of higher quality or importance may be preset; the result of matching a sentence against the preset sentence templates is either that a template is hit or that none is hit.
Preferably, in the present embodiment, the larger the weight of the sentence in the document, the higher its static score.
For example, when a sentence is in the first paragraph of a document, is a preset interrogative sentence, and is composed of a plurality of compound words, a first weight corresponding to being in the first paragraph, a second weight corresponding to being a preset interrogative sentence, and a third weight corresponding to being composed of a plurality of compound words are obtained. The weight of the sentence in the document is then calculated from the first, second and third weights (for example, as their average), and the sentence is statically scored according to that weight; for example, the sentence is scored 80 points when the weight is 80%. The rules for scoring a sentence according to its weight are various and can be set according to actual requirements.
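The averaging example above can be written as a short sketch. The function name `static_score`, the three argument names and the 0 to 100 score scale are assumptions made for illustration; the text only says that the weights may be averaged and that a weight of 80% may map to a score of 80.

```python
def static_score(position_weight, template_weight, word_weight):
    """Average three weights in [0, 1] and map the result to a 0-100 score.

    The three arguments are hypothetical names for the first, second and
    third weights of the example (position, template hit, word quality).
    """
    weights = (position_weight, template_weight, word_weight)
    weight = sum(weights) / len(weights)
    # Assumed scoring rule: a weight of 80% yields a score of 80.
    return round(weight * 100)
```

Other rules, such as a weighted sum with per-service coefficients, would fit the same description equally well.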
In practical applications, some services do not need to provide a vertical search abstract and therefore do not need to configure an abstract field; only services that provide a vertical search abstract need to configure one. A service can configure its abstract field on the page according to its service characteristics. Thus, in the present embodiment, before sentence division is performed on the preset document, the method may further include: judging whether the current service is configured with an abstract field; if so, performing the sentence division and static scoring steps on the preset document corresponding to the service; if not, generating the original document data in JSON format.
102. Generate a static abstract of the preset document, wherein the static abstract comprises the plurality of sentences and the static score corresponding to each sentence.
In practical applications, the static abstract may further include attribute information of each sentence, such as the type, length and number of the sentence.
The static abstract can be structured data comprising a sentence layer and a sentence domain layer. The sentence layer comprises the content of the plurality of sentences and the attribute information of each sentence. The sentence domain layer comprises a plurality of sentence domains, each corresponding to one or more sentence layers. Subsequent window selection is carried out on the sentence domain layer, so that a single window does not cut across the semantic unit of a sentence.
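One possible shape for such structured data is sketched below. The class names, field names and the idea that a sentence domain lists the numbers of the sentences it groups are all assumptions; the text does not specify a concrete layout or serialization.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Sentence:
    """One entry of the sentence layer: content plus attribute information."""
    number: int
    text: str
    sentence_type: str
    static_score: int
    length: int = field(init=False)

    def __post_init__(self):
        # The length attribute is derived from the sentence content.
        self.length = len(self.text)


@dataclass
class SentenceDomain:
    """One entry of the sentence domain layer (hypothetical layout)."""
    name: str
    sentence_numbers: list


@dataclass
class StaticAbstract:
    sentences: list
    domains: list


def to_json(abstract):
    """Serialize the static abstract; JSON is an assumed storage format."""
    return json.dumps(asdict(abstract), ensure_ascii=False)


example = StaticAbstract(
    sentences=[
        Sentence(0, "Hello", "declarative", 80),
        Sentence(1, "Why?", "interrogative", 60),
    ],
    domains=[SentenceDomain("body", [0, 1])],
)
```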
103. Receive a search request sent by a terminal, wherein the search request carries a search word.
For example, the background server may receive a search request sent by the terminal through a wireless network.
Taking news search on a Tencent news webpage as an example, the terminal can receive a news search word entered by the user through an input interface provided by the webpage and send a search request carrying that search word to the corresponding background server.
104. Output a corresponding abstract according to the search word and the static abstract.
Specifically, the search word may be matched against the sentences in the static abstract to obtain a matching result; a corresponding abstract may then be output based on the matching result, the static scores of the sentences, and the like.
As can be seen from the above, the embodiment of the present invention performs sentence division on a preset document to obtain a plurality of sentences, statically scores the sentences according to their weights in the preset document to obtain a static score corresponding to each sentence, generates a static abstract of the preset document comprising the plurality of sentences and the static score corresponding to each sentence, receives a search request carrying a search word from a terminal, and outputs a corresponding abstract according to the search word and the static abstract. Because this scheme divides the document into sentences and scores them according to their weights in the document before any search takes place, matching search words against sentences during vertical abstract search is faster, so the abstract is output faster than in the prior art. Moreover, because the sentences are scored in advance, which expresses their importance in the document, the abstract can be output according to sentence importance during vertical abstract search, so the quality of the output abstract can be improved compared with the prior art.
Embodiment Two
This embodiment gives further description on the basis of Embodiment One. Optionally, in order to improve the quality of the output abstract and to prevent hit words from being scattered across many sentences, which would degrade the marking quality of the abstract, the step of "outputting the corresponding abstract according to the search word and the static abstract" may include, with reference to FIG. 2:
201. Select one sentence from the sentences of the static abstract as the central sentence according to the matching result of the search word and the sentences, the static score corresponding to each sentence, the position of each sentence in the preset document, and the sentence type of each sentence.
Specifically, the search word may be matched against the sentences; for example, word segmentation may be performed on the search word, and the search word and its word segmentation results are then matched against the sentences in the static abstract. That is, the matching result of the search word and a sentence includes the result of matching the search word itself against the sentence and the result of matching the word segmentation results of the search word against the sentence.
The position of the sentence in the preset document comprises: the position of the paragraph in which the sentence is located in the document, the position of the sentence in the paragraph, and the like; the sentence type corresponding to the sentence comprises: interrogative sentences, exclamatory sentences, declarative sentences, etc.
Preferably, step 201 may specifically include:
dynamically scoring the sentences in the static abstract according to the matching result of the search word and the sentences, the static scores corresponding to the sentences, the positions of the sentences in the preset document and the sentence types corresponding to the sentences, to obtain a dynamic score corresponding to each sentence;
selecting one sentence from the sentences as the central sentence according to the dynamic scores, for example the sentence with the highest dynamic score.
For example, suppose the search word is "China", the sentence is "I love the people of China", and the static score of the sentence is 70 points. Search word matching shows that the sentence hits the search word; a lookup shows that its static score is 70 points; the sentence is the first sentence of its paragraph; and its type is exclamatory. The weight of the static score, the weight of being a first sentence, and the weight of being an exclamatory sentence are then obtained, and the sentence is scored according to these weights. Here the sentence can be judged to be an important sentence, and its score can be set to more than 80 points. The way of scoring a sentence according to the weights can be set according to actual needs; for example, the average of the weights is computed and the sentence is scored according to that average.
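A minimal sketch of dynamic sentence scoring and central sentence selection follows. The dictionary keys, the factor values and the mixing coefficients in `weights` are all hypothetical; the text leaves the scoring rule open and only names the four factors (search word hit, static score, position, sentence type).

```python
def dynamic_sentence_score(sentence, terms, weights=(0.4, 0.3, 0.2, 0.1)):
    """Combine hit, static score, position and sentence type into a 0-100 score.

    `sentence` is a dict with the assumed keys 'text', 'static_score',
    'is_first_in_paragraph' and 'type'. All coefficients are illustrative.
    """
    hit = 1.0 if any(term in sentence["text"] for term in terms) else 0.0
    static = sentence["static_score"] / 100
    position = 1.0 if sentence["is_first_in_paragraph"] else 0.5
    type_factor = {"interrogative": 0.9, "exclamatory": 0.8}.get(sentence["type"], 0.6)
    w_hit, w_static, w_position, w_type = weights
    return 100 * (w_hit * hit + w_static * static
                  + w_position * position + w_type * type_factor)


def pick_central_sentence(sentences, terms):
    """Return the sentence with the highest dynamic score."""
    return max(sentences, key=lambda s: dynamic_sentence_score(s, terms))
```

With these assumed coefficients, a first-in-paragraph exclamatory sentence that hits the search word and has a static score of 70 scores above 80, in line with the example in the text.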
202. Perform window expansion around the central sentence to obtain a plurality of windows.
Each window includes the central sentence and at least one other sentence, where the other sentences are sentences of the static abstract other than the central sentence. For example, a window may include the central sentence and two other sentences; the number of sentences in a window may be set according to actual requirements.
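Window expansion can be sketched as enumerating index ranges around the central sentence. The assumption that a window is a contiguous run of sentences, the function name `expand_windows` and the `max_size` parameter are illustrative choices not stated in the text.

```python
def expand_windows(center, total, max_size=3):
    """List (start, end) ranges, end exclusive, that contain the central sentence.

    Every window holds the sentence at index `center` plus at least one
    neighbour; windows are assumed to be contiguous runs of sentences.
    """
    windows = []
    for size in range(2, max_size + 1):
        # Slide a window of this size over every offset that still covers center.
        for start in range(center - size + 1, center + 1):
            end = start + size
            if start >= 0 and end <= total:
                windows.append((start, end))
    return windows
```

For a document of five sentences with the central sentence at index 2, `expand_windows(2, 5)` yields two windows of two sentences and three windows of three sentences.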
203. Select a target window from the plurality of windows as the abstract, mark the sentences in the target window according to the search word, and output the target window.
Specifically, the window may be dynamically scored according to a dynamic score corresponding to a sentence in the window, a length of the window, a position of a starting sentence in the window in a preset document, and a position of a hit sentence in the window, so as to obtain a dynamic score corresponding to each window, where the hit sentence is a sentence hit by a search term in the window; and then, selecting a target window as the abstract from the plurality of windows according to the dynamic score corresponding to the window. For example, after each window is dynamically scored, the window with the highest dynamic score is selected as the target window.
The length of the window can be obtained from the lengths of the sentences it contains, and the position in the document of the starting sentence of the window can include the position of the first sentence of the window within its paragraph, the position of the first sentence of the window within the document, and the like.
In this embodiment, there are various ways to mark sentences in the target window, such as marking them with a color (for example, marking in red).
For example, a window consisting of sentence a and the central sentence b may be scored as follows: obtain the static scores of sentences a and b; obtain the length of the window from the lengths of sentences a and b (for example, as their sum); determine that sentence a is the starting sentence of the window and obtain its position in the document; and, when the search word hits sentence b, obtain the position of sentence b in the window. The window is then dynamically scored according to the static scores of sentences a and b, the length of the window, the position of sentence a in the document, and the position of sentence b in the window; for example, the weight of the window is calculated from the corresponding static score weight, sentence hit weight and window starting sentence weight, and the window is then dynamically scored according to that weight.
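The window scoring example can be sketched as below. The dictionary keys, the `target_len` parameter, the way each factor is normalized and the mixing coefficients are assumptions; the text names the factors (sentence scores, window length, position of the starting sentence, position of the hit sentence) but not how they are combined.

```python
def window_score(window, terms, target_len=80, weights=(0.5, 0.2, 0.15, 0.15)):
    """Score a window, given as a list of sentence dicts, on a 0-100 scale.

    Each sentence dict has the assumed keys 'text', 'score' (0-100) and
    'doc_index' (position of the sentence in the document).
    """
    average = sum(s["score"] for s in window) / len(window) / 100
    length = sum(len(s["text"]) for s in window)
    # Windows close to the assumed target length score higher.
    length_fit = max(0.0, 1 - abs(length - target_len) / target_len)
    # Windows that start early in the document score higher.
    start_factor = 1 / (1 + window[0]["doc_index"])
    # Windows whose first hit sentence comes early score higher.
    hits = [i for i, s in enumerate(window) if any(t in s["text"] for t in terms)]
    hit_factor = 1 / (1 + hits[0]) if hits else 0.0
    w_score, w_length, w_start, w_hit = weights
    return 100 * (w_score * average + w_length * length_fit
                  + w_start * start_factor + w_hit * hit_factor)
```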
In order to output an abstract that meets the specification or requirements and to improve user experience, in the method of this embodiment, after a target window is selected and before it is output, it is judged whether the target window meets a preset window condition; if so, the window is output, and if not, double-window selection is performed. That is, after the sentences in the target window are marked according to the search word and before the target window is output, the method of this embodiment may further include:
judging whether the target window meets a preset window condition, and if so, executing the step of outputting the target window;
if not, combining the windows two by two to obtain a plurality of first combined windows (namely a plurality of double windows);
dynamically scoring the first combined windows to obtain a dynamic score corresponding to each first combined window;
selecting a target window from the plurality of first combined windows as the abstract according to the dynamic scores of the first combined windows, for example selecting the first combined window with the highest score as the target window;
judging whether the target window meets the preset window condition, and if so, outputting the target window.
To ensure the length of the summary and the mark coverage of the summary, the preset window condition may optionally include: the length of the target window is within a preset length range, and the mark coverage of the target window is within a preset mark coverage range. The preset length range and the preset mark coverage range can be set according to actual requirements.
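The preset window condition can be sketched as a pair of range checks. Defining mark coverage as the fraction of characters covered by a hit, and the default ranges shown, are assumptions; the text says only that both ranges can be set according to actual requirements.

```python
def mark_coverage(text, terms):
    """Fraction of characters in `text` covered by at least one term occurrence.

    This character-based definition of mark coverage is an assumption.
    """
    if not text:
        return 0.0
    covered = [False] * len(text)
    for term in terms:
        if not term:
            continue
        start = text.find(term)
        while start != -1:
            for i in range(start, start + len(term)):
                covered[i] = True
            start = text.find(term, start + 1)
    return sum(covered) / len(text)


def meets_window_condition(text, terms,
                           length_range=(20, 120), coverage_range=(0.05, 0.5)):
    """Check the window length and mark coverage against hypothetical ranges."""
    min_len, max_len = length_range
    min_cov, max_cov = coverage_range
    return (min_len <= len(text) <= max_len
            and min_cov <= mark_coverage(text, terms) <= max_cov)
```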
There are various ways to dynamically score a first combined window; for example, it may be scored according to its length, its search word hit coverage, and the dynamic scores of its first windows, where a first window is a window participating in the first combined window.
For example, when a first combined window is composed of window a and window b, the length of the first combined window is obtained from the lengths of window a and window b, the coverage of search word hits in the first combined window is obtained, the dynamic scores of window a and window b are obtained, and the first combined window is then dynamically scored according to the information obtained.
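A sketch of scoring a combined window from the three factors named above is given here. The dictionary keys, the `target_len` parameter and the coefficients are hypothetical; the same function would apply unchanged to a window built from three single windows.

```python
def combined_window_score(windows, coverage, target_len=100,
                          weights=(0.3, 0.3, 0.4)):
    """Score a combined window on a 0-100 scale.

    `windows` lists the participating windows as dicts with the assumed keys
    'length' and 'score' (0-100); `coverage` is the search word hit coverage
    of the combined window in [0, 1].
    """
    length = sum(w["length"] for w in windows)
    # Combined windows close to the assumed target length score higher.
    length_fit = max(0.0, 1 - abs(length - target_len) / target_len)
    average = sum(w["score"] for w in windows) / len(windows) / 100
    w_length, w_coverage, w_score = weights
    return 100 * (w_length * length_fit + w_coverage * coverage + w_score * average)
```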
Optionally, in order to output a required abstract, such as an abstract of a fixed length or mark coverage, this embodiment may further perform triple-window selection after determining that the target window selected from the plurality of first combined windows does not satisfy the preset window condition. Specifically, the triple-window selection process is as follows:
combining the first combined windows and second windows two by two to obtain a plurality of second combined windows (namely triple windows), wherein a second window is a window, among the plurality of windows, different from the first windows;
dynamically scoring the second combined windows to obtain a dynamic score corresponding to each second combined window;
selecting a target window from the plurality of second combined windows as the abstract according to the dynamic scores of the second combined windows, marking the sentences in the target window according to the search word, and outputting the target window.
According to the above description, referring to fig. 3, the specific process of window selection in this embodiment may include the following steps:
301. Select a central sentence from the plurality of sentences of the static abstract.
For the process of selecting the central sentence, refer to the description above.
302. Perform window expansion around the central sentence to obtain k single windows.
After a single window is expanded, it can be judged whether its length is greater than a preset length value; if so, the single window is segmented so that the output abstract is not too long.
303. Dynamically score the k single windows to obtain the dynamic score of each single window.
Here k is a positive integer greater than 2; for the factors considered in dynamic scoring, refer to the description above.
304. Select the single window with the highest dynamic score as the target window, mark the sentences in the target window according to the search word, and take the target window as the abstract.
305. Judge whether the length and the mark coverage of the target window meet the preset conditions; if not, execute step 306, and if so, go to step 311.
306. Combine the k single windows two by two to obtain a plurality of double windows.
307. Dynamically score the double windows to obtain the dynamic score of each double window, select the double window with the highest score as the target window, and mark the sentences in the target window according to the search word.
308. Judge whether the length and the mark coverage of the target window meet the preset conditions; if not, execute step 309, and if so, go to step 311.
309. Combine the double windows and first single windows two by two to obtain a plurality of triple windows, wherein a first single window is a single window different from the windows participating in the double window.
310. Dynamically score the triple windows to obtain the dynamic score of each triple window, select the triple window with the highest dynamic score as the target window, and mark the sentences in the target window according to the search word.
Specifically, there are various ways to dynamically score a triple window; for example, it may be scored according to its length, its search word hit coverage, and the dynamic scores of the double window and the single window participating in it. Dynamic scoring of triple windows may be similar to that of double windows.
For example, when a triple window is composed of double window c and single window d, the length of the triple window is obtained from the lengths of double window c and single window d, the coverage of search word hits in the triple window is obtained, the dynamic scores of double window c and single window d are obtained, and the triple window is then dynamically scored according to the information obtained.
311. Output the target window.
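The single, double and triple window flow of steps 303 to 311 can be sketched as a cascade. The function name and the two callbacks `score` and `is_acceptable` are hypothetical; the sketch also assumes that the triple windows are all combinations of three distinct single windows, which is one reading of step 309.

```python
from itertools import combinations


def select_target_window(single_windows, score, is_acceptable):
    """Return a tuple of one, two or three windows chosen as the abstract.

    `score` and `is_acceptable` are caller-supplied functions that take a
    tuple of windows; they stand in for dynamic scoring and for the preset
    length and mark coverage conditions.
    """
    # Steps 303-305: best single window, accepted if it meets the conditions.
    best = max(((w,) for w in single_windows), key=score)
    if is_acceptable(best):
        return best
    # Steps 306-308: combine single windows two by two into double windows.
    doubles = list(combinations(single_windows, 2))
    if not doubles:
        return best
    best = max(doubles, key=score)
    if is_acceptable(best):
        return best
    # Steps 309-311: each triple is a double window plus one other single window.
    triples = list(combinations(single_windows, 3))
    if not triples:
        return best
    return max(triples, key=score)
```

Because each stage runs only when the previous one fails the condition, the more expensive combinations are enumerated only when needed.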
Optionally, to prevent mark flooding, such as red-mark flooding, the marking of sentences in this embodiment may be divided into a quasi-marking process and an unmarking process: the words to be marked are determined in the quasi-marking process, and the marking qualification of some of those words is removed in the unmarking process. Specifically, referring to FIG. 4, the step of "marking the sentences in the target window according to the search word" may include:
401. Match the search word against the sentences in the target window to obtain a matching result.
Specifically, the search word may be segmented to obtain a word segmentation result, and the segmentation result is then matched against the words in the target window one by one. On the one hand, this avoids wrongly marking non-word units hit by the search word; on the other hand, it avoids missed or wrong marks caused by the text and the search word being segmented inconsistently.
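The benefit of matching on word units rather than raw substrings can be illustrated with a small sketch. A regular-expression tokenizer stands in for a real word segmenter here, which is an assumption; Chinese text would need a dedicated segmentation tool. The function name `match_word_units` is hypothetical.

```python
import re


def match_word_units(query, sentence):
    """Return (start, end) spans of sentence words that equal a query word.

    Both the query and the sentence are split into words with the same rule,
    so a query word never matches inside a longer word of the sentence.
    """
    query_words = {m.group().lower() for m in re.finditer(r"\w+", query)}
    return [
        (m.start(), m.end())
        for m in re.finditer(r"\w+", sentence)
        if m.group().lower() in query_words
    ]
```

For the query "art" and the sentence "The art of cartography", only the standalone word is matched; a plain substring search would also hit the "art" inside "cartography".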
402. Determine the words to be marked in the target window according to the matching result to obtain a first quasi-marked word set, wherein the first quasi-marked word set comprises a plurality of quasi-marked words, and a quasi-marked word is a word to be marked.
403. Analyze the search word to obtain an analysis result, and obtain a quasi-unmarked word set according to the analysis result, wherein the quasi-unmarked word set comprises at least one quasi-unmarked word.
The quasi-unmarked word set may be presented in tabular form. Quasi-unmarked words are words that need not be strongly marked, or that should not be marked when hit alone, and may include the following:
(1) Stop words. Words such as "a," "an," "the," etc. should not be marked on their own unless they are contiguous with other strings. Recognition of stop words is aided by a static stop word vocabulary.
(2) Punctuation and symbols. Their identification is aided by a static punctuation and symbol table.
(3) Non-essential words in the search word.
(4) Compound word substrings. For compound words, marking the whole string is preferred; when the whole string is marked, its substrings do not need to be marked.
In practical applications, not all hit words need to be marked, so this embodiment can remove the marking qualification of some words to be marked according to the quasi-unmarked word set, thereby preventing mark flooding.
404. Remove the corresponding quasi-marked words from the first quasi-marked word set according to the quasi-unmarked word set to obtain a second quasi-marked word set.
In practical applications, it is not absolute whether hits should be tagged or not, depending on the specific tagging situation, even stop words, should be tagged without other hits, therefore, the present implementation may set the tagging priority of each pseudo-tagged word, and then remove the tagging qualification according to the priority, specifically, this step 404 may include:
setting quasi-tagged word set with corresponding tag priority according to the quasi-tag word set;
removing the quasi-tagged words in the quasi-tagged word set according to the tagging priorities corresponding to the quasi-tagged words in the quasi-tagged word set.
For example, the pseudo-logograms may be divided into five levels (red for example) according to the pseudo-logogram set:
level-full term reproduction, such hits must be marked red whenever;
and (2) second stage: important words in compound words and search words;
third-stage: individual compound substrings, synonyms;
and (4) fourth stage: single words (the single words herein do not include important single words in the search term, such as "palace" in "palace download", which should belong to the second level);
and (5) fifth stage: stop words, punctuation marks and symbols (not including important symbols in the search word, if the search word is 'the' at this time, the book name and the number should belong to the second level).
Preferably, mark removal in this embodiment may proceed in order of priority from weakest to strongest, and at each level whether to remove marks may be decided by reference to the current mark density and the priority level. For example, with the priority division above, mark removal may include:
for fifth-level quasi-red-marked strings (quasi-red-marked words), if quasi-red-marked strings of a level below five exist, all fifth-level quasi-red-marked strings are de-marked, that is, removed from the first quasi-marked word set;
for fourth-level quasi-red-marked strings, if quasi-red-marked strings of a level below four exist, all fourth-level quasi-red-marked strings are de-marked;
for third-level quasi-red-marked strings, if quasi-red-marked strings of a level below three exist and their density reaches a certain threshold, the third-level quasi-red-marked strings are de-marked;
for first-level and second-level quasi-red-marked strings, no de-marking is performed.
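The level-by-level removal described above can be sketched as follows. This is only an illustrative reading of the example, not the embodiment's implementation: the `Hit` type, the function name `remove_by_priority`, the density definition (characters covered by stronger hits divided by window length) and the default threshold of 0.2 are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str   # the quasi-marked string
    level: int  # 1 (strongest) to 5 (weakest), as in the five-level example


def remove_by_priority(hits, window_length, density_threshold=0.2):
    """Return the hits that keep their marking qualification.

    Levels are processed from weakest (5) to strongest (3); levels 1 and 2
    are never de-marked. The density formula and threshold are assumptions.
    """
    kept = list(hits)
    for level in (5, 4, 3):
        stronger = [h for h in kept if h.level < level]
        if not stronger:
            continue  # no stronger hit exists, so this level stays marked
        if level == 3:
            # Level 3 is dropped only when stronger hits are dense enough.
            covered = sum(len(h.text) for h in stronger)
            if covered / max(window_length, 1) < density_threshold:
                continue
        kept = [h for h in kept if h.level != level]
    return kept
```

Note that a lone stop word (level 5) survives when nothing stronger was hit, which matches the remark that even stop words should be marked when there are no other hits.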
405. Mark the words in the target window according to the second quasi-marked word set.
For example, the words in the target window that belong to the second quasi-marked word set are marked red.
As can be seen from the above, in this embodiment of the present invention a central sentence is selected from the plurality of sentences of the static abstract, and window expansion is performed around the central sentence to obtain a plurality of single windows. A target window serving as the abstract is then selected from the single windows; when the selected target window does not satisfy the window condition, double-window selection is performed, and when the selected double window still does not satisfy the window condition, three-window selection is performed. Outputting the abstract according to the search term and the static abstract in this way can improve the quality of the output abstract, for example by ensuring the abstract length and the marking quality. In addition, this embodiment marks the abstract through a process of quasi-marking followed by de-marking, which prevents the abstract from being over-marked.
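The single-window, double-window, three-window fallback can be sketched as a small loop. This is a hedged illustration only: the function name `pick_summary`, the representation of a window as a tuple of sentence indices, and the caller-supplied `score` and `satisfies` callbacks are assumptions introduced here.

```python
from itertools import combinations


def pick_summary(windows, score, satisfies, max_combined=3):
    """Try single windows first, then combinations of two, then of three.

    windows: list of tuples of sentence indices (an assumed representation).
    score, satisfies: caller-supplied callables standing in for the dynamic
    window score and the preset window condition.
    Returns the best candidate of the last stage tried, or None if there
    are no windows.
    """
    best = None
    for size in range(1, max_combined + 1):
        candidates = [
            tuple(sorted(set().union(*combo)))
            for combo in combinations(windows, size)
        ]
        if not candidates:
            break
        best = max(candidates, key=score)
        if satisfies(best):
            return best
    return best
```

For instance, with `score=len` and a condition requiring at least four sentences, three two-sentence windows fail at the single-window stage and a pair of them is chosen at the double-window stage.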
Example III,
In order to better implement the above method, the embodiment of the present invention further provides an abstract searching device, which may include a sentence processing module 501, an abstract generating module 502, a receiving module 503, and a first output module 504, as shown in fig. 5a, as follows:
a sentence processing module 501, configured to perform sentence division on a preset document to obtain multiple sentences, and perform static scoring on the sentences according to weights of the sentences in the preset document to obtain a static score corresponding to each sentence;
a summary generating module 502, configured to generate a static summary of the preset document, where the static summary includes the multiple sentences and a static score corresponding to each sentence;
a receiving module 503, configured to receive a search request sent by a terminal, where the search request carries a search term;
a first output module 504, configured to output a corresponding abstract according to the search term and the static abstract.
For example, the sentence processing module 501 may be specifically configured to:
calculating the weight of the sentence in the preset document according to the position of the sentence in the preset document, the result of the sentence hitting a preset sentence template and the quality of words contained in the sentence;
and statically scoring the sentences according to the weight of the sentences in the preset document.
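A minimal sketch of this offline step is given below. The text names three inputs to the weight (position, template hit, word quality) but not how they are combined, so the linear combination, the weights `w_pos`, `w_tpl`, `w_word`, the default word quality of 0.5 and the punctuation-based sentence splitter are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence division on end punctuation (an assumed splitter)."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def static_score(index, total, sentence, templates, word_quality,
                 w_pos=0.4, w_tpl=0.3, w_word=0.3):
    """Combine position, template hit and word quality into one score."""
    position = 1.0 - index / total  # earlier sentences weigh more (assumption)
    template_hit = 1.0 if any(t.search(sentence) for t in templates) else 0.0
    words = re.findall(r"\w+", sentence.lower())
    quality = (sum(word_quality.get(w, 0.5) for w in words) / len(words)
               if words else 0.0)
    return w_pos * position + w_tpl * template_hit + w_word * quality


def build_static_summary(text, templates, word_quality):
    """Return the static abstract as a list of (sentence, static_score)."""
    sentences = split_sentences(text)
    total = len(sentences)
    return [(s, static_score(i, total, s, templates, word_quality))
            for i, s in enumerate(sentences)]
```

Because this runs before any search request arrives, its result can be stored alongside the document and reused for every query.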
For example, referring to fig. 5b, the first output module 504 may include:
a sentence selection submodule 5041, configured to select a sentence from the multiple sentences of the static abstract as the central sentence according to the matching results between the search term and the sentences, the static scores corresponding to the sentences, the positions of the sentences in the preset document, and the sentence types corresponding to the sentences;
a window expansion submodule 5042, configured to perform window expansion around the central sentence to obtain a plurality of windows, where each window includes the central sentence and at least one other sentence, the other sentences being sentences other than the central sentence among the multiple sentences;
a first output sub-module 5043, configured to select a target window from the multiple windows as the abstract, mark the sentences in the target window according to the search term, and output the target window.
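Window expansion around the central sentence might look like the sketch below. It assumes, beyond what the text states, that a window is a contiguous run of sentences, represented as an inclusive `(start, end)` index pair, and that at most `max_extra` neighbouring sentences are added; the function name is hypothetical.

```python
def expand_windows(num_sentences, center, max_extra=3):
    """Enumerate contiguous windows that contain the central sentence.

    Each window is an inclusive (start, end) pair holding the central
    sentence plus at least one other sentence. max_extra is an assumed cap
    on how many neighbouring sentences may be added in total.
    """
    windows = []
    for left in range(max_extra + 1):
        for right in range(max_extra + 1 - left):
            if left + right == 0:
                continue  # the central sentence alone is not a window
            start, end = center - left, center + right
            if start < 0 or end >= num_sentences:
                continue  # window would run past the document
            windows.append((start, end))
    return windows
```

With five sentences and the central sentence at index 2, `expand_windows(5, 2, max_extra=2)` yields five candidate windows, each of which includes index 2.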
Preferably, the sentence selection sub-module 5041 may be specifically configured to: dynamically score the sentences in the static abstract according to the matching results between the search term and the sentences, the static scores corresponding to the sentences, the positions of the sentences in the preset document and the sentence types corresponding to the sentences, to obtain the dynamic score corresponding to each sentence;
and select a sentence from the sentences as the central sentence according to the dynamic scores corresponding to the sentences.
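The query-time selection of the central sentence can be sketched as follows. The four inputs come from the text; the linear weights, the `type_bonus` table, the whitespace tokenisation and the function names are assumptions for illustration only.

```python
def dynamic_score(sentence, static, index, total, kind, query_terms,
                  type_bonus=None):
    """Score one sentence against the query (weights are assumptions)."""
    type_bonus = type_bonus or {"title": 0.2, "body": 0.0}
    words = set(sentence.lower().split())
    hits = sum(1 for t in query_terms if t.lower() in words)
    match = hits / len(query_terms) if query_terms else 0.0
    position = 1.0 - index / total  # earlier sentences score slightly higher
    return (0.5 * match + 0.3 * static + 0.1 * position
            + type_bonus.get(kind, 0.0))


def select_central(static_summary, query_terms):
    """Return the index of the sentence with the highest dynamic score.

    static_summary: non-empty list of (sentence, static_score, sentence_type).
    """
    total = len(static_summary)
    scores = [dynamic_score(s, st, i, total, kind, query_terms)
              for i, (s, st, kind) in enumerate(static_summary)]
    return max(range(total), key=scores.__getitem__)
```

Here a sentence with a modest static score can still become the central sentence when it matches the search term well, which is the point of scoring dynamically on top of the precomputed static scores.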
In order to ensure the quality of the output abstract, referring to fig. 5c, the abstract searching device in this embodiment may further include a first judging module 505 and a second output module 506;
a first judging module 505, configured to judge whether the target window meets a preset window condition after the first output sub-module marks the sentences in the target window according to the search term and before the target window is output;
a second output module 506, configured to, when the judgment result of the first judging module is negative, dynamically score the first combined windows, which are obtained by combining the windows two by two, to obtain a dynamic score corresponding to each first combined window, select the target window serving as the abstract from the first combined windows according to the dynamic scores corresponding to the first combined windows, judge whether the target window meets the preset window condition, and if so, output the target window;
wherein the first output sub-module 5043 may be specifically configured to:
dynamically scoring the window according to the dynamic score corresponding to the sentence in the window, the length of the window, the position of the initial sentence in the window in a preset document and the position of the hit sentence in the window to obtain the dynamic score corresponding to each window, wherein the hit sentence is the sentence hit to the search word in the window;
selecting the target window as an abstract from the plurality of windows according to the dynamic score corresponding to the window;
when the judgment result of the first judging module is yes, output the target window.
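The window scoring and the window-condition check might be sketched as below. The text lists the factors of the window score and, in the claims, defines the window condition as a length range plus a mark-coverage range; the concrete weights, the default ranges and the function names used here are assumptions.

```python
def window_score(window, sentence_scores, hit_flags, max_len=4):
    """Dynamic score for an inclusive (start, end) window of sentences."""
    start, end = window
    length = end - start + 1
    avg_sentence = sum(sentence_scores[start:end + 1]) / length
    length_term = min(length, max_len) / max_len
    start_term = 1.0 - start / len(sentence_scores)  # earlier start is better
    hit_offsets = [i - start for i in range(start, end + 1) if hit_flags[i]]
    # Reward windows whose first hit sentence appears near the beginning.
    hit_term = 1.0 - min(hit_offsets) / length if hit_offsets else 0.0
    return (0.5 * avg_sentence + 0.2 * length_term
            + 0.1 * start_term + 0.2 * hit_term)


def meets_window_condition(char_length, marked_chars,
                           length_range=(40, 200),
                           coverage_range=(0.05, 0.5)):
    """Check length and mark coverage against assumed preset ranges."""
    coverage = marked_chars / char_length if char_length else 0.0
    return (length_range[0] <= char_length <= length_range[1]
            and coverage_range[0] <= coverage <= coverage_range[1])
```

A caller would pick `max(windows, key=...)` using `window_score`, mark the chosen window, and output it only if `meets_window_condition` holds; otherwise it would fall back to combined windows.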
In order to prevent marks from flooding the abstract, the first output sub-module 5043 in this embodiment is specifically configured to:
selecting a target window as an abstract from the plurality of windows;
matching the search words with sentences in the target window to obtain matching results;
determining words to be marked in the target window according to the matching result to obtain a first quasi-marked word set, wherein the first quasi-marked word set comprises a plurality of quasi-marked words, and the quasi-marked words are words to be marked;
analyzing the search words to obtain an analysis result, and acquiring a quasi-unmarked word set according to the analysis result, wherein the quasi-unmarked word set comprises at least one quasi-unmarked word;
removing the corresponding quasi-marked words from the first quasi-marked word set according to the quasi-unmarked word set to obtain a second quasi-marked word set;
and marking words in the target window according to the second quasi-marked word set, and outputting the target window.
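The quasi-marking and de-marking steps listed above can be sketched end to end. This is a simplified illustration: matching is plain case-insensitive substring matching, the quasi-unmarked set is passed in rather than derived by analysing the search term, and the `<em>` tags stand in for red marking; all of these, and the function name, are assumptions.

```python
import re


def mark_window(window_text, query_terms, unmarked_terms,
                open_tag="<em>", close_tag="</em>"):
    """Mark hits in the window, skipping quasi-unmarked terms."""
    lowered = window_text.lower()
    # First quasi-marked set: every query term that hits the window.
    first_set = {t.lower() for t in query_terms if t.lower() in lowered}
    unmarked = {t.lower() for t in unmarked_terms}
    # Second quasi-marked set: drop quasi-unmarked terms, but keep them
    # when nothing else was hit (even stop words are marked in that case).
    second_set = (first_set - unmarked) or first_set
    if not second_set:
        return window_text
    # Longest terms first so whole strings win over their substrings.
    ordered = sorted(second_set, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered),
                         re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}",
                       window_text)
```

For example, searching "download the palace" against the window "Download the Palace client" with "the" in the quasi-unmarked set marks only "Download" and "Palace".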
In a specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily and implemented as one or several entities. For the specific implementation of the above modules, refer to the foregoing method embodiments; details are not repeated here.
The abstract searching device can be particularly integrated in a server or other equipment needing abstract searching.
As can be seen from the above, in the abstract searching device of this embodiment, the sentence processing module 501 performs sentence division on a preset document to obtain a plurality of sentences and statically scores the sentences according to their weights in the preset document, obtaining a static score corresponding to each sentence. The abstract generating module 502 then generates a static abstract of the preset document, which includes the plurality of sentences and the static score corresponding to each sentence. The receiving module 503 receives a search request sent by a terminal, the search request carrying a search term, and the first output module 504 outputs a corresponding abstract according to the search term and the static abstract. Because the scheme divides the document into sentences before the search, the search term can be matched against the sentences faster during a vertical search, so the abstract is output faster than in the prior art. Because the scheme also scores the sentences in advance according to their weights in the document, the quality of the output abstract is improved compared with the prior art.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium, which may include Read Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The abstract searching method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the above description of the embodiments is only intended to help in understanding the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be changes in the specific embodiments and the application scope according to the idea of the present invention. In conclusion, the content of this description should not be construed as limiting the present invention.
Claims (18)
1. An abstract searching method, comprising:
carrying out sentence division on a preset document to obtain a plurality of sentences, and carrying out static scoring on the sentences according to the weight of the sentences in the preset document to obtain a static score corresponding to each sentence;
generating a static abstract of the preset document, wherein the static abstract comprises the sentences and a static score corresponding to each sentence;
receiving a search request sent by a terminal, wherein the search request carries a search word;
selecting a sentence from the plurality of sentences of the static abstract as a central sentence according to the matching result of the search word and the sentences, the static scores corresponding to the sentences, the positions of the sentences in the preset document and the sentence types corresponding to the sentences;
performing window expansion around the central sentence to obtain a plurality of windows, wherein each window comprises the central sentence and at least one other sentence, and the other sentences are sentences other than the central sentence in the plurality of sentences;
and selecting a target window as an abstract from the plurality of windows, marking sentences in the target window according to the search word, and outputting the target window.
2. The abstract searching method according to claim 1, wherein the step of statically scoring the sentences according to the weights of the sentences in the preset documents specifically comprises:
calculating the weight of the sentence in the preset document according to the position of the sentence in the preset document, the result of the sentence hitting a preset sentence template and the quality of words contained in the sentence;
and statically scoring the sentences according to the weight of the sentences in the preset document.
3. The abstract searching method according to claim 1, wherein the step of selecting a sentence from the plurality of sentences of the static abstract as the central sentence according to the matching result of the search word and the sentences, the static scores corresponding to the sentences, the positions of the sentences in the preset document and the sentence types corresponding to the sentences specifically comprises:
according to the matching result of the search words and the sentences, the static scores corresponding to the sentences, the positions of the sentences in a preset document and the sentence types corresponding to the sentences, dynamically scoring the sentences in the static abstract to obtain the dynamic scores corresponding to the sentences;
and selecting a sentence from the sentences as the central sentence according to the dynamic scores corresponding to the sentences.
4. The abstract searching method according to claim 1, wherein the step of selecting a target window from the plurality of windows as the abstract specifically comprises:
dynamically scoring the window according to the dynamic score corresponding to the sentence in the window, the length of the window, the position of the initial sentence in the window in a preset document and the position of the hit sentence in the window to obtain the dynamic score corresponding to each window, wherein the hit sentence is the sentence hit to the search word in the window;
and selecting the target window as the abstract from the plurality of windows according to the dynamic score corresponding to the window.
5. The abstract searching method of claim 4, wherein after the sentence in the target window is labeled according to the search word and before the target window is output, the abstract searching method further comprises:
judging whether the target window meets a preset window condition, if so, executing a step of outputting the target window;
if not, combining the windows two by two to obtain a plurality of first combined windows;
dynamically scoring the first combined windows to obtain a dynamic score corresponding to each first combined window;
selecting the target window from the first combined windows as the abstract according to the dynamic scores corresponding to the first combined windows;
and judging whether the target window meets the preset window condition, if so, outputting the target window.
6. The abstract searching method of claim 5, wherein the step of dynamically scoring the first combined window specifically comprises:
dynamically scoring the first combined window according to the length of the first combined window, the search-word hit coverage of the first combined window and the dynamic score of the first window, wherein the first window is a window participating in the first combined window.
7. The abstract searching method of claim 5, wherein the preset window condition comprises: the length of the target window is within a preset length range, and the mark coverage of the target window is within a preset mark coverage range.
8. The abstract searching method of claim 6, wherein after determining that the target window selected from the first combined windows does not satisfy the preset window condition, the abstract searching method further comprises:
combining the first combined window and a second window in pairs to obtain a plurality of second combined windows, wherein the second window is a window, among the plurality of windows, different from the first window;
dynamically scoring the second combined windows to obtain a dynamic score corresponding to each second combined window;
and selecting a target window as an abstract from the plurality of second combined windows according to the dynamic scores of the second combined windows, marking sentences in the target window according to the search word, and outputting the target window.
9. The abstract searching method of claim 1, wherein the step of labeling the sentences in the target window according to the search term specifically comprises:
matching the search words with sentences in the target window to obtain matching results;
determining words to be marked in the target window according to the matching result to obtain a first quasi-marked word set, wherein the first quasi-marked word set comprises a plurality of quasi-marked words, and the quasi-marked words are words to be marked;
analyzing the search words to obtain an analysis result, and acquiring a quasi-unmarked word set according to the analysis result, wherein the quasi-unmarked word set comprises at least one quasi-unmarked word;
removing the corresponding quasi-marked words from the first quasi-marked word set according to the quasi-unmarked word set to obtain a second quasi-marked word set;
and marking the words in the target window according to the second quasi-marked word set.
10. The abstract searching method of claim 9, wherein the step of removing the corresponding quasi-marked words from the first quasi-marked word set according to the quasi-unmarked word set specifically comprises:
setting the marking priority corresponding to the quasi-marked words in the first quasi-marked word set according to the quasi-unmarked word set;
removing the quasi-marked words from the first quasi-marked word set according to the marking priorities corresponding to the quasi-marked words in the first quasi-marked word set.
11. An abstract searching device, comprising:
the sentence processing module is used for carrying out sentence division on a preset document to obtain a plurality of sentences, and statically scoring the sentences according to the weight of the sentences in the preset document to obtain a static score corresponding to each sentence;
the abstract generating module is used for generating a static abstract of the preset document, wherein the static abstract comprises the sentences and a static score corresponding to each sentence;
the receiving module is used for receiving a search request sent by a terminal, and the search request carries a search term;
a first output module, comprising:
a sentence selection submodule, configured to select a sentence from the multiple sentences of the static abstract as a central sentence according to the matching results between the search term and the sentences, the static scores corresponding to the sentences, the positions of the sentences in the preset document, and the sentence types corresponding to the sentences;
a window expansion submodule, configured to perform window expansion around the central sentence to obtain multiple windows, where each window includes the central sentence and at least one other sentence, and the other sentences are sentences other than the central sentence in the multiple sentences;
a first output sub-module, configured to select a target window from the multiple windows as an abstract, mark sentences in the target window according to the search term, and output the target window.
12. The apparatus for searching for an abstract according to claim 11, wherein the sentence processing module is specifically configured to:
calculating the weight of the sentence in the preset document according to the position of the sentence in the preset document, the result of the sentence hitting a preset sentence template and the quality of words contained in the sentence;
and statically scoring the sentences according to the weight of the sentences in the preset document.
13. The abstract searching device of claim 11, wherein the sentence selection submodule is specifically configured to:
according to the matching result of the search words and the sentences, the static scores corresponding to the sentences, the positions of the sentences in a preset document and the sentence types corresponding to the sentences, dynamically scoring the sentences in the static abstract to obtain the dynamic scores corresponding to the sentences;
and selecting a sentence from the sentences as the central sentence according to the dynamic scores corresponding to the sentences.
14. The abstract searching device of claim 11, further comprising a first judging module and a second output module;
the first judging module is configured to judge whether the target window meets a preset window condition after the first output sub-module marks the sentences in the target window according to the search term and before the target window is output;
the second output module is configured to, when the first judging module judges no, dynamically score the first combined windows to obtain a dynamic score corresponding to each first combined window, select the target window serving as the abstract from the first combined windows according to the dynamic scores corresponding to the first combined windows, judge whether the target window meets the preset window condition, and if so, output the target window;
the first output sub-module is specifically configured to:
dynamically scoring the window according to the dynamic score corresponding to the sentence in the window, the length of the window, the position of the initial sentence in the window in a preset document and the position of the hit sentence in the window to obtain the dynamic score corresponding to each window, wherein the hit sentence is the sentence hit to the search word in the window;
selecting the target window as an abstract from the plurality of windows according to the dynamic score corresponding to the window;
when the first judging module judges yes, outputting the target window.
15. The abstract searching device of claim 11, wherein the first output sub-module is specifically configured to:
selecting a target window as an abstract from the plurality of windows;
matching the search words with sentences in the target window to obtain matching results;
determining words to be marked in the target window according to the matching result to obtain a first quasi-marked word set, wherein the first quasi-marked word set comprises a plurality of quasi-marked words, and the quasi-marked words are words to be marked;
analyzing the search words to obtain an analysis result, and acquiring a quasi-unmarked word set according to the analysis result, wherein the quasi-unmarked word set comprises at least one quasi-unmarked word;
removing the corresponding quasi-marked words from the first quasi-marked word set according to the quasi-unmarked word set to obtain a second quasi-marked word set;
and marking words in the target window according to the second quasi-marked word set, and outputting the target window.
16. The abstract searching device of claim 15, wherein the preset window condition comprises: the length of the target window is within a preset length range, and the mark coverage of the target window is within a preset mark coverage range.
17. The abstract searching device of claim 15, wherein the first output sub-module is specifically configured to:
setting the marking priority corresponding to the quasi-marked words in the first quasi-marked word set according to the quasi-unmarked word set;
removing the quasi-marked words from the first quasi-marked word set according to the marking priorities corresponding to the quasi-marked words in the first quasi-marked word set.
18. A computer-readable storage medium storing a computer program for abstract searching, wherein the computer program causes a computer to perform the method of any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511009856.6A CN105512335B (en) | 2015-12-29 | 2015-12-29 | abstract searching method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105512335A (en) | 2016-04-20 |
CN105512335B (en) | 2020-01-31 |
Family
ID=55720315
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511009856.6A Active CN105512335B (en) | 2015-12-29 | 2015-12-29 | abstract searching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512335B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107342879B (en) * | 2016-04-29 | 2020-06-05 | 北京京东尚科信息技术有限公司 | Method, apparatus, and computer-readable storage medium for determining service evaluation requests to network users |
CN111241267B (en) * | 2020-01-10 | 2022-12-06 | 科大讯飞股份有限公司 | Abstract extraction and abstract extraction model training method, related device and storage medium |
CN111753043B (en) * | 2020-06-22 | 2024-04-16 | 北京百度网讯科技有限公司 | Document data processing method, device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101539923A (en) * | 2008-03-18 | 2009-09-23 | 北京搜狗科技发展有限公司 | Method and device for extracting text segment from file |
CN101620596A (en) * | 2008-06-30 | 2010-01-06 | 东北大学 | Multi-document auto-abstracting method facing to inquiry |
CN102375813A (en) * | 2010-08-09 | 2012-03-14 | 腾讯科技(深圳)有限公司 | Duplicate detection system and method for search engines |
KR101508260B1 (en) * | 2014-02-04 | 2015-04-07 | 성균관대학교산학협력단 | Summary generation apparatus and method reflecting document feature |
CN104699847A (en) * | 2015-02-13 | 2015-06-10 | 刘秀磊 | Method and device for extracting summaries from web pages |
Also Published As
Publication number | Publication date |
---|---|
CN105512335A (en) | 2016-04-20 |
Legal Events
Code | Title |
---|---|
C06 | Publication |
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |