CN110427482A - Method and related device for extracting target content - Google Patents
- Publication number
- CN110427482A (application CN201910716302.1A)
- Authority
- CN
- China
- Prior art keywords
- wonderful
- training
- text
- abstract
- ranking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention disclose a method and related device for extracting target content. The method comprises: first obtaining a training text; then determining training information of the training text; next determining, according to the training information, a highlight segment of each of a plurality of paragraphs; then training a first model to be trained according to the training information and the highlight segments, to obtain a summary coarse-extraction model; then determining, according to the summary coarse-extraction model, training data for a second model to be trained, so as to train a summary fine-ranking model; and finally determining the target content of a text to be processed according to the summary coarse-extraction model and the summary fine-ranking model. With the embodiments of the invention, the highlight content of books or long texts can be extracted automatically.
Description
Technical field
The invention relates to the technical field of natural language processing, and in particular to a method and related device for extracting target content.
Background
With the rapid development of Internet technology, people receive massive amounts of information every day. To obtain the needed information quickly from this mass, summary/highlight-content extraction has become an active research topic. The summary/highlight-content extraction methods currently in use include: (1) the unsupervised TextRank algorithm, the most common in industry, which is essentially a graph-based ranking algorithm; and (2) classic supervised extractive summarization algorithms such as the SummaRuNNer model, whose main idea is to cast extractive summarization as sequence labeling over sentences. However, first, unsupervised algorithms represented by TextRank can only take into account shallow semantic information between sentences, cannot avoid redundancy between sentences in the generated summary, and cannot make full use of surface-level features (such as reader behavioral features). Second, supervised summarization models represented by SummaRuNNer do not make full use of pre-trained information and cannot guarantee that the generated summary consists of mutually related segments. Third, neither class of method can directly use the type information of a text, and neither can be transferred directly to the task of extracting highlight content from long texts such as books.
Summary of the invention
The invention provides a method and related device for extracting target content, which can automatically extract the highlight content of books or long texts.
In a first aspect, an embodiment of the invention provides a method for extracting target content, comprising:
obtaining a first training text, the first training text being a long text whose length exceeds a preset threshold;
determining training information of the first training text, the training information comprising a plurality of paragraphs in the first training text, type information of the first training text, and behavioral features of readers of the first training text;
determining a highlight segment of each of the plurality of paragraphs according to the behavioral features;
inputting the type information, the plurality of paragraphs, and the highlight segments into a first model to be trained for training, to obtain a summary coarse-extraction model; and
determining target content of a text to be processed according to the summary coarse-extraction model.
Wherein determining the target content of the text to be processed according to the summary coarse-extraction model comprises:
determining training data for a second model to be trained according to the summary coarse-extraction model;
inputting the training data into the second model to be trained for training, to obtain a summary fine-ranking model; and
determining the target content according to the summary coarse-extraction model and the summary fine-ranking model.
Wherein determining the target content according to the summary coarse-extraction model and the summary fine-ranking model comprises:
inputting the text to be processed into the summary coarse-extraction model to obtain a plurality of candidate highlight segments;
determining, according to the summary fine-ranking model, a highlight-degree ranking of each of the plurality of candidate highlight segments; and
determining a target highlight segment among the plurality of candidate highlight segments according to the highlight-degree ranking, the target content comprising the target highlight segment.
Wherein the behavioral features comprise a reader comment count or a reader underline count;
determining the training data for the second model to be trained according to the summary coarse-extraction model comprises:
obtaining a second training text, the second training text being a long text whose length exceeds the preset threshold;
determining a plurality of highlight segments in the second training text according to the summary coarse-extraction model; and
combining the plurality of highlight segments in pairs to obtain the training data;
inputting the training data into the second model to be trained for training, to obtain the summary fine-ranking model, comprises:
determining a classification label of the training data according to the reader underline counts or reader comment counts of the two highlight segments contained in the training data; and
inputting the training data and the classification label into the second model to be trained for training, to obtain the summary fine-ranking model.
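The pairwise construction of training data described above can be sketched as follows. This is a minimal illustration, not the filing's implementation: the function name `build_pairwise_data`, the label convention (1 if the first segment has the higher count, otherwise 0), and the choice to skip tied pairs are all assumptions made here.

```python
from itertools import combinations

def build_pairwise_data(highlights):
    """highlights: list of (text, count) pairs, where count is the reader
    underline count or reader comment count of that highlight segment.
    Returns a list of ((text_a, text_b), label) training samples."""
    samples = []
    for (a, count_a), (b, count_b) in combinations(highlights, 2):
        if count_a == count_b:
            continue  # assumption: a tie gives no ordering signal, so skip it
        label = 1 if count_a > count_b else 0  # assumed label convention
        samples.append(((a, b), label))
    return samples
```

Each sample pairs two segments with a binary label, which is the shape of input a pairwise ranking classifier typically consumes.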
Wherein the text to be processed comprises a plurality of chapters;
determining, according to the summary fine-ranking model, the highlight-degree ranking of each of the plurality of candidate highlight segments comprises:
determining the reader underline count or reader comment count corresponding to each of the plurality of chapters;
determining ranking thresholds according to the reader underline count or reader comment count corresponding to each chapter, the ranking thresholds comprising a confidence threshold and a tier threshold; and
determining the highlight-degree ranking according to the ranking thresholds and the summary fine-ranking model.
Wherein determining the highlight-degree ranking according to the ranking thresholds and the summary fine-ranking model comprises:
classifying the plurality of candidate highlight segments according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, to obtain credible highlight segments and non-credible highlight segments;
determining the popularity tier of each credible highlight segment according to the tier threshold, and determining the highlight-degree ranking of the credible highlight segments according to the popularity tiers and the summary fine-ranking model; and
determining a predicted highlight-degree ranking of each non-credible highlight segment among at least one credible highlight segment corresponding to each popularity tier, and determining the highlight-degree ranking of the non-credible highlight segment according to the predicted highlight-degree rankings.
Wherein determining the highlight-degree ranking of the non-credible highlight segment according to the predicted highlight-degree rankings comprises:
determining the average of the predicted highlight-degree rankings corresponding to the plurality of popularity tiers; and
determining the highlight-degree ranking of the non-credible highlight segment according to the average ranking.
Wherein determining the highlight-degree ranking of the credible highlight segments according to the popularity tiers and the summary fine-ranking model comprises:
determining that a credible highlight segment in a higher popularity tier ranks above a credible highlight segment in a lower popularity tier; and
determining, according to the summary fine-ranking model, the highlight-degree ranking among credible highlight segments in the same popularity tier.
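One way to read the threshold-based ranking above is sketched below. The text does not say how the thresholds are derived from chapter counts or how an averaged predicted rank is merged into the final order, so several choices here are assumptions: the thresholds are passed in as parameters, a segment is credible when its count reaches the confidence threshold, a tier is the number of tier thresholds reached, and `prefer(a, b)` is a hypothetical stand-in for the fine-ranking model.

```python
from functools import cmp_to_key

def tier_of(count, tier_thresholds):
    """Popularity tier = number of tier thresholds that the count reaches."""
    return sum(count >= t for t in tier_thresholds)

def rank_candidates(candidates, conf_threshold, tier_thresholds, prefer):
    """Rank (text, count) candidates; prefer(a, b) is a hypothetical stand-in
    for the fine-ranking model and returns True if a outranks b."""
    def by_model(x, y):
        return -1 if prefer(x[0], y[0]) else 1

    credible = [c for c in candidates if c[1] >= conf_threshold]
    non_credible = [c for c in candidates if c[1] < conf_threshold]

    # Credible segments: higher tier first, model order inside each tier.
    tiers = {}
    for c in credible:
        tiers.setdefault(tier_of(c[1], tier_thresholds), []).append(c)
    ordered, tier_spans = [], []
    for t in sorted(tiers, reverse=True):
        members = sorted(tiers[t], key=cmp_to_key(by_model))
        tier_spans.append((len(ordered), members))  # (first position, members)
        ordered.extend(members)

    # Sort keys: credible segments keep their position; each non-credible
    # segment gets the average of its predicted positions across tiers.
    keyed = [(float(pos), 0, c[0]) for pos, c in enumerate(ordered)]
    for c in non_credible:
        predicted = [start + sum(prefer(m[0], c[0]) for m in members)
                     for start, members in tier_spans]
        avg = sum(predicted) / len(predicted) if predicted else float("inf")
        keyed.append((avg - 0.5, 1, c[0]))  # -0.5: insert before that position
    keyed.sort(key=lambda k: k[:2])
    return [text for _, _, text in keyed]
```

In this sketch, reader counts decide the order across tiers, and the model decides only the order within a tier and the placement of segments with too few reader signals to be trusted.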
Wherein, after determining the target content of the text to be processed according to the summary coarse-extraction model, the method further comprises:
displaying recommendation information, the recommendation information comprising the target content and being used to recommend the text to be processed to a user.
In a second aspect, an embodiment of the invention provides an apparatus for extracting target content, comprising:
a sample collection module, configured to obtain a first training text, the first training text being a long text whose length exceeds a preset threshold;
an information determination module, configured to determine training information of the first training text, the training information comprising a plurality of paragraphs in the first training text, type information of the first training text, and behavioral features of readers of the first training text;
the information determination module being further configured to determine a highlight segment of each of the plurality of paragraphs according to the behavioral features;
a model training module, configured to input the type information and the highlight segments into a first model to be trained for training, to obtain a summary coarse-extraction model; and
a text summarization module, configured to determine target content of a text to be processed according to the summary coarse-extraction model.
Wherein the model training module is further configured to:
determine training data for a second model to be trained according to the summary coarse-extraction model; and
input the training data into the second model to be trained for training, to obtain a summary fine-ranking model;
and the text summarization module is further configured to:
determine the target content according to the summary coarse-extraction model and the summary fine-ranking model.
Wherein the text summarization module is further configured to:
input the text to be processed into the summary coarse-extraction model to obtain a plurality of candidate highlight segments;
determine, according to the summary fine-ranking model, a highlight-degree ranking of each of the plurality of candidate highlight segments; and
determine a target highlight segment among the plurality of candidate highlight segments according to the highlight-degree ranking, the target content comprising the target highlight segment.
Wherein the behavioral features comprise a reader comment count or a reader underline count;
the sample collection module is further configured to:
obtain a second training text, the second training text being a long text whose length exceeds the preset threshold;
the information determination module is further configured to:
determine a plurality of highlight segments in the second training text according to the summary coarse-extraction model; and
combine the plurality of highlight segments in pairs to obtain the training data;
the model training module is further configured to:
determine a classification label of the training data according to the reader underline counts or reader comment counts of the two highlight segments contained in the training data; and
input the training data and the classification label into the second model to be trained for training, to obtain the summary fine-ranking model.
Wherein the text to be processed comprises a plurality of chapters;
the text summarization module is further configured to:
determine the reader underline count or reader comment count corresponding to each of the plurality of chapters;
determine ranking thresholds according to the reader underline count or reader comment count corresponding to each chapter, the ranking thresholds comprising a confidence threshold and a tier threshold; and
determine the highlight-degree ranking according to the ranking thresholds and the summary fine-ranking model.
Wherein the text summarization module is further configured to:
classify the plurality of candidate highlight segments according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, to obtain credible highlight segments and non-credible highlight segments;
determine the popularity tier of each credible highlight segment according to the tier threshold, and determine the highlight-degree ranking of the credible highlight segments according to the popularity tiers and the summary fine-ranking model; and
determine a predicted highlight-degree ranking of each non-credible highlight segment among at least one credible highlight segment corresponding to each popularity tier, and determine the highlight-degree ranking of the non-credible highlight segment according to the predicted highlight-degree rankings.
Wherein the text summarization module is further configured to:
determine the average of the predicted highlight-degree rankings corresponding to the plurality of popularity tiers; and
determine the highlight-degree ranking of the non-credible highlight segment according to the average ranking.
Wherein the text summarization module is further configured to:
determine that a credible highlight segment in a higher popularity tier ranks above a credible highlight segment in a lower popularity tier; and
determine, according to the summary fine-ranking model, the highlight-degree ranking among credible highlight segments in the same popularity tier.
Wherein the apparatus further comprises a display module, configured to:
display recommendation information, the recommendation information comprising the target content and being used to recommend the text to be processed to a user.
In a third aspect, an embodiment of the invention provides a device for extracting target content, comprising a processor, a memory, and a communication bus, where the communication bus implements connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the method for extracting target content provided in the first aspect.
In one possible design, the device provided by the invention may include modules corresponding to the behaviors performed in the above method. The modules may be software and/or hardware.
Another aspect of the embodiments of the invention provides a computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to execute the methods described in the above aspects.
Another aspect of the embodiments of the invention provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
In implementing the embodiments of the invention, a training text is first obtained; training information of the training text is then determined; next, a highlight segment of each of a plurality of paragraphs is determined according to the training information; a first model to be trained is then trained according to the training information and the highlight segments, to obtain a summary coarse-extraction model; training data for a second model to be trained is then determined according to the summary coarse-extraction model, so as to train a summary fine-ranking model; and finally the target content of a text to be processed is determined according to the summary coarse-extraction model and the summary fine-ranking model. The highlight content of books or long texts can thus be extracted automatically.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the invention or in the background more clearly, the drawings needed in the embodiments or the background are described below.
Fig. 1 is a schematic flowchart of a method for extracting target content provided by an embodiment of the invention;
Fig. 2 is a schematic structural diagram of a summary coarse-extraction model provided by an embodiment of the invention;
Fig. 3 is a schematic flowchart of another method for extracting target content provided by an embodiment of the invention;
Fig. 4 is a schematic flowchart of highlight-degree ranking provided by an embodiment of the invention;
Fig. 5 is a schematic flowchart of a two-stage method for extracting target content provided by an embodiment of the invention;
Fig. 6 is a schematic structural diagram of an apparatus for extracting target content provided by an embodiment of the invention;
Fig. 7 is a schematic structural diagram of a device for extracting target content provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, rather than all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a method for extracting target content provided by an embodiment of the invention. The method includes, but is not limited to, the following steps:
S101: obtain a training text, the training text being a long text whose length exceeds a preset threshold.
In a specific implementation, the training text may be a long text whose length exceeds a preset threshold, where the preset threshold may refer to the total number of words (for example, 10,000 words) or the total number of chapters/paragraphs of the training text. For example, but without limitation, several complete books may be obtained, with each complete book used as one training text.
S102: determine training information of the training text, the training information including a plurality of paragraphs in the training text, type information of the training text, and behavioral features of readers of the training text.
In a specific implementation, first, the training information may include a plurality of paragraphs in the training text. The training text (that is, a complete book) may first be split into chapters, and each chapter then split into paragraphs. If the word count of a paragraph exceeds a threshold (for example, 504), the paragraph must be split again into two or more paragraphs whose word counts do not exceed the threshold. For example, book XXX contains 3 chapters: chapter 1 contains 3 paragraphs of fewer than 504 words each, chapter 2 contains 4 paragraphs of fewer than 504 words each, and chapter 3 contains 2 paragraphs of fewer than 504 words each and 1 paragraph of 896 words. The 896-word paragraph must therefore be split into two paragraphs of 504 and 392 words, giving 3 + 4 + 2 + 2 = 11 paragraphs in the training text.
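The splitting rule in the example above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, length is counted in characters (an assumption that suits Chinese text), and an over-long paragraph is cut at fixed offsets, which matches the 504 + 392 split in the example but may not be how the filing chooses cut points.

```python
MAX_LEN = 504  # per-paragraph limit taken from the example in the text

def split_paragraph(text, limit=MAX_LEN):
    """Cut one paragraph into consecutive pieces of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)] or [text]

def build_paragraphs(chapters, limit=MAX_LEN):
    """Flatten a book (a list of chapters, each a list of paragraph strings)
    into a list of paragraphs that all respect the length limit."""
    paragraphs = []
    for chapter in chapters:
        for paragraph in chapter:
            paragraphs.extend(split_paragraph(paragraph, limit))
    return paragraphs
```

Applied to a book shaped like the example (3, 4, and 3 paragraphs per chapter, the last one 896 characters long), `build_paragraphs` returns 11 paragraphs.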
Second, because books/texts of different types or fields differ greatly in style, the training information may also include the type information of the training text. The type information may be the book category assigned at publication, such as literature and art or suspense fiction.
Third, the training information may also include the behavioral features of readers of the training text. Readers generally like to record the highlight content of a book by underlining or commenting, so the behavioral features may be the number of underlines or comments that readers have made on each paragraph/chapter of the book within a period of time or since the book was published.
S103: determine the highlight segment of each of the plurality of paragraphs according to the behavioral features.
In a specific implementation, the segment of each paragraph whose reader underline count or reader comment count exceeds a certain threshold may be, but is not limited to being, taken as the highlight segment of that paragraph. The threshold may be determined by comprehensive analysis of factors such as how long the book has been on the shelves and its read count/sales volume. If the reader underline count or reader comment count of a paragraph is 0, one possible approach is to set the highlight segment of that paragraph to "no answer".
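The labeling rule of S103 can be sketched as follows. The names `label_highlight` and `NO_ANSWER` are hypothetical, and choosing the single most-marked segment of a paragraph is an assumption made here for simplicity; the text only requires that the count exceed the threshold.

```python
NO_ANSWER = ""  # hypothetical marker for "this paragraph has no highlight segment"

def label_highlight(segments, threshold):
    """segments: list of (text, count) pairs for one paragraph, where count is
    the reader underline count or reader comment count of that segment.
    Returns the most-marked segment if its count exceeds the threshold,
    otherwise NO_ANSWER (which also covers a count of 0)."""
    best = max(segments, key=lambda s: s[1], default=None)
    if best is None or best[1] <= threshold:
        return NO_ANSWER
    return best[0]
```

Paragraphs labeled `NO_ANSWER` can still be kept as training samples, so the model also sees paragraphs without a highlight segment.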
S104: input the type information, the plurality of paragraphs, and the highlight segments into a model to be trained for training, to obtain a summary coarse-extraction model.
In a specific implementation, the type information (denoted A) may first be serialized to obtain A = {a_1, a_2, ..., a_n}. Then, for each paragraph, the paragraph (denoted Q) is first serialized to obtain Q = {q_1, q_2, ..., q_n}, and A and Q are concatenated using the special tokens that the model to be trained uses to separate different kinds of sequences, giving one group of model training data (denoted I). The model to be trained may be based on the BERT (Bidirectional Encoder Representations from Transformers) model, in which the special tokens for separating Q and A are [CLS] and [SEP], so that
I = {[CLS]; A; [SEP]; Q; [SEP]} (1)
Of course, the model to be trained may also be a SummaRuNNer model, but that model must be adjusted for training. For example, resampling or down-sampling must be added to alleviate class imbalance.
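The input sequence of formula (1) can be sketched as follows. The function name `build_input` is hypothetical, tokens are taken to be single characters (an assumption), and the segment ids follow the usual BERT convention of 0 for the first sequence and 1 for the second, which the text does not spell out.

```python
def build_input(type_tokens, paragraph_tokens):
    """Build I = [CLS] A [SEP] Q [SEP] and the matching segment ids.
    type_tokens is the serialized type information A; paragraph_tokens is
    the serialized paragraph Q."""
    tokens = (["[CLS]"] + list(type_tokens) + ["[SEP]"]
              + list(paragraph_tokens) + ["[SEP]"])
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers Q
    # and the final [SEP].
    segment_ids = [0] * (len(type_tokens) + 2) + [1] * (len(paragraph_tokens) + 1)
    return tokens, segment_ids
```

Placing the type information in the first segment lets the same paragraph be scored differently for different book categories.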
For example, Fig. 2 shows a summary coarse-extraction model obtained by training. The model performs feature mapping with a full self-attention mechanism and includes an embedding layer, a BERT layer, and an extraction layer. The embedding layer is further divided into a token embedding layer, a segment embedding layer, and a position embedding layer, which represent the input text as vectors along several dimensions. As shown in the figure, the input at the i-th position of the input information passes through the BERT layer to give a hidden-layer vector T_i, where the i-th position is the position of the i-th character of the input; for example, in Fig. 2 the 3rd position and the 7th position are each occupied by one character of the input (the latter rendered here as "only"). An extraction layer then computes the start probability and end probability for each position, where the start probability is the probability that the position is the starting point of a highlight segment and the end probability is the probability that the position is the end point of a highlight segment. Applying a feature transformation with the start matrix S and end matrix E learned during model training yields the start probability (denoted P_1) and end probability (denoted P_2) of each position, where P_1 and P_2 of the i-th position can be calculated according to formulas (2) and (3), respectively.
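Formulas (2) and (3) are not reproduced in this text. As a hedged sketch only, assuming the extraction layer follows the usual BERT span-extraction form (S and E treated as weight vectors, a dot product with T_i, and a softmax over all positions j), they might be written as follows; the exact form in the original filing may differ.

```latex
\begin{align}
P_1(i) &= \frac{\exp(S \cdot T_i)}{\sum_{j} \exp(S \cdot T_j)} \tag{2} \\
P_2(i) &= \frac{\exp(E \cdot T_i)}{\sum_{j} \exp(E \cdot T_j)} \tag{3}
\end{align}
```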
In practical applications, the final output of the summary coarse-extraction model is the text fragment corresponding to the continuous, valid interval with the largest start probability and end probability. For example, as shown in Fig. 2, for the paragraph beginning "Loneliness is a ...", the summary coarse-extraction model outputs the highlight segment "Loneliness is a soul worth understanding that seeks understanding and cannot obtain it; it is tragic."
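The selection of the output interval can be sketched as follows. This is an illustration under stated assumptions: the per-position scores are normalized with a softmax, a "valid" interval is taken to mean start <= end, and the two probabilities are combined by multiplication; the text states only that the interval with the largest start and end probability is chosen. The names `softmax` and `best_span` and the optional `max_len` cap are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_scores, end_scores, max_len=None):
    """Return (start, end, score) for the valid span (start <= end) that
    maximizes P1[start] * P2[end]; max_len optionally caps the span length."""
    p1, p2 = softmax(start_scores), softmax(end_scores)
    best = (0, 0, -1.0)
    for i, p_start in enumerate(p1):
        last = len(p2) if max_len is None else min(len(p2), i + max_len)
        for j in range(i, last):
            score = p_start * p2[j]
            if score > best[2]:
                best = (i, j, score)
    return best
```

Restricting the search to start <= end is what keeps the output a single continuous fragment of the paragraph.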
It should be noted that some training data containing no highlight segment may be deliberately added during model training, to train the summary coarse-extraction model's ability to judge whether a paragraph contains a highlight segment.
From the segments underlined or commented on by readers (the highlight segments), the summary coarse-extraction model can fully learn the N-gram features of highlight segments, and because its output considers only the start and end probabilities, it effectively avoids the class imbalance that arises when training sequence-labeling methods, thereby improving the quality of the extracted highlight segments.
S105: determine the target content of a text to be processed according to the summary coarse-extraction model.
In a specific implementation, the text to be processed may be a book, or a document/text of any other length. First, the text to be processed may be split by paragraph to obtain a plurality of paragraphs; if the total word count of a paragraph exceeds the threshold, that paragraph must be split a second time. Each resulting paragraph is then input into the summary coarse-extraction model, which determines whether the paragraph contains a highlight segment and outputs the corresponding highlight segment. If a paragraph contains no highlight segment, "no answer" is output; otherwise, the corresponding highlight segment is output. After the highlight segment of each paragraph has been obtained, these highlight segments may be, but are not limited to being, concatenated into the target content of the text to be processed. The target content is the highlight content of the text to be processed, such as content expressing the central idea, or content with refined wording or ornate language.
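The inference flow of S105 can be sketched as follows. `coarse_model` is a hypothetical callable standing in for the trained model: it takes one paragraph and returns either a highlight segment or the no-answer marker. Joining the segments with newlines is an assumption; the text says only that they may be concatenated.

```python
def extract_target_content(paragraphs, coarse_model, no_answer=""):
    """Run the coarse-extraction model on each paragraph, drop paragraphs
    that yield no highlight segment, and join the rest in document order."""
    highlights = []
    for paragraph in paragraphs:
        segment = coarse_model(paragraph)
        if segment != no_answer:
            highlights.append(segment)
    return "\n".join(highlights)
```

The paragraphs passed in are expected to have been split to the length limit beforehand, as described for the training text.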
Optionally, after the target content of the text to be processed is determined, recommendation information may be displayed. The recommendation information includes the target content of the text to be processed (for example a book) and is used to recommend the book to a user. For example, in a speed-reading scenario, showing the user the highlight content of a book helps readers who suffer from long-text anxiety or who prefer "light reading" to quickly skim the whole book or its chapters.
In the embodiment of the present invention, a training text is first obtained, the training text being a long text whose length exceeds a preset threshold. The training information of the training text is then determined, which includes multiple paragraphs of the training text, the type information of the training text, and the behavioral features of readers of the training text. Next, the highlight segment of each of the multiple paragraphs is determined according to the behavioral features. The type information, the multiple paragraphs, and the highlight segments are then input into the first model to be trained for training, to obtain the summary coarse-extraction model. Finally, the target content of the text to be processed is determined according to the summary coarse-extraction model. This enables automatic extraction of the highlight content of books or other long texts, and taking the type information of the book into account improves the accuracy of highlight segment/content extraction.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of another method for extracting target content provided by an embodiment of the present invention. The method includes, but is not limited to, the following steps:
S301, obtain a training text, the training text being a long text whose length exceeds a preset threshold. This step is the same as S101 in the previous embodiment and is not described again.
S302, determine the training information of the training text, the training information including multiple paragraphs of the training text, the type information of the training text, and the behavioral features of readers of the training text. This step is the same as S102 in the previous embodiment and is not described again.
S303, determine the highlight segment of each of the multiple paragraphs according to the behavioral features. This step is the same as S103 in the previous embodiment and is not described again.
S304, input the type information, the multiple paragraphs, and the highlight segments into the first model to be trained, to obtain the summary coarse-extraction model. This step is the same as S104 in the previous embodiment and is not described again.
S305, determine the training data of the second model to be trained according to the summary coarse-extraction model.
In a specific implementation, a training text may be obtained first. The training text obtained in this step is usually different from the training text obtained in step S301, and may likewise be a long text, such as a complete book, whose length exceeds the preset threshold. The obtained training text is then divided into multiple paragraphs whose length does not exceed a threshold (for example 504 words), and the paragraphs are input into the summary coarse-extraction model in turn to obtain the highlight segment of each paragraph. The second model to be trained may be, but is not limited to, a pairwise-based BERT model, so the highlight segments determined by the summary coarse-extraction model may be combined in pairs to form the training data of the second model to be trained. Here, pairwise combination means combining the highlight segments that belong to the same paragraph in pairs, with each combination serving as one group of training data. Alternatively, paragraphs need not be distinguished, and the segments may be combined in arbitrary pairs directly. For example, combining highlight segments A and B yields A+B, and A+B is one group of training data.
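The two pairing options described above (pairs within the same paragraph, or arbitrary pairs across all segments) can be sketched as follows. The function name `build_pairs` and the dictionary layout keyed by a paragraph identifier are hypothetical choices for this example only.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_pairs(highlights_by_paragraph: Dict[str, List[str]],
                within_paragraph: bool = True) -> List[Tuple[str, str]]:
    """Combine highlight segments in pairs; each pair is one group of training data."""
    if within_paragraph:
        pairs: List[Tuple[str, str]] = []
        # Only segments that belong to the same paragraph are paired.
        for highlights in highlights_by_paragraph.values():
            pairs.extend(combinations(highlights, 2))
        return pairs
    # Paragraphs are not distinguished: pair all segments arbitrarily.
    all_highlights = [h for hs in highlights_by_paragraph.values() for h in hs]
    return list(combinations(all_highlights, 2))
```

`itertools.combinations` emits each unordered pair once, so A+B is generated but B+A is not; whether both orders are needed for training is not stated in the source.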
S306, input the training data into the second model to be trained for training, to obtain the summary fine-ranking model.
In a specific implementation, the reader comment count or reader underline count of the two highlight segments contained in each group of training data is determined first.
Then, the classification label (label) of the group of training data is determined according to the reader comment count or reader underline count. In the embodiment of the present invention, the training data are divided into three classes, with labels 1, 0, and -1 respectively. Taking training data A+B as an example, as shown in formulas (4)-(6): 1) if the reader comment count or reader underline count of highlight segment A is greater than that of highlight segment B, A is more excellent than B, so the classification label of A+B is set to 1; 2) if the reader comment count or reader underline count of highlight segment A is equal to that of highlight segment B, A and B are equally excellent, so the classification label of A+B is set to 0; 3) if the reader comment count or reader underline count of highlight segment A is less than that of highlight segment B, B is more excellent than A, so the classification label of A+B is set to -1.
Label = 1 indicates Rank(A) > Rank(B)    (4)
Label = 0 indicates Rank(A) = Rank(B)    (5)
Label = -1 indicates Rank(A) < Rank(B)    (6)
Then, each group of training data and its classification label are input into the second model to be trained for training, to obtain the summary fine-ranking model.
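The labeling rule of formulas (4)-(6) can be written as a small function. The names `pair_label` and `labeled_example` are hypothetical, and the counts passed in may be either reader comment counts or reader underline counts.

```python
from typing import Tuple


def pair_label(count_a: int, count_b: int) -> int:
    """Return 1, 0 or -1 following formulas (4)-(6)."""
    if count_a > count_b:
        return 1    # Rank(A) > Rank(B)
    if count_a == count_b:
        return 0    # Rank(A) = Rank(B)
    return -1       # Rank(A) < Rank(B)


def labeled_example(segment_a: str, count_a: int,
                    segment_b: str, count_b: int) -> Tuple[str, str, int]:
    """Build one (A, B, label) group of training data for the pairwise model."""
    return segment_a, segment_b, pair_label(count_a, count_b)
```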
S307, determine the target content of the text to be processed according to the summary coarse-extraction model and the summary fine-ranking model.
In a specific implementation, the text to be processed may first be input into the summary coarse-extraction model to obtain multiple candidate highlight segments. The excellence ranking of each candidate highlight segment is then determined according to the summary fine-ranking model. The candidate highlight segments may be combined in pairs and input into the summary fine-ranking model, so that the relative excellence of every two candidate highlight segments is obtained first and the excellence ranking among the candidate highlight segments is derived from it. Then, the target highlight segments among the candidate highlight segments are determined according to the excellence ranking; for example, the candidate highlight segments ranked in the top N may be taken as the target highlight segments. Finally, the target highlight segments are combined to form the target content of the text to be processed.
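One way to turn pairwise comparisons into a ranking and a top-N selection is sketched below. The source does not state how the pairwise outcomes are aggregated; counting wins per candidate is an assumption made here, as are the function names and the `compare` callable, which stands in for the summary fine-ranking model and returns 1, 0, or -1 for a pair. Candidates are assumed to be distinct strings.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def rank_by_pairwise_wins(candidates: Sequence[str],
                          compare: Callable[[str, str], int]) -> List[str]:
    """Rank candidates by how many pairwise comparisons each one wins."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        label = compare(a, b)  # 1: a ranks higher, -1: b ranks higher, 0: tie
        if label > 0:
            wins[a] += 1
        elif label < 0:
            wins[b] += 1
    # sorted() is stable, so candidates with equal wins keep their input order.
    return sorted(candidates, key=lambda c: -wins[c])


def top_n_content(candidates: Sequence[str],
                  compare: Callable[[str, str], int],
                  n: int) -> str:
    """Take the top-N candidates as target segments and combine them."""
    return "\n".join(rank_by_pairwise_wins(candidates, compare)[:n])
```

Win counting tolerates a comparator that is not perfectly transitive, which a learned pairwise model is not guaranteed to be.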
Because the training data of the summary fine-ranking model are constructed from reader comment counts or reader underline counts, reader behavioral features have to be discarded when the model is actually used, so as to guarantee the reliability of the model output. Therefore, in the embodiment of the present invention, the behavioral features are exploited through popularity tiering to make up for this shortcoming and improve the accuracy of the excellence ranking. As shown in Fig. 4, determining the excellence ranking of the candidate highlight segments mainly includes the following steps:
(1) Determine the reader underline count or reader comment count corresponding to each of the multiple chapters of the text to be processed, and determine ranking thresholds according to the per-chapter reader underline counts or reader comment counts. The ranking thresholds may include a confidence threshold and tier thresholds. The distribution of the reader underline counts or reader comment counts over the chapters may be computed. For example, the average, highest, and lowest values of the per-chapter reader underline count or reader comment count are determined; the average is used as the confidence threshold, and the tier thresholds are determined from the highest and lowest values. For example, if the highest value is 1000 and the lowest value is 100, the tier threshold of the first popularity tier may be set to 800, that of the second popularity tier to 500, and that of the third popularity tier to 100. Determining the confidence threshold and tier thresholds heuristically allows books with different shelf times and different sales popularity to be treated differently, which improves the accuracy of the excellence ranking.
(2) Determine the excellence ranking according to the ranking thresholds and the summary fine-ranking model. The multiple candidate highlight segments may be classified, according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, into trusted highlight segments and untrusted highlight segments. One possible implementation is to treat candidate highlight segments whose reader comment count or reader underline count is greater than the confidence threshold as trusted highlight segments, and those whose count is not greater than the confidence threshold as untrusted highlight segments. On this basis, on the one hand, for the trusted highlight segments, the popularity tier of each trusted highlight segment may first be determined according to the tier thresholds: when the reader comment count or reader underline count of a trusted highlight segment is higher than the tier threshold of a given popularity tier, the segment is determined to belong to that tier. The popularity tiers may be divided into multiple levels, such as level 1, level 2, ..., level n; the specific number of tiers may be determined according to the application scenario and user requirements. The excellence ranking of the trusted highlight segments is then determined according to the popularity tiers and the summary fine-ranking model. A trusted highlight segment in a higher popularity tier ranks above one in a lower tier, while the excellence ranking among multiple trusted highlight segments in the same popularity tier is determined by the summary fine-ranking model: the trusted highlight segments may be input into the summary fine-ranking model in pairs, so that the relative excellence of every two trusted highlight segments is determined first and the excellence ranking among the trusted highlight segments of the same tier is derived from it.
For example, the trusted highlight segments include A, B, C, D, and E, where the first popularity tier includes A, B, and C and the second popularity tier includes D and E. A+B, A+C, and B+C are input into the summary fine-ranking model in turn; the result is that A is more excellent than B and C, and B is less excellent than C, so the excellence rankings of A, B, and C are 1, 3, and 2 respectively. Similarly, the excellence rankings of D and E are 2 and 1. The overall ranking of A, B, C, D, and E from high to low is therefore A, C, B, E, D.
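The threshold computation, trusted/untrusted classification, tiering, and within-tier ranking described in steps (1) and (2) can be sketched together as follows. Several points are assumptions rather than statements of the source: the rule deriving tier thresholds from the highest and lowest values is not given, so the percentages 80 and 50 are chosen only because they reproduce the example values 800, 500, and 100; a segment is placed in the first tier whose threshold it exceeds; within-tier ranking counts pairwise wins; and `compare` is a stand-in for the summary fine-ranking model.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def ranking_thresholds(chapter_counts: Sequence[int],
                       tier_percents: Sequence[int] = (80, 50)) -> Tuple[float, List[float]]:
    """Confidence threshold = average; tier thresholds derived from highest and lowest."""
    confidence = mean(chapter_counts)
    highest, lowest = max(chapter_counts), min(chapter_counts)
    # Assumed rule: fixed percentages of the highest value, then the lowest value.
    tiers = [highest * p / 100 for p in tier_percents] + [float(lowest)]
    return confidence, tiers


def assign_tier(count: float, tier_thresholds: Sequence[float]) -> int:
    """Return the 1-based index of the first tier whose threshold the count exceeds."""
    for tier, threshold in enumerate(tier_thresholds, start=1):
        if count > threshold:
            return tier
    return len(tier_thresholds)  # assumption: fall back to the lowest tier


def rank_trusted(counts: Dict[str, int], confidence: float,
                 tier_thresholds: Sequence[float],
                 compare: Callable[[str, str], int]) -> List[str]:
    """Rank trusted segments: higher tier first, pairwise wins within a tier."""
    tiers: Dict[int, List[str]] = {}
    for segment, count in counts.items():
        if count > confidence:  # trusted; all others are untrusted and skipped here
            tiers.setdefault(assign_tier(count, tier_thresholds), []).append(segment)
    ranking: List[str] = []
    for tier in sorted(tiers):
        members = tiers[tier]
        wins = {m: 0 for m in members}
        for a, b in combinations(members, 2):
            label = compare(a, b)
            if label > 0:
                wins[a] += 1
            elif label < 0:
                wins[b] += 1
        ranking.extend(sorted(members, key=lambda m: -wins[m]))
    return ranking
```

With comparison outcomes matching the example above (A above B and C, C above B, E above D), this sketch yields the order A, C, B, E, D.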
On the other hand, for an untrusted highlight segment, its predicted excellence ranking among the at least one trusted highlight segment of each popularity tier may be determined first. For example, if the trusted highlight segments of the first popularity tier include A, B, and C, then an untrusted highlight segment G is assumed to be a segment of the first popularity tier and takes part in the within-tier ranking of A, B, and C. If the ranking of A, B, C, and G obtained from the summary fine-ranking model is B, A, G, C from high to low, the predicted excellence ranking of G in the first popularity tier is 3. The excellence ranking of the untrusted highlight segment is then determined from the predicted rankings: the average of the predicted excellence rankings over the multiple popularity tiers may be determined first, and the excellence ranking of the untrusted highlight segment is then determined from that average ranking.
For example, the trusted highlight segments of the first popularity tier include A, B, and C, and those of the second popularity tier include D and E. The predicted excellence rankings of the untrusted highlight segment G in the first and second popularity tiers are 3 and 2, so its average predicted ranking is 2.5. Since the overall ranking of the trusted highlight segments A, B, C, D, and E from high to low is A, C, B, E, D, the overall ranking of A, B, C, D, E, and G from high to low is A, C, G, B, E, D.
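The placement of an untrusted highlight segment by its average predicted ranking can be sketched as follows. The function names are hypothetical, `beats(x, y)` is a stand-in for the summary fine-ranking model judging x to be more excellent than y, and placing a trusted segment ahead of an untrusted one when their positions tie is an assumption.

```python
from typing import Callable, List, Sequence


def predicted_rank(segment: str, tier_ranking: Sequence[str],
                   beats: Callable[[str, str], bool]) -> int:
    """Predicted rank inside one tier: 1 + number of tier members judged better."""
    return 1 + sum(1 for member in tier_ranking if beats(member, segment))


def place_untrusted(segment: str, tiers: Sequence[Sequence[str]],
                    beats: Callable[[str, str], bool]) -> List[str]:
    """Insert an untrusted segment into the overall ranking by its average predicted rank."""
    average = sum(predicted_rank(segment, t, beats) for t in tiers) / len(tiers)
    overall = [member for tier in tiers for member in tier]
    # Trusted segments keep positions 1..k; the second tuple field breaks ties
    # in favour of trusted segments (assumption).
    keyed = [(position, 0, member) for position, member in enumerate(overall, start=1)]
    keyed.append((average, 1, segment))
    return [member for _, _, member in sorted(keyed)]
```

In the example above, G is judged worse than A and C in the first tier (predicted rank 3) and worse than E in the second tier (predicted rank 2), so its average is 2.5 and it lands between C and B.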
It should be noted that, as shown in Fig. 4, if no candidate highlight segment can be reliably tiered, that is, there is no trusted highlight segment, the summary fine-ranking model is used directly to determine the excellence ranking among the untrusted highlight segments.
In conclusion as shown in figure 5, the extracting method of splendid contents provided in an embodiment of the present invention includes two steps:
The thick essence row that mentions and make a summary of abstract.Wherein, abstract, which slightly mentions, is utilized the abstract model of supervision and calls together to paragraph progress wonderful
It returns, the abstract that essence of making a summary row is utilized semi-supervised order models slightly to mention global tuning abstract is as a result, final to obtain
Splendid contents.Method in the embodiment of the present invention can be applied to a variety of actual scenes, bring to user good using body
It tests.For example, first, in speed scene, made a summary using pandect, the reader of long article anxiety can be helped to skip pandect/chapters and sections.
Second, in recommending scene, it can use brief recommendation language of the splendid contents of the invention extracted as a book, to attract use
Family, which is clicked, reads or buys books.Third, in long-tail content mining scene: new book restocking or unexpected winner minority's book are promoted, can be with
It is shown by the splendid contents of extraction to user and shows the books, to solve the problems, such as the cold start-up of these books.4th, a
Property scene in, since the content in a book is multifarious, splendid contents that the present invention extracts can also draw a portrait with user and tie
It closes, to realize Individualized Notification Service etc..
In the embodiment of the present invention, a training text is first obtained and its training information is determined; the training information includes multiple paragraphs of the training text, the type information of the training text, and the behavioral features of readers of the training text. The highlight segment of each of the multiple paragraphs can then be determined according to the behavioral features. Next, the type information, the multiple paragraphs, and the highlight segments are input into the first model to be trained to obtain the summary coarse-extraction model. Then, the training data of the second model to be trained are determined according to the summary coarse-extraction model, and the training data are input into the second model to be trained for training, to obtain the summary fine-ranking model. Finally, the target content of the text to be processed is determined according to the summary coarse-extraction model and the summary fine-ranking model. All supervision data, for both the supervised summary coarse-extraction model and the semi-supervised summary fine-ranking model, are constructed from user behavioral features in popular books, so the highlight content of long texts such as books can be extracted automatically without any additional manually labeled data. In addition, the summary coarse-extraction model innovatively uses a machine-reading-comprehension-style method to predict the highlight segment in a paragraph, which helps ensure the accuracy of highlight segment/content extraction.
The method of the embodiment of the present invention has been described above; the related apparatus of the embodiment of the present invention is provided below.
Referring to Fig. 6, Fig. 6 is a schematic structural diagram of an apparatus for extracting target content provided by an embodiment of the present invention. The apparatus may include:
Sample collection module 601, configured to obtain a first training text, the first training text being a long text whose length exceeds a preset threshold.
In a specific implementation, the first training text may be a long text whose length exceeds a preset threshold, where the preset threshold may refer to the total word count (for example 10,000 words) or the total number of chapters/paragraphs of the training text. Multiple complete books may be, but are not limited to being, obtained, with each complete book used as one training text.
Information determination module 602, configured to determine the training information of the training text, the training information including multiple paragraphs of the training text, the type information of the training text, and the behavioral features of readers of the training text.
In a specific implementation, in a first aspect, the training information may include multiple paragraphs of the training text. The training text (i.e., a complete book) may first be split by chapter, and each chapter is then split into multiple paragraphs. If the word count of a paragraph exceeds a threshold (for example 504), the paragraph needs to be split again into two or more paragraphs whose word counts do not exceed the threshold.
In a second aspect, since books/texts of different types or fields differ greatly in style, the training information may also include the type information of the training text. The type information may be the book category assigned when the book was published, such as literature and art, suspense novels, and so on.
In a third aspect, the training information may also include the behavioral features of readers of the training text. Readers generally like to mark the highlight content of a book by underlining or commenting, so the behavioral features may be the number of underlines or comments that readers have made on each paragraph/chapter of the book within a period of time or since the book was published.
Information determination module 602 is further configured to determine the highlight segment of each of the multiple paragraphs according to the behavioral features.
In a specific implementation, the segments in each paragraph whose reader underline count or reader comment count is greater than a certain threshold may be, but are not limited to being, used as the highlight segments of the paragraph. The threshold may be determined by comprehensively analyzing factors such as the shelf time and the click-through reads/sales volume of the book. If the reader underline count or reader comment count of a paragraph is 0, one possible countermeasure is to set the highlight segment of the paragraph to "no answer".
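The thresholding rule for labeling highlight segments can be sketched as follows. The function name `paragraph_highlights` and the dictionary layout are hypothetical; `None` stands for "no answer". The source only specifies "no answer" for a count of 0, so returning it whenever no segment exceeds the threshold is an assumption of this sketch.

```python
from typing import Dict, List, Optional


def paragraph_highlights(segment_counts: Dict[str, int],
                         threshold: int) -> Optional[List[str]]:
    """Return segments whose underline/comment count exceeds the threshold, or None."""
    highlights = [segment for segment, count in segment_counts.items()
                  if count > threshold]
    # None stands for "no answer": the paragraph contributes a negative example.
    return highlights if highlights else None
```

Paragraphs labeled "no answer" in this way correspond to the training data without highlight segments that the source suggests adding deliberately.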
Model training module 603, configured to input the type information, the multiple paragraphs, and the highlight segments into the model to be trained for training, to obtain the summary coarse-extraction model.
In a specific implementation, the type information (denoted A) may be serialized first to obtain A = {a_1, a_2, ..., a_n}. Then, for each paragraph, the paragraph (denoted Q) is serialized to obtain Q = {q_1, q_2, ..., q_n}, and A and Q are concatenated using the special characters that the model to be trained uses to separate different types of sequences, to obtain one group of model training data (denoted I). The model to be trained may be a BERT-based model; in a BERT model, the special characters used to separate Q and A are CLS and SEP, which gives
I = {[CLS]; A [SEP]; Q [SEP]}    (7)
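The construction of the input sequence I in formula (7) can be illustrated with plain Python lists. Serializing A and Q character by character is an assumption made to keep the example self-contained; a real BERT pipeline would use the tokenizer shipped with the model. The function name `build_input` is hypothetical.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def build_input(type_info: str, paragraph: str) -> List[str]:
    """Build I = {[CLS]; A [SEP]; Q [SEP]} from type information A and paragraph Q."""
    a_tokens = list(type_info)   # A = {a_1, ..., a_n}, character-level (assumption)
    q_tokens = list(paragraph)   # Q = {q_1, ..., q_n}, character-level (assumption)
    return [CLS] + a_tokens + [SEP] + q_tokens + [SEP]
```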
Of course, the model to be trained may also be a SummaRuNNer model, but the model needs to be adjusted during training. For example, a resampling or down-sampling method needs to be added to alleviate the class imbalance problem.
In practical applications, the final output of the summary coarse-extraction model is the text segment corresponding to the continuous and valid interval with the highest start probability and end probability.
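Selecting the output span from the start and end probabilities can be sketched as follows. The source does not define "valid" or how the two probabilities are combined; this sketch assumes that a valid interval satisfies start <= end (optionally with a maximum length) and that the score of an interval is the product of its start and end probabilities. The function name `best_span` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def best_span(start_probs: Sequence[float], end_probs: Sequence[float],
              max_len: Optional[int] = None) -> Tuple[int, int]:
    """Return (start, end) of the valid interval with the highest score."""
    best, best_score = (0, 0), -1.0
    for start, p_start in enumerate(start_probs):
        # Valid intervals end at or after their start, optionally bounded in length.
        last = len(end_probs) if max_len is None else min(len(end_probs), start + max_len)
        for end in range(start, last):
            score = p_start * end_probs[end]  # assumed scoring: product of probabilities
            if score > best_score:
                best, best_score = (start, end), score
    return best
```

The extracted text is then `tokens[start:end + 1]`. Taking the two maxima independently could yield an end position before the start position, which is why the search is restricted to valid intervals.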
It should be noted that some training data containing no highlight segment may deliberately be added during model training, so that the summary coarse-extraction model learns to judge whether a paragraph contains a highlight segment.
Text summarization module 604, configured to determine the target content of the text to be processed according to the summary coarse-extraction model.
In a specific implementation, the text to be processed may be a book, or a document/text of any other length. First, the text to be processed may be split by paragraph to obtain multiple paragraphs; if the total word count of a paragraph exceeds a preset threshold, that paragraph needs to be split a second time. Each resulting paragraph is then input into the summary coarse-extraction model separately, to determine whether the paragraph contains a highlight segment and to output the corresponding highlight segment: if a paragraph contains no highlight content, "no answer" is output; otherwise, the corresponding highlight segment is output. After the highlight segment of each paragraph is obtained, the multiple highlight segments may be, but are not limited to being, spliced into the target content of the text to be processed.
Optionally, the apparatus in the embodiment of the present invention may further include a display module, configured to display recommendation information after the target content of the text to be processed is determined. The recommendation information includes the highlight content of the text to be processed (for example a book) and is used to recommend the book to a user.
Optionally, sample collection module 601 is further configured to obtain a second training text. The second training text is usually different from the first training text, and may likewise be a long text, such as a complete book, whose length exceeds the preset threshold.
Optionally, model training module 603 is further configured to determine the training data of the second model to be trained according to the summary coarse-extraction model. The obtained training text may first be divided into multiple paragraphs whose length does not exceed a threshold (for example 504 words), and the paragraphs are input into the summary coarse-extraction model in turn to obtain the highlight segment of each paragraph. The second model to be trained may be, but is not limited to, a pairwise-based BERT model, so the highlight segments determined by the summary coarse-extraction model may be combined in pairs to form the training data of the second model to be trained. Here, pairwise combination means combining the highlight segments that belong to the same paragraph in pairs, with each combination serving as one group of training data. Alternatively, paragraphs need not be distinguished, and the segments may be combined in arbitrary pairs directly.
Optionally, model training module 603 is further configured to input the training data into the second model to be trained for training, to obtain the summary fine-ranking model. The reader comment count or reader underline count of the two highlight segments contained in each group of training data may be determined first. Then, the classification label (label) of the group of training data is determined according to the reader comment count or reader underline count. In the embodiment of the present invention, the training data are divided into three classes, with labels 1, 0, and -1 respectively. Taking training data A+B as an example, as shown in formulas (4)-(6): 1) if the reader comment count or reader underline count of highlight segment A is greater than that of highlight segment B, A is more excellent than B, so the classification label of A+B is set to 1; 2) if the reader comment count or reader underline count of highlight segment A is equal to that of highlight segment B, A and B are equally excellent, so the classification label of A+B is set to 0; 3) if the reader comment count or reader underline count of highlight segment A is less than that of highlight segment B, B is more excellent than A, so the classification label of A+B is set to -1. Then, each group of training data and its classification label are input into the second model to be trained for training, to obtain the summary fine-ranking model.
Text summarization module 604 is further configured to determine the target content of the text to be processed according to the summary coarse-extraction model and the summary fine-ranking model.
In a specific implementation, the text to be processed may first be input into the summary coarse-extraction model to obtain multiple candidate highlight segments. The excellence ranking of each candidate highlight segment is then determined according to the summary fine-ranking model. The candidate highlight segments may be combined in pairs and input into the summary fine-ranking model, so that the relative excellence of every two candidate highlight segments is obtained first and the excellence ranking among the candidate highlight segments is derived from it. Then, the target highlight segments among the candidate highlight segments are determined according to the excellence ranking; for example, the candidate highlight segments ranked in the top N may be taken as the target highlight segments. Finally, the target highlight segments are combined to form the target content of the text to be processed.
Because the training data of the summary fine-ranking model are constructed from reader comment counts or reader underline counts, reader behavioral features have to be discarded when the model is actually used, so as to guarantee the reliability of the model output. Therefore, in the embodiment of the present invention, the behavioral features are exploited through popularity tiering to make up for this shortcoming and improve the accuracy of the excellence ranking. As shown in Fig. 4, determining the excellence ranking of the candidate highlight segments mainly includes the following steps:
(1) Determine the reader underline count or reader comment count corresponding to each of the multiple chapters of the text to be processed, and determine ranking thresholds according to the per-chapter reader underline counts or reader comment counts. The ranking thresholds may include a confidence threshold and tier thresholds. The distribution of the reader underline counts or reader comment counts over the chapters may be computed. For example, the average, highest, and lowest values of the per-chapter reader underline count or reader comment count are determined; the average is used as the confidence threshold, and the tier thresholds are determined from the highest and lowest values. For example, if the highest value is 1000 and the lowest value is 100, the tier threshold of the first popularity tier may be set to 800, that of the second popularity tier to 500, and that of the third popularity tier to 100. Determining the confidence threshold and tier thresholds heuristically allows books with different shelf times and different sales popularity to be treated differently, which improves the accuracy of the excellence ranking.
(2) Determine the excellence ranking according to the ranking thresholds and the summary fine-ranking model. The multiple candidate highlight segments may be classified, according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, into trusted highlight segments and untrusted highlight segments. One possible implementation is to treat candidate highlight segments whose reader comment count or reader underline count is greater than the confidence threshold as trusted highlight segments, and those whose count is not greater than the confidence threshold as untrusted highlight segments. On this basis, on the one hand, for the trusted highlight segments, the popularity tier of each trusted highlight segment may first be determined according to the tier thresholds: when the reader comment count or reader underline count of a trusted highlight segment is higher than the tier threshold of a given popularity tier, the segment is determined to belong to that tier. The popularity tiers are divided into multiple levels, such as level 1, level 2, ..., level n; the specific number of tiers may be determined according to the application scenario and user requirements. The excellence ranking of the trusted highlight segments is then determined according to the popularity tiers and the summary fine-ranking model. A trusted highlight segment in a higher popularity tier ranks above one in a lower tier, while the excellence ranking among multiple trusted highlight segments in the same popularity tier is determined by the summary fine-ranking model: the trusted highlight segments may be input into the summary fine-ranking model in pairs, so that the relative excellence of every two trusted highlight segments is determined first and the excellence ranking among these trusted highlight segments is derived from it.
For example, the trusted highlight segments include A, B, C, D, and E, where the first popularity tier includes A, B, and C and the second popularity tier includes D and E. A+B, A+C, and B+C are input into the summary fine-ranking model in turn; the result is that A is more excellent than B and C, and B is less excellent than C, so the excellence rankings of A, B, and C are 1, 3, and 2 respectively. Similarly, the excellence rankings of D and E are 2 and 1. The overall ranking of A, B, C, D, and E from high to low is therefore A, C, B, E, D.
On the other hand, for an untrusted highlight segment, its predicted excellence ranking among the at least one trusted highlight segment of each popularity tier may be determined first. For example, if the trusted highlight segments of the first popularity tier include A, B, and C, then an untrusted highlight segment G is assumed to belong to the first popularity tier and takes part in the within-tier ranking of A, B, and C. If the ranking of A, B, C, and G obtained from the summary fine-ranking model is B, A, G, C from high to low, the predicted excellence ranking of G in the first popularity tier is 3. The excellence ranking of the untrusted highlight segment is then determined from the predicted rankings: the average of the predicted excellence rankings over the multiple popularity tiers may be determined first, and the excellence ranking of the untrusted highlight segment is then determined from that average ranking.
It should be noted that, as shown in Fig. 4, if there are no tiered trusted candidate highlight segments, that is, no trusted highlight segments exist, the summary fine-ranking model is used directly to determine the excellence ranking among the untrusted highlight segments.
In this embodiment of the present invention, a training text is first obtained and its training information is determined. The training information includes multiple paragraphs of the training text, type information of the training text, and behavioural features of readers of the training text. The highlight segment of each of the multiple paragraphs is then determined according to the behavioural features. Next, the type information, the multiple paragraphs and the highlight segments are input into a first model to be trained, yielding the summary coarse-ranking model. Training data for a second model to be trained is then determined according to the summary coarse-ranking model, and that training data is input into the second model to be trained, yielding the summary fine-ranking model. Finally, the target content of the text to be processed is determined according to the summary coarse-ranking model and the summary fine-ranking model. All supervision data, whether for the supervised summary coarse-ranking model or for the semi-supervised summary fine-ranking model, is constructed from user behavioural features in popular books, so the highlight content of long texts such as books can be extracted automatically without any additional manually labelled data. In addition, the summary coarse-ranking model predicts the highlight segment within a paragraph using a machine-reading-comprehension approach, which helps ensure the accuracy of highlight segment and content extraction.
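As an illustration of how supervision can come from reader behaviour alone, the following Python sketch builds training examples of the form (type information, paragraph, highlight segment). It is a sketch under assumptions, not the patented procedure: it supposes that reader counts are available per sentence and simply takes the most-marked sentence of each paragraph as its highlight segment. `TrainingExample` and `build_examples` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    book_type: str   # type information of the training text
    paragraph: str   # full paragraph text
    highlight: str   # segment readers marked most often


def build_examples(book_type, paragraphs):
    """Build examples from paragraphs given as lists of (sentence, reader_count)."""
    examples = []
    for sentences in paragraphs:
        best_sentence, best_count = max(sentences, key=lambda item: item[1])
        if best_count == 0:
            continue  # no reader signal, so no label for this paragraph
        paragraph_text = " ".join(text for text, _ in sentences)
        examples.append(TrainingExample(book_type, paragraph_text, best_sentence))
    return examples


sample_paragraphs = [
    [("It was a quiet morning.", 2), ("Nothing lasts, but nothing is lost.", 57)],
    [("He walked home.", 0), ("It rained.", 0)],
]
examples = build_examples("fiction", sample_paragraphs)
print(examples)
```

Paragraphs without any reader marks are skipped here, which keeps unlabelled text out of the training set; whether the original method does the same is not stated.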
Refer to Fig. 7, which is a schematic structural diagram of a target content extraction device provided by an embodiment of the present invention. As shown, the device may include at least one processor 701, at least one communication interface 702, at least one memory 703 and at least one communication bus 704.
The processor 701 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules and circuits described in connection with the disclosure of the present invention. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The communication bus 704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is drawn in Fig. 7, but this does not mean that there is only one bus or one type of bus. The communication bus 704 is used to implement connection and communication between these components. The communication interface 702 of the device in this embodiment of the present invention is used to exchange signaling or data with other node devices. The memory 703 may include random access memory, for example non-volatile random access memory (Nonvolatile Random Access Memory, NVRAM), phase-change random access memory (Phase Change RAM, PRAM) or magnetoresistive random access memory (Magnetoresistive RAM, MRAM), and may also include other non-volatile memory, for example at least one magnetic disk storage device, electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash memory devices such as NOR flash memory or NAND flash memory, and semiconductor devices such as a solid state disk (Solid State Disk, SSD). Optionally, the memory 703 may also be at least one storage device located remotely from the processor 701. A set of program code is stored in the memory 703, and the processor 701 executes the program in the memory 703 to perform the following operations:
obtaining a first training text, the first training text being a long text whose length exceeds a preset threshold;
determining training information of the first training text, the training information including multiple paragraphs of the first training text, type information of the first training text, and behavioural features of readers of the first training text;
determining, according to the behavioural features, the highlight segment of each of the multiple paragraphs;
inputting the type information, the multiple paragraphs and the highlight segments into a first model to be trained for training, to obtain a summary coarse-ranking model;
determining, according to the summary coarse-ranking model, the target content of a text to be processed.
Optionally, the processor 701 is further configured to perform the following operations:
determining, according to the summary coarse-ranking model, training data for a second model to be trained;
inputting the training data into the second model to be trained for training, to obtain a summary fine-ranking model;
determining the target content according to the summary coarse-ranking model and the summary fine-ranking model.
Optionally, the processor 701 is further configured to perform the following operations:
inputting the text to be processed into the summary coarse-ranking model to obtain multiple candidate highlight segments;
determining, according to the summary fine-ranking model, the excellence ranking of each of the multiple candidate highlight segments;
determining, according to the excellence ranking, a target highlight segment among the multiple candidate highlight segments, the target content including the target highlight segment.
Optionally, the behavioural features include a reader comment count or a reader underline count;
the processor 701 is further configured to perform the following operations:
obtaining a second training text, the second training text being a long text whose length exceeds the preset threshold;
determining, according to the summary coarse-ranking model, multiple highlight segments in the second training text;
combining the multiple highlight segments in pairs to obtain the training data;
determining a classification label for the training data according to the reader underline counts or reader comment counts of the two highlight segments contained in the training data;
inputting the training data and the classification label into the second model to be trained for training, to obtain the summary fine-ranking model.
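The construction of pairwise training data described in these steps can be sketched in Python as follows. The labelling convention is an assumption: label 1 means the first segment of the pair has the larger reader count and label 0 means the second does, and pairs with equal counts are skipped because they give no clear preference. `build_pairwise_data` and the sample counts are hypothetical.

```python
from itertools import combinations


def build_pairwise_data(reader_counts):
    """Turn {segment: reader count} into a list of ((first, second), label)."""
    data = []
    for first, second in combinations(reader_counts, 2):
        if reader_counts[first] == reader_counts[second]:
            continue  # a tie carries no preference signal (assumed handling)
        label = 1 if reader_counts[first] > reader_counts[second] else 0
        data.append(((first, second), label))
    return data


# Hypothetical reader underline counts for three coarse-ranked segments.
sample_counts = {"s1": 120, "s2": 15, "s3": 40}
pairs = build_pairwise_data(sample_counts)
print(pairs)
```

Because every label is derived from reader counts, no manual annotation is needed for this stage, which matches the claim that all supervision comes from user behaviour.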
Optionally, the text to be processed includes multiple chapters;
the processor 701 is further configured to perform the following operations:
determining the reader underline count or reader comment count corresponding to each of the multiple chapters;
determining ranking thresholds according to the reader underline count or reader comment count corresponding to each chapter, the ranking thresholds including a confidence threshold and a tiering threshold;
determining the excellence ranking according to the ranking thresholds and the summary fine-ranking model.
Optionally, the processor 701 is further configured to perform the following operations:
classifying the multiple candidate highlight segments according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, to obtain trusted highlight segments and untrusted highlight segments;
determining the popularity tier of each trusted highlight segment according to the tiering threshold, and determining the excellence ranking of the trusted highlight segments according to the popularity tier and the summary fine-ranking model; and
determining a predicted excellence ranking of each untrusted highlight segment among the at least one trusted highlight segment corresponding to each popularity tier, and determining the excellence ranking of the untrusted highlight segment according to the predicted excellence ranking.
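The classification into trusted and untrusted segments and the assignment of popularity tiers can be sketched in Python as below. The thresholds are taken as given inputs, since the text only says they are derived from per-chapter reader counts. Treating the tiering threshold as a descending list of lower bounds is an assumption, and `classify_candidates` and the sample numbers are hypothetical.

```python
def classify_candidates(reader_counts, confidence_threshold, tier_lower_bounds):
    """Split candidates into tiered trusted segments and untrusted segments.

    reader_counts: {segment: reader underline or comment count}
    tier_lower_bounds: descending lower bounds, one per popularity tier
    """
    trusted_tiers = [[] for _ in tier_lower_bounds]
    untrusted = []
    for segment, count in reader_counts.items():
        if count < confidence_threshold:
            untrusted.append(segment)  # too little reader signal to trust
            continue
        for index, lower_bound in enumerate(tier_lower_bounds):
            if count >= lower_bound:
                trusted_tiers[index].append(segment)
                break
        else:
            # Trusted but below every bound: place it in the lowest tier.
            trusted_tiers[-1].append(segment)
    return trusted_tiers, untrusted


sample_counts = {"A": 500, "B": 300, "C": 320, "D": 80, "E": 60, "G": 5}
tiers, untrusted = classify_candidates(sample_counts, 50, [200, 50])
print(tiers, untrusted)
```

The trusted tiers produced here would then be ranked tier by tier with the pairwise model, and each untrusted segment would be placed using its predicted ranking within the tiers.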
Optionally, the processor 701 is further configured to perform the following operations:
determining the average of the predicted excellence rankings corresponding to the multiple popularity tiers;
determining, according to the average ranking, the excellence ranking of the untrusted highlight segment.
Optionally, the processor 701 is further configured to perform the following operations:
determining that the excellence ranking of trusted highlight segments in a higher popularity tier is higher than that of trusted highlight segments in a lower popularity tier; and
determining, according to the summary fine-ranking model, the excellence ranking among trusted highlight segments in the same popularity tier.
Optionally, the processor 701 is further configured to perform the following operations:
displaying recommendation information, the recommendation information including the target content and being used to recommend the text to be processed to a user.
Further, the processor may also cooperate with the memory and the communication interface to perform the operations of the target content extraction apparatus in the foregoing embodiments of the invention.
The above embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (for example coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (for example infrared, radio or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, hard disk or magnetic tape), an optical medium (for example a DVD) or a semiconductor medium (for example a solid state disk (Solid State Disk, SSD)).
The specific embodiments described above explain the purpose, technical solutions and beneficial effects of the present invention in further detail. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for extracting target content, characterized in that the method comprises:
obtaining a first training text, the first training text being a long text whose length exceeds a preset threshold;
determining training information of the first training text, the training information including multiple paragraphs of the first training text, type information of the first training text, and behavioural features of readers of the first training text;
determining, according to the behavioural features, the highlight segment of each of the multiple paragraphs;
inputting the type information, the multiple paragraphs and the highlight segments into a first model to be trained for training, to obtain a summary coarse-ranking model;
determining, according to the summary coarse-ranking model, the target content of a text to be processed.
2. The method according to claim 1, characterized in that determining, according to the summary coarse-ranking model, the target content of the text to be processed comprises:
determining, according to the summary coarse-ranking model, training data for a second model to be trained;
inputting the training data into the second model to be trained for training, to obtain a summary fine-ranking model;
determining the target content according to the summary coarse-ranking model and the summary fine-ranking model.
3. The method according to claim 2, characterized in that determining the target content according to the summary coarse-ranking model and the summary fine-ranking model comprises:
inputting the text to be processed into the summary coarse-ranking model to obtain multiple candidate highlight segments;
determining, according to the summary fine-ranking model, the excellence ranking of each of the multiple candidate highlight segments;
determining, according to the excellence ranking, a target highlight segment among the multiple candidate highlight segments, the target content including the target highlight segment.
4. The method according to claim 2, characterized in that the behavioural features include a reader comment count or a reader underline count;
determining, according to the summary coarse-ranking model, the training data for the second model to be trained comprises:
obtaining a second training text, the second training text being a long text whose length exceeds the preset threshold;
determining, according to the summary coarse-ranking model, multiple highlight segments in the second training text;
combining the multiple highlight segments in pairs to obtain the training data;
inputting the training data into the second model to be trained for training, to obtain the summary fine-ranking model, comprises:
determining a classification label for the training data according to the reader underline counts or reader comment counts of the two highlight segments contained in the training data;
inputting the training data and the classification label into the second model to be trained for training, to obtain the summary fine-ranking model.
5. The method according to claim 3, characterized in that the text to be processed includes multiple chapters;
determining, according to the summary fine-ranking model, the excellence ranking of each of the multiple candidate highlight segments comprises:
determining the reader underline count or reader comment count corresponding to each of the multiple chapters;
determining ranking thresholds according to the reader underline count or reader comment count corresponding to each chapter, the ranking thresholds including a confidence threshold and a tiering threshold;
determining the excellence ranking according to the ranking thresholds and the summary fine-ranking model.
6. The method according to claim 5, characterized in that determining the excellence ranking according to the ranking thresholds and the summary fine-ranking model comprises:
classifying the multiple candidate highlight segments according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, to obtain trusted highlight segments and untrusted highlight segments;
determining the popularity tier of each trusted highlight segment according to the tiering threshold, and determining the excellence ranking of the trusted highlight segments according to the popularity tier and the summary fine-ranking model; and
determining a predicted excellence ranking of each untrusted highlight segment among the at least one trusted highlight segment corresponding to each popularity tier, and determining the excellence ranking of the untrusted highlight segment according to the predicted excellence ranking.
7. The method according to claim 6, characterized in that determining the excellence ranking of the untrusted highlight segment according to the predicted excellence ranking comprises:
determining the average of the predicted excellence rankings corresponding to the multiple popularity tiers;
determining, according to the average ranking, the excellence ranking of the untrusted highlight segment.
8. The method according to claim 6, characterized in that determining the excellence ranking of the trusted highlight segments according to the popularity tier and the summary fine-ranking model comprises:
determining that the excellence ranking of trusted highlight segments in a higher popularity tier is higher than that of trusted highlight segments in a lower popularity tier; and
determining, according to the summary fine-ranking model, the excellence ranking among trusted highlight segments in the same popularity tier.
9. The method according to claim 1, characterized in that, after determining the target content of the text to be processed according to the summary coarse-ranking model, the method further comprises:
displaying recommendation information, the recommendation information including the target content and being used to recommend the text to be processed to a user.
10. An apparatus for extracting target content, characterized in that the apparatus comprises:
a sample collection module, configured to obtain a first training text, the first training text being a long text whose length exceeds a preset threshold;
an information determination module, configured to determine training information of the first training text, the training information including multiple paragraphs of the first training text, type information of the first training text, and behavioural features of readers of the first training text;
the information determination module being further configured to determine, according to the behavioural features, the highlight segment of each of the multiple paragraphs;
a model training module, configured to input the type information and the highlight segments into a first model to be trained for training, to obtain a summary coarse-ranking model;
a text summarization module, configured to determine, according to the summary coarse-ranking model, the target content of a text to be processed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910716302.1A CN110427482B (en) | 2019-07-31 | 2019-07-31 | Target content extraction method and related equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910716302.1A CN110427482B (en) | 2019-07-31 | 2019-07-31 | Target content extraction method and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110427482A true CN110427482A (en) | 2019-11-08 |
CN110427482B CN110427482B (en) | 2024-07-23 |
Family
ID=68414062
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910716302.1A Active CN110427482B (en) | 2019-07-31 | 2019-07-31 | Target content extraction method and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110427482B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110427482B (en) | 2024-07-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||