CN110427482A - Method and related device for extracting target content - Google Patents

Method and related device for extracting target content

Info

Publication number
CN110427482A
Authority
CN
China
Prior art keywords
highlight segment
training
text
abstract
ranking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910716302.1A
Other languages
Chinese (zh)
Other versions
CN110427482B (en)
Inventor
童国烽
譚翊章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910716302.1A
Publication of CN110427482A
Application granted
Publication of CN110427482B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention discloses a method and related device for extracting target content. The method comprises: first obtaining a training text; then determining training information of the training text; next determining, according to the training information, a highlight segment of each of a plurality of paragraphs; training a first model to be trained according to the training information and the highlight segments to obtain a summary coarse-extraction model; then determining, according to the summary coarse-extraction model, training data for a second model to be trained, so as to train a summary fine-ranking model; and finally determining the target content of a text to be processed according to the summary coarse-extraction model and the summary fine-ranking model. Embodiments of the present invention enable automatic extraction of highlight content from books or long texts.

Description

Method and related device for extracting target content
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and related device for extracting target content.
Background
With the rapid development of Internet technology, people receive massive amounts of information every day. To obtain the information they need quickly from this mass, summary/highlight extraction has become an active research topic. Summary/highlight extraction methods currently in use include: (1) the unsupervised TextRank algorithm, the most common in industry, which is essentially a graph-based ranking algorithm; and (2) classical supervised extractive summarization algorithms such as the SummaRuNNer model, whose main idea is to cast extractive summarization as a sequence labeling task over sentences. However, first, unsupervised algorithms represented by TextRank can only take into account shallow semantic information between sentences, cannot avoid information redundancy between sentences in the generated summary, and cannot make full use of surface features such as reader behavior features. Second, supervised summarization models represented by SummaRuNNer do not make full use of pre-training information and cannot guarantee that the generated summary consists of mutually related segments. Third, neither class of method can directly use the type information of a text, and neither can be transferred directly to the task of extracting highlight content from long texts such as books.
Summary of the invention
The present invention provides a method and related device for extracting target content, which enable automatic extraction of highlight content from books or long texts.
In a first aspect, an embodiment of the present invention provides a method for extracting target content, comprising:
Obtaining a first training text, the first training text being a long text whose length exceeds a preset threshold;
Determining training information of the first training text, the training information comprising a plurality of paragraphs in the first training text, type information of the first training text, and behavior features of readers of the first training text;
Determining a highlight segment of each of the plurality of paragraphs according to the behavior features;
Inputting the type information, the plurality of paragraphs, and the highlight segments into a first model to be trained for training, to obtain a summary coarse-extraction model;
Determining the target content of a text to be processed according to the summary coarse-extraction model.
Wherein, determining the target content of the text to be processed according to the summary coarse-extraction model comprises:
Determining training data for a second model to be trained according to the summary coarse-extraction model;
Inputting the training data into the second model to be trained for training, to obtain a summary fine-ranking model;
Determining the target content according to the summary coarse-extraction model and the summary fine-ranking model.
Wherein, determining the target content according to the summary coarse-extraction model and the summary fine-ranking model comprises:
Inputting the text to be processed into the summary coarse-extraction model to obtain a plurality of candidate highlight segments;
Determining a highlight ranking of each of the plurality of candidate highlight segments according to the summary fine-ranking model;
Determining a target highlight segment among the plurality of candidate highlight segments according to the highlight ranking, the target content comprising the target highlight segment.
Wherein, the behavior features comprise a reader comment count or a reader underline count;
Determining the training data for the second model to be trained according to the summary coarse-extraction model comprises:
Obtaining a second training text, the second training text being a long text whose length exceeds the preset threshold;
Determining a plurality of highlight segments in the second training text according to the summary coarse-extraction model;
Combining the plurality of highlight segments in pairs to obtain the training data;
Inputting the training data into the second model to be trained for training, to obtain the summary fine-ranking model, comprises:
Determining a classification label of the training data according to the reader underline counts or reader comment counts of the two highlight segments contained in the training data;
Inputting the training data and the classification label into the second model to be trained for training, to obtain the summary fine-ranking model.
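The labeling step above can be sketched as follows. The text only says that the label is determined from the two segments' reader counts; the concrete rule used here (label 1 when the first segment has the larger count, otherwise 0) and the names `label_pair` and `labeled_examples` are assumptions made for illustration.

```python
def label_pair(count_a, count_b):
    """Assumed rule: 1 if the first segment has more reader marks, else 0."""
    return 1 if count_a > count_b else 0


def labeled_examples(pairs, counts):
    """Attach a classification label to every (segment_a, segment_b) pair.

    `counts` maps a segment to its reader underline (or comment) count.
    """
    return [((a, b), label_pair(counts[a], counts[b])) for a, b in pairs]


# Hypothetical segments and counts, for illustration only.
counts = {"A": 120, "B": 35, "C": 35}
examples = labeled_examples([("A", "B"), ("B", "A"), ("B", "C")], counts)
for pair, label in examples:
    print(pair, label)
```

A real pairwise ranker might drop or specially treat ties such as ("B", "C"); the sketch simply labels them 0.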
Wherein, the text to be processed comprises a plurality of chapters;
Determining the highlight ranking of each of the plurality of candidate highlight segments according to the summary fine-ranking model comprises:
Determining the reader underline count or reader comment count corresponding to each of the plurality of chapters;
Determining ranking thresholds according to the reader underline count or reader comment count corresponding to each chapter, the ranking thresholds comprising a confidence threshold and a tier threshold;
Determining the highlight ranking according to the ranking thresholds and the summary fine-ranking model.
Wherein, determining the highlight ranking according to the ranking thresholds and the summary fine-ranking model comprises:
Classifying the plurality of candidate highlight segments according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, to obtain trusted highlight segments and untrusted highlight segments;
Determining a popularity tier of each trusted highlight segment according to the tier threshold, and determining the highlight ranking of the trusted highlight segments according to the popularity tier and the summary fine-ranking model; and
Determining a predicted highlight ranking of each untrusted highlight segment among the at least one trusted highlight segment corresponding to each popularity tier, and determining the highlight ranking of the untrusted highlight segment according to the predicted highlight ranking.
Wherein, determining the highlight ranking of the untrusted highlight segment according to the predicted highlight ranking comprises:
Determining the average of the predicted highlight rankings corresponding to the plurality of popularity tiers;
Determining the highlight ranking of the untrusted highlight segment according to the average ranking.
Wherein, determining the highlight ranking of the trusted highlight segments according to the popularity tier and the summary fine-ranking model comprises:
Determining that a trusted highlight segment with a higher popularity tier ranks above a trusted highlight segment with a lower popularity tier; and
Determining, according to the summary fine-ranking model, the highlight ranking among trusted highlight segments of the same popularity tier.
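The trusted-segment part of this ranking scheme can be sketched as below. Several details are assumptions made for illustration: a segment counts as trusted when its reader count is at least the confidence threshold, the tier thresholds are given as a plain list, and the fine-ranking model is replaced by a hypothetical scalar `model_score` callable (the text describes a pairwise model). The handling of untrusted segments is not covered by this sketch.

```python
from bisect import bisect_right


def split_by_confidence(candidates, confidence_threshold):
    """Separate candidates into trusted and untrusted by reader count."""
    trusted = [c for c in candidates if c["count"] >= confidence_threshold]
    untrusted = [c for c in candidates if c["count"] < confidence_threshold]
    return trusted, untrusted


def popularity_tier(count, tier_thresholds):
    """Tier 0 is the least popular; each threshold reached raises the tier."""
    return bisect_right(sorted(tier_thresholds), count)


def rank_trusted(trusted, tier_thresholds, model_score):
    """Higher tier first; ties inside a tier are broken by the model score."""
    return sorted(
        trusted,
        key=lambda c: (-popularity_tier(c["count"], tier_thresholds),
                       -model_score(c["text"])),
    )


# Hypothetical candidates and scores, for illustration only.
candidates = [
    {"text": "a", "count": 60},
    {"text": "b", "count": 12},
    {"text": "c", "count": 2},
    {"text": "d", "count": 55},
]
scores = {"a": 0.2, "b": 0.99, "d": 0.9}

trusted, untrusted = split_by_confidence(candidates, confidence_threshold=5)
ranking = rank_trusted(trusted, [10, 50], scores.get)
print([c["text"] for c in ranking])
```

Segment "b" has the best model score but sits in a lower tier, so it ranks below "d" and "a"; the model score only decides the order inside a tier.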
Wherein, after determining the target content of the text to be processed according to the summary coarse-extraction model, the method further comprises:
Displaying recommendation information, the recommendation information comprising the target content and being used to recommend the text to be processed to a user.
In a second aspect, an embodiment of the present invention provides an apparatus for extracting target content, comprising:
A sample collection module, configured to obtain a first training text, the first training text being a long text whose length exceeds a preset threshold;
An information determination module, configured to determine training information of the first training text, the training information comprising a plurality of paragraphs in the first training text, type information of the first training text, and behavior features of readers of the first training text;
The information determination module being further configured to determine a highlight segment of each of the plurality of paragraphs according to the behavior features;
A model training module, configured to input the type information and the highlight segments into a first model to be trained for training, to obtain a summary coarse-extraction model;
A text summarization module, configured to determine the target content of a text to be processed according to the summary coarse-extraction model.
Wherein, the model training module is further configured to:
Determine training data for a second model to be trained according to the summary coarse-extraction model;
Input the training data into the second model to be trained for training, to obtain a summary fine-ranking model;
The text summarization module is further configured to:
Determine the target content according to the summary coarse-extraction model and the summary fine-ranking model.
Wherein, the text summarization module is further configured to:
Input the text to be processed into the summary coarse-extraction model to obtain a plurality of candidate highlight segments;
Determine a highlight ranking of each of the plurality of candidate highlight segments according to the summary fine-ranking model;
Determine a target highlight segment among the plurality of candidate highlight segments according to the highlight ranking, the target content comprising the target highlight segment.
Wherein, the behavior features comprise a reader comment count or a reader underline count;
The sample collection module is further configured to:
Obtain a second training text, the second training text being a long text whose length exceeds the preset threshold;
The information determination module is further configured to:
Determine a plurality of highlight segments in the second training text according to the summary coarse-extraction model;
Combine the plurality of highlight segments in pairs to obtain the training data;
The model training module is further configured to:
Determine a classification label of the training data according to the reader underline counts or reader comment counts of the two highlight segments contained in the training data;
Input the training data and the classification label into the second model to be trained for training, to obtain the summary fine-ranking model.
Wherein, the text to be processed comprises a plurality of chapters;
The text summarization module is further configured to:
Determine the reader underline count or reader comment count corresponding to each of the plurality of chapters;
Determine ranking thresholds according to the reader underline count or reader comment count corresponding to each chapter, the ranking thresholds comprising a confidence threshold and a tier threshold;
Determine the highlight ranking according to the ranking thresholds and the summary fine-ranking model.
Wherein, the text summarization module is further configured to:
Classify the plurality of candidate highlight segments according to the confidence threshold and the reader comment count or reader underline count of each candidate highlight segment, to obtain trusted highlight segments and untrusted highlight segments;
Determine a popularity tier of each trusted highlight segment according to the tier threshold, and determine the highlight ranking of the trusted highlight segments according to the popularity tier and the summary fine-ranking model; and
Determine a predicted highlight ranking of each untrusted highlight segment among the at least one trusted highlight segment corresponding to each popularity tier, and determine the highlight ranking of the untrusted highlight segment according to the predicted highlight ranking.
Wherein, the text summarization module is further configured to:
Determine the average of the predicted highlight rankings corresponding to the plurality of popularity tiers;
Determine the highlight ranking of the untrusted highlight segment according to the average ranking.
Wherein, the text summarization module is further configured to:
Determine that a trusted highlight segment with a higher popularity tier ranks above a trusted highlight segment with a lower popularity tier; and
Determine, according to the summary fine-ranking model, the highlight ranking among trusted highlight segments of the same popularity tier.
Wherein, the apparatus further comprises a display module, configured to:
Display recommendation information, the recommendation information comprising the target content and being used to recommend the text to be processed to a user.
In a third aspect, an embodiment of the present invention provides a device for extracting target content, comprising a processor, a memory, and a communication bus, wherein the communication bus implements connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the method for extracting target content provided in the first aspect.
In one possible design, the entity recognition device provided by the present invention may comprise modules corresponding to the behaviors performed in the above method. The modules may be software and/or hardware.
Another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to perform the methods described in the above aspects.
Another aspect of the embodiments of the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the methods described in the above aspects.
In implementing the embodiments of the present invention, a training text is first obtained; the training information of the training text is then determined; next, according to the training information, a highlight segment of each of a plurality of paragraphs is determined; a first model to be trained is trained according to the training information and the highlight segments to obtain a summary coarse-extraction model; then, according to the summary coarse-extraction model, training data for a second model to be trained is determined so as to train a summary fine-ranking model; finally, the target content of a text to be processed is determined according to the summary coarse-extraction model and the summary fine-ranking model. This enables automatic extraction of highlight content from books or long texts.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the background more clearly, the accompanying drawings needed in the embodiments or the background are described below.
Fig. 1 is a schematic flowchart of a method for extracting target content provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a summary coarse-extraction model provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another method for extracting target content provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of highlight ranking provided by an embodiment of the present invention;
Fig. 5 is a schematic flowchart of a two-stage method for extracting target content provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an apparatus for extracting target content provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a device for extracting target content provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a method for extracting target content provided by an embodiment of the present invention. The method includes, but is not limited to, the following steps:
S101, obtain a training text, the training text being a long text whose length exceeds a preset threshold.
In a specific implementation, the training text may be a long text whose length exceeds a preset threshold, where the preset threshold may refer to a total character count (such as 10,000 characters) or to the total number of chapters/paragraphs of the training text. Multiple complete books may be obtained, without limitation, with each complete book used as one training text.
S102, determine training information of the training text, the training information comprising a plurality of paragraphs in the training text, type information of the training text, and behavior features of readers of the training text.
In a specific implementation, first, the training information may include a plurality of paragraphs in the training text. The training text (that is, one complete book) may first be split by chapter, and each chapter then split into paragraphs. If the length of a paragraph exceeds a threshold (such as 504 characters), the paragraph needs to be split again into two or more paragraphs whose length does not exceed the threshold. For example, book XXX contains 3 chapters: the 1st chapter contains 3 paragraphs shorter than 504 characters, the 2nd chapter contains 4 paragraphs shorter than 504 characters, and the 3rd chapter contains 2 paragraphs shorter than 504 characters and 1 paragraph of 896 characters. The 896-character paragraph therefore needs to be divided into two paragraphs of 504 and 392 characters respectively, which gives 3+4+2+2=11 paragraphs in the training text.
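The splitting rule and the worked example above can be sketched as follows. This is a minimal illustration, assuming length is measured in characters and that over-long paragraphs are cut at fixed offsets; the function names `split_paragraph` and `split_book` are hypothetical, and a real system would probably prefer to cut at sentence boundaries.

```python
def split_paragraph(text, max_len=504):
    """Cut text into consecutive pieces of at most max_len characters."""
    pieces = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    return pieces or [text]


def split_book(chapters, max_len=504):
    """Flatten a book (list of chapters, each a list of paragraphs)."""
    paragraphs = []
    for chapter in chapters:
        for paragraph in chapter:
            paragraphs.extend(split_paragraph(paragraph, max_len))
    return paragraphs


# Book "XXX" from the example: 3 + 4 + 2 short paragraphs and one of 896 chars.
book = [
    ["a" * 100] * 3,
    ["b" * 200] * 4,
    ["c" * 300] * 2 + ["d" * 896],
]
paragraphs = split_book(book)
print(len(paragraphs))                    # 11
print([len(p) for p in paragraphs[-2:]])  # [504, 392]
```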
Second, since books/texts of different types or fields differ greatly in style, the training information may also include the type information of the training text. The type information may be the book category assigned when the book was published, such as literature and art or suspense novel.
Third, the training information may also include behavior features of readers of the training text. Readers generally like to record the highlight content of a book by underlining or commenting, so a behavior feature may be the number of underlines or comments made by readers on each paragraph/chapter of the book, either within a period of time or since the book was published.
S103, determine a highlight segment of each of the plurality of paragraphs according to the behavior features.
In a specific implementation, the segment in each paragraph whose reader underline count or reader comment count exceeds a certain threshold may be, but is not limited to being, taken as the highlight segment of that paragraph. The threshold may be determined by comprehensive analysis of factors such as how long the book has been on the shelf and its click-through reading volume or sales. If the reader underline count or reader comment count of a paragraph is 0, one possible countermeasure is to set the highlight segment of that paragraph to "no answer".
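A minimal sketch of this selection step is given below. The `Span` record, the `NO_ANSWER` sentinel, and the choice to keep the most-marked span when several exceed the threshold are assumptions made for illustration; the text only requires that the chosen span's underline or comment count exceed the threshold.

```python
from dataclasses import dataclass

NO_ANSWER = "no answer"  # hypothetical sentinel for paragraphs without a highlight


@dataclass
class Span:
    text: str
    underline_count: int
    comment_count: int


def pick_highlight(spans, threshold):
    """Return the most-marked span whose count exceeds threshold, else NO_ANSWER."""
    best = None
    for span in spans:
        score = max(span.underline_count, span.comment_count)
        if score > threshold and (best is None or score > best[0]):
            best = (score, span.text)
    return best[1] if best else NO_ANSWER


spans = [Span("x", 3, 0), Span("y", 0, 12), Span("z", 8, 1)]
print(pick_highlight(spans, threshold=5))   # y
print(pick_highlight(spans, threshold=20))  # no answer
```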
S104, input the type information, the plurality of paragraphs, and the highlight segments into a model to be trained for training, to obtain a summary coarse-extraction model.
In a specific implementation, the type information (denoted A) may first be serialized to obtain A = {a_1, a_2, ..., a_n}. Then, for each paragraph, the paragraph (denoted Q) is serialized to obtain Q = {q_1, q_2, ..., q_n}, and A and Q are concatenated using the special tokens that the model to be trained uses to distinguish different kinds of sequences, yielding one group of model training data (denoted I). The model to be trained may be based on BERT (Bidirectional Encoder Representations from Transformers); in the BERT model, the special tokens used to delimit Q and A are [CLS] and [SEP], which gives
I = {[CLS]; A [SEP]; Q [SEP]}    (1)
Of course, the model to be trained may also be a SummaRuNNer model, but the model then needs to be adjusted during training. For example, resampling or down-sampling needs to be added to alleviate class imbalance.
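Formula (1) can be illustrated with a small sketch that builds the token sequence and the matching segment ids. Character-level tokenization and the helper names `build_input` and `segment_ids` are assumptions for illustration; a real BERT pipeline would use the model's own tokenizer and vocabulary.

```python
def build_input(type_info, paragraph):
    """Build I = [CLS] A [SEP] Q [SEP] with one token per character."""
    a = list(type_info)   # serialized type information A
    q = list(paragraph)   # serialized paragraph Q
    return ["[CLS]", *a, "[SEP]", *q, "[SEP]"]


def segment_ids(tokens):
    """0 for [CLS] A [SEP], 1 for Q [SEP], as in BERT sentence-pair inputs."""
    ids, segment = [], 0
    for token in tokens:
        ids.append(segment)
        if token == "[SEP]":
            segment = 1
    return ids


tokens = build_input("ab", "xyz")
print(tokens)
print(segment_ids(tokens))
```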
For example, Fig. 2 shows a summary coarse-extraction model obtained by training. The model performs feature mapping with a full self-attention mechanism and comprises an embedding layer, a BERT layer, and an extraction layer. The embedding layer can be further divided into a token embedding layer, a segment embedding layer, and a position embedding layer, which produce a vector representation of the input text along several dimensions. As shown in the figure, the input at the i-th position passes through the BERT layer to yield a hidden-layer vector T_i, where the i-th position is the position of the i-th token of the input; for example, in Fig. 2 the token " " is at the 3rd position and the token "only" is at the 7th position. An extraction layer then computes the start probability and end probability corresponding to each position, where the start probability is the probability that the position is the starting point of a highlight segment and the end probability is the probability that the position is the end point of a highlight segment. A feature transformation using the start matrix S and the end matrix E learned during model training gives the start probability (denoted P_1) and end probability (denoted P_2) of each position, where P_1 and P_2 at the i-th position can be computed according to formulas (2) and (3), respectively.
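Formulas (2) and (3) are not reproduced in this text. As an assumption only, and not as the patent's own formulas, a common form for such an extraction layer is the softmax over positions used in BERT-style span extraction, with S and E treated as learned vectors of the same dimension as T_i:

```latex
P_1^{(i)} = \frac{\exp(S \cdot T_i)}{\sum_{j} \exp(S \cdot T_j)},
\qquad
P_2^{(i)} = \frac{\exp(E \cdot T_i)}{\sum_{j} \exp(E \cdot T_j)}
```

Under this form, the start probabilities over all positions sum to 1, and likewise for the end probabilities.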
In practical applications, the final output of the summary coarse-extraction model is the text segment corresponding to the contiguous, valid interval with the highest start probability and end probability. For example, as shown in Fig. 2, for the paragraph beginning "Loneliness is ..." the summary coarse-extraction model outputs the highlight segment "Loneliness is a soul worth understanding that seeks understanding and cannot obtain it; it is tragic."
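Selecting the output interval can be sketched as a search over valid (start, end) pairs. The scoring rule (product of start and end probability), the optional `max_len` cap, and the name `best_span` are assumptions for illustration; the text only requires a contiguous, valid interval with maximal start and end probability.

```python
def best_span(start_probs, end_probs, max_len=None):
    """Return (start, end) with start <= end maximizing P_start * P_end."""
    best, best_score = None, -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, len(end_probs)):
            if max_len is not None and j - i + 1 > max_len:
                break
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


start_probs = [0.1, 0.7, 0.2]
end_probs = [0.6, 0.1, 0.3]
print(best_span(start_probs, end_probs))  # (1, 2)
```

In this toy input the most likely end position (index 0) precedes the most likely start position (index 1), so the pair is not a valid interval and the search falls back to the best valid pair.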
It should be noted that some training data containing no highlight segment may be deliberately added during model training, in order to train the summary coarse-extraction model's ability to judge whether a paragraph contains a highlight segment.
From the segments underlined or commented on by readers (the highlight segments), the summary coarse-extraction model can fully learn the N-gram features of "highlight segments". Because only the start and end probabilities are considered at output, the class imbalance problem that arises when training sequence-labeling methods is effectively avoided, which in turn improves the quality of the extracted highlight segments.
S105, determine the target content of a text to be processed according to the summary coarse-extraction model.
In a specific implementation, the text to be processed may be a book, or a document/text of any other length. First, the text to be processed may be split by paragraph to obtain multiple paragraphs; if the total length of a paragraph exceeds the threshold, it must be split a second time. Each resulting paragraph is then input into the summary coarse-extraction model, which determines whether the paragraph contains a highlight segment and outputs the corresponding highlight segment. If a paragraph contains no highlight segment, "no answer" is output; otherwise, the corresponding highlight segment is output. After the highlight segment of each paragraph is obtained, these segments may be, but are not limited to being, concatenated into the target content of the text to be processed. The target content is the highlight content of the text to be processed, such as content expressing the central idea, or content with exquisite wording or ornate language.
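The inference flow just described can be sketched as a loop over paragraphs. The coarse-extraction model is replaced here by a hypothetical callable `toy_model` that returns a highlight string or `None` for "no answer"; joining segments with a newline is also an assumption.

```python
def extract_target_content(paragraphs, coarse_model, sep="\n"):
    """Run the model on each paragraph and join the highlights it returns."""
    highlights = []
    for paragraph in paragraphs:
        highlight = coarse_model(paragraph)
        if highlight is not None:  # None stands for "no answer"
            highlights.append(highlight)
    return sep.join(highlights)


def toy_model(paragraph):
    """Stand-in model: the first sentence of any paragraph mentioning 'key'."""
    if "key" not in paragraph:
        return None
    return paragraph.split(".")[0] + "."


paragraphs = [
    "The key idea is simple. More detail follows.",
    "Filler text with nothing notable.",
    "Another key point. And the rest.",
]
print(extract_target_content(paragraphs, toy_model))
```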
Optionally, after the object content for determining text to be processed, recommendation information can be shown, which includes The object content of text (such as books) to be processed, for recommending the books to user.For example, in speed scene, by with Family shows the splendid contents of a book, can help long article anxiety or like the reader of " light read " rapidly " to skip " entirely Book/chapters and sections.
In embodiments of the present invention, training text is obtained first, which is that text size is more than preset threshold Long text;Then the training information of training text is determined, which includes multiple paragraphs, training text in training text Type information and training text reader behavioural characteristic;Secondly according to behavioural characteristic, each section is determined in multiple paragraphs The wonderful fallen;Then type information, multiple paragraphs and wonderful input first is trained to training pattern, is obtained It makes a summary thick climbing form type;According to thick climbing form type of making a summary, the object content of text to be processed is determined.It may be implemented for books or long article The automatic of this splendid contents extracts and by accounting for the type information of books wonderful/content can be improved The accuracy of extraction.
Fig. 3 is referred to, Fig. 3 is the process signal of the abstracting method of another object content provided in an embodiment of the present invention Figure, this method includes but is not limited to following steps:
S301, obtains training text, which is the long text that text size is more than preset threshold.This step with it is upper S101 in one embodiment is identical, this step repeats no more.
S302 determines the training information of training text, which includes multiple paragraphs, training text in training text The behavioural characteristic of the reader of type information originally and training text.This step is identical as the S102 in a upper embodiment, this step Suddenly it repeats no more.
S303: Determine, according to the behavioral features, a highlight segment of each of the multiple paragraphs. This step is the same as S103 in the previous embodiment and is not described again here.
S304: Input the type information, the multiple paragraphs, and the highlight segments into a first model to be trained, and train it to obtain a summary coarse-extraction model. This step is the same as S104 in the previous embodiment and is not described again here.
S305: Determine training data for a second model to be trained according to the summary coarse-extraction model.
In a specific implementation, a training text may be obtained first. The training text obtained in this step is usually different from the training text obtained in step S301, and it may likewise be a long text whose length exceeds the preset threshold, such as a complete book. The obtained training text is then divided into multiple paragraphs whose length does not exceed a threshold (for example, 504 words), and the paragraphs are input in turn into the summary coarse-extraction model to obtain the highlight segment of each paragraph. The second model to be trained may be, but is not limited to, a pairwise BERT model, so the highlight segments determined by the summary coarse-extraction model can be combined in pairs to form the training data of the second model to be trained. Here, pairwise combination means combining highlight segments that belong to the same paragraph in pairs, with each pair serving as one group of training data. Alternatively, paragraphs need not be distinguished, and the segments can be paired arbitrarily. For example, combining highlight segments A and B gives A+B, and A+B is one group of training data.
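The pairing step can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `build_pairs`, the `within_paragraph` switch, and the assumption that the extracted segments arrive grouped per paragraph as lists of strings are all hypothetical.

```python
from itertools import combinations

def build_pairs(segments_by_paragraph, within_paragraph=True):
    """Combine extracted segments two by two.

    segments_by_paragraph: list of lists of segment strings (assumed layout).
    within_paragraph=True pairs only segments from the same paragraph;
    False ignores paragraph boundaries and pairs all segments freely.
    """
    if within_paragraph:
        pairs = []
        for segments in segments_by_paragraph:
            pairs.extend(combinations(segments, 2))
        return pairs
    flat = [s for segments in segments_by_paragraph for s in segments]
    return list(combinations(flat, 2))
```

Each returned tuple corresponds to one group of training data such as A+B.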
S306: Input the training data into the second model to be trained and train it to obtain a summary fine-ranking model.
In a specific implementation, the reader comment count or reader underline count of each of the two highlight segments in each group of training data is determined first.
Then, the classification label of the group of training data is determined according to the reader comment counts or reader underline counts. In this embodiment of the present invention, the training data is divided into three classes, with labels 1, 0, and -1 respectively. Taking training data A+B as an example, as shown in formulas (4)-(6): 1) if the reader comment count or reader underline count of highlight segment A is greater than that of highlight segment B, A is considered more excellent than B, so the classification label of A+B is set to 1; 2) if the count of highlight segment A is equal to that of highlight segment B, A and B are considered equally excellent, so the classification label of A+B is set to 0; 3) if the count of highlight segment A is less than that of highlight segment B, B is considered more excellent than A, so the classification label of A+B is set to -1.
Label = 1 indicates Rank(A) > Rank(B)    (4)
Label = 0 indicates Rank(A) = Rank(B)    (5)
Label = -1 indicates Rank(A) < Rank(B)    (6)
Then, each group of training data and its classification label are input into the second model to be trained for training, yielding the summary fine-ranking model.
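The labeling rule of formulas (4)-(6) can be written out as a small sketch. The function names and the `counts` dictionary (mapping each segment to its reader comment count or reader underline count) are illustrative assumptions, not part of the original description.

```python
def pair_label(count_a, count_b):
    """Return 1, 0 or -1 following formulas (4)-(6)."""
    if count_a > count_b:
        return 1
    if count_a == count_b:
        return 0
    return -1

def label_pairs(pairs, counts):
    """Attach a label to every (A, B) pair.

    counts: hypothetical mapping from segment to its reader comment
    count or reader underline count.
    """
    return [(a, b, pair_label(counts[a], counts[b])) for a, b in pairs]
```

Each resulting triple (A, B, label) is one labeled training example for the pairwise model.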
S307: Determine the target content of the text to be processed according to the summary coarse-extraction model and the summary fine-ranking model.
In a specific implementation, the text to be processed may first be input into the summary coarse-extraction model to obtain multiple candidate highlight segments. The excellence ranking of each candidate highlight segment is then determined according to the summary fine-ranking model: the candidates can be combined in pairs and input into the summary fine-ranking model, so that the relative excellence of every two candidates is obtained first and the ranking among all candidates is derived from these pairwise results. Next, the target highlight segments among the candidates are determined according to the excellence ranking; for example, the top N candidates can be taken as the target highlight segments. Finally, the target highlight segments are combined to form the target content of the text to be processed.
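One way to turn pairwise comparisons into a ranking and a top-N selection is sketched below. The description does not say how pairwise results are aggregated, so summing the outcomes per candidate is an assumption, as is joining the selected segments in their original document order. `compare` is a hypothetical stand-in for the trained pairwise model and is expected to return 1, 0 or -1.

```python
from itertools import combinations

def rank_by_pairwise(candidates, compare):
    """Return candidate indices ordered from highest to lowest score.

    Aggregation by summed pairwise outcomes is an assumption; ties keep
    their original order because sorted() is stable.
    """
    score = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        outcome = compare(candidates[i], candidates[j])  # 1, 0 or -1
        score[i] += outcome
        score[j] -= outcome
    return sorted(range(len(candidates)), key=lambda i: -score[i])

def target_content(candidates, compare, top_n, sep="\n"):
    """Keep the top_n candidates and join them in document order (assumed)."""
    keep = sorted(rank_by_pairwise(candidates, compare)[:top_n])
    return sep.join(candidates[i] for i in keep)
```

In practice `compare` would wrap a call to the fine-ranking model; any callable with the same return convention can be used for testing.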
Because the training data of the summary fine-ranking model is constructed from reader comment counts or reader underline counts, the reader behavioral features must be left out of the model's input when the model is actually used, so that its output remains reliable. Therefore, in this embodiment of the present invention, the behavioral features are instead exploited through popularity tiering, which compensates for this limitation and improves the accuracy of the excellence ranking. As shown in Fig. 4, determining the excellence ranking of the candidate highlight segments mainly includes the following steps:
(1) Determine the reader underline count or reader comment count corresponding to each of the multiple chapters of the text to be processed, and determine ranking thresholds from these counts. The ranking thresholds may include a confidence threshold and tiering thresholds. The distribution of reader underline counts or reader comment counts across the chapters can be computed, for example the mean, maximum, and minimum of the counts over the chapters. The mean is then used as the confidence threshold, and the tiering thresholds are determined from the maximum and minimum. For example, if the maximum is 1000 and the minimum is 100, the tiering threshold of the first popularity tier may be set to 800, that of the second popularity tier to 500, and that of the third popularity tier to 100. Determining the confidence threshold and tiering thresholds heuristically in this way allows books with different shelf times and different sales popularity to be treated differently, which can improve the accuracy of the excellence ranking.
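A possible heuristic for step (1) is sketched below. The description only states that the mean serves as the confidence threshold and that the tiering thresholds are derived from the maximum and minimum; the specific rule used here (80% and 50% of the maximum, then the minimum) is an assumption chosen because it reproduces the 800/500/100 example, and `ranking_thresholds` is a hypothetical name.

```python
from statistics import mean

def ranking_thresholds(chapter_counts, tier_percentages=(80, 50)):
    """Derive a confidence threshold and tiering thresholds per book.

    chapter_counts: reader underline (or comment) count of each chapter.
    tier_percentages: assumed percentages of the maximum used for the
    upper tiers; the lowest tier uses the minimum count.
    """
    confidence = mean(chapter_counts)
    peak, low = max(chapter_counts), min(chapter_counts)
    tiers = [peak * p // 100 for p in tier_percentages] + [low]
    return confidence, tiers
```

Because the thresholds are computed from each book's own counts, a newly shelved book and a long-standing bestseller end up with different cut-offs.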
(2) Determine the excellence ranking according to the ranking thresholds and the summary fine-ranking model. The candidate highlight segments can be classified into credible and non-credible highlight segments according to the confidence threshold and each candidate's reader comment count or reader underline count. One possible implementation is: candidates whose count is greater than the confidence threshold are treated as credible highlight segments, and candidates whose count is not greater than the confidence threshold are treated as non-credible highlight segments. On this basis, on the one hand, for credible highlight segments, the popularity tier of each credible highlight segment can first be determined from the tiering thresholds: when the reader comment count or reader underline count of a credible highlight segment is higher than the tiering threshold of a given popularity tier, the segment is determined to belong to that tier. Popularity can be divided into multiple tiers, such as level 1, level 2, ..., level n; the specific number of tiers can be chosen according to the application scenario and user requirements. The excellence ranking of the credible highlight segments is then determined from the popularity tiers and the summary fine-ranking model: a credible highlight segment in a higher popularity tier ranks above one in a lower tier, while the ranking among credible highlight segments in the same tier is determined by the summary fine-ranking model. Specifically, these segments can be input into the summary fine-ranking model pair by pair, so that the relative excellence of every two credible highlight segments is determined first and the ranking among the credible highlight segments in that tier is derived from it.
For example: the credible highlight segments are A, B, C, D, and E, where the first popularity tier contains A, B, and C and the second popularity tier contains D and E. A+B, A+C, and B+C are input in turn into the summary fine-ranking model, which finds that the excellence of A is higher than that of both B and C and that the excellence of B is lower than that of C, so the excellence ranks of A, B, and C are 1, 3, and 2 respectively. Similarly, the excellence ranks of D and E are found to be 2 and 1. The overall ranking of A, B, C, D, and E from high to low is therefore A, C, B, E, D.
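The tier-then-pairwise ordering of this example can be sketched as follows. Two points are assumptions: a segment is placed in the highest tier whose threshold its count exceeds, and within a tier the pairwise outcomes are summed to produce an order. `compare` is again a hypothetical stand-in for the pairwise model, returning 1, 0 or -1.

```python
from itertools import combinations

def assign_tier(count, tier_thresholds):
    """Index of the highest tier whose threshold the count exceeds (0 = top)."""
    for tier, threshold in enumerate(tier_thresholds):
        if count > threshold:
            return tier
    return len(tier_thresholds)  # below every threshold

def rank_credible(segments, counts, tier_thresholds, compare):
    """Order segments by tier first, then by pairwise results within a tier."""
    tiers = {}
    for s in segments:
        tiers.setdefault(assign_tier(counts[s], tier_thresholds), []).append(s)
    ranking = []
    for tier in sorted(tiers):
        group = tiers[tier]
        score = {s: 0 for s in group}
        for a, b in combinations(group, 2):
            outcome = compare(a, b)
            score[a] += outcome
            score[b] -= outcome
        ranking.extend(sorted(group, key=lambda s: -score[s]))
    return ranking
```

With counts that place A, B, C in the first tier and D, E in the second, and a comparator that prefers A over C over B and E over D, the function returns A, C, B, E, D as in the example.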
On the other hand, for non-credible highlight segments, the predicted excellence rank of a non-credible highlight segment among the at least one credible highlight segment of each popularity tier can be determined first. For example, if the credible highlight segments of the first popularity tier are A, B, and C, then a non-credible highlight segment G is assumed to belong to the first popularity tier and takes part in the within-tier ranking of A, B, and C. If the summary fine-ranking model ranks A, B, C, and G from high to low as B, A, G, C, the predicted excellence rank of G in the first popularity tier is 3. The excellence ranking of the non-credible highlight segment is then determined from its predicted ranks: the average of the predicted ranks over the multiple popularity tiers can be computed first, and the excellence ranking of the non-credible highlight segment is determined from this average rank.
For example: the credible highlight segments of the first popularity tier are A, B, and C, and those of the second popularity tier are D and E. The predicted excellence ranks of non-credible highlight segment G in the first and second popularity tiers are 3 and 2, so its average predicted rank is 2.5. Because the overall ranking of credible highlight segments A, B, C, D, and E from high to low is A, C, B, E, D, the overall ranking of A, B, C, D, E, and G from high to low is A, C, G, B, E, D.
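The handling of a non-credible segment can be sketched as below. The description does not spell out how the average predicted rank maps to an overall position; treating it as a fractional position among the credible segments' overall ranks 1, 2, 3, ... is an assumption that reproduces the A, C, G, B, E, D example. Both function names are hypothetical, and `compare` stands in for the pairwise model.

```python
def predicted_rank(segment, tier_members, compare):
    """1-based rank the segment would get if it joined this tier."""
    return 1 + sum(1 for m in tier_members if compare(m, segment) > 0)

def insert_by_average_rank(credible_ranking, segment, predicted_ranks):
    """Place a non-credible segment using its average predicted rank.

    credible_ranking: credible segments from highest to lowest.
    predicted_ranks: the segment's predicted rank in each tier.
    """
    average = sum(predicted_ranks) / len(predicted_ranks)
    keyed = [(pos, s) for pos, s in enumerate(credible_ranking, start=1)]
    keyed.append((average, segment))
    keyed.sort(key=lambda item: item[0])  # stable: credible wins a tie
    return [s for _, s in keyed]
```

For G with predicted ranks 3 and 2, the average is 2.5, so G lands between the second and third credible segments.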
It should be noted that, as shown in Fig. 4, if no candidate highlight segment can be reliably tiered, that is, if there are no credible highlight segments, the summary fine-ranking model is used directly to determine the excellence ranking among the non-credible highlight segments.
In summary, as shown in Fig. 5, the highlight extraction method provided in this embodiment of the present invention includes two steps: summary coarse extraction and summary fine ranking. Summary coarse extraction uses a supervised summarization model to recall highlight segments from paragraphs, and summary fine ranking uses a semi-supervised ranking model to globally tune the results of coarse extraction, yielding the final highlights. The method in this embodiment of the present invention can be applied to a variety of practical scenarios and gives users a good experience. First, in a speed-reading scenario, a summary of the whole book can help readers who are put off by long texts to skim the whole book or a chapter. Second, in a recommendation scenario, the highlights extracted by the present invention can be used as a short recommendation blurb for a book, to attract users to click, read, or buy the book. Third, in a long-tail content mining scenario, such as launching a new book or promoting a niche book, the book can be presented to users through its extracted highlights, which addresses the cold-start problem of such books. Fourth, in a personalization scenario, because the content of a book is varied, the highlights extracted by the present invention can also be combined with user profiles to provide services such as personalized notifications.
In this embodiment of the present invention, a training text is first obtained and its training information is determined; the training information includes multiple paragraphs of the training text, type information of the training text, and behavioral features of readers of the training text. A highlight segment of each of the multiple paragraphs can then be determined according to the behavioral features. Next, the type information, the multiple paragraphs, and the highlight segments are input into the first model to be trained to obtain the summary coarse-extraction model. Then, training data for the second model to be trained is determined according to the summary coarse-extraction model, and the training data is input into the second model to be trained for training to obtain the summary fine-ranking model. Finally, the target content of the text to be processed is determined according to the summary coarse-extraction model and the summary fine-ranking model. All supervision data, for both the supervised summary coarse-extraction model and the semi-supervised summary fine-ranking model, is constructed from user behavioral features in popular books, so the highlights of long texts such as books can be extracted automatically without any additional manually labeled data. In addition, the summary coarse-extraction model innovatively predicts the highlight segment within a paragraph using a machine reading comprehension approach, which helps ensure the accuracy of highlight segment/content extraction.
The method of the embodiments of the present invention has been described above; the related devices of the embodiments of the present invention are described below.
Refer to Fig. 6, which is a schematic structural diagram of an apparatus for extracting target content according to an embodiment of the present invention. The apparatus may include:
Sample collection module 601, configured to obtain a first training text, the first training text being a long text whose length exceeds a preset threshold.
In a specific implementation, the first training text may be a long text whose length exceeds a preset threshold, where the preset threshold may refer to the total word count (for example, 10,000 words) or the total number of chapters/paragraphs of the training text. Multiple complete books may be, but need not be, obtained, with each complete book used as one training text.
Information determination module 602, configured to determine training information of the training text. The training information includes multiple paragraphs of the training text, type information of the training text, and behavioral features of readers of the training text.
In a specific implementation, in a first aspect, the training information may include multiple paragraphs of the training text. The training text (that is, a complete book) may first be split by chapter, and each chapter then split into multiple paragraphs. If the word count of a paragraph exceeds a threshold (for example, 504), the paragraph needs to be split again into two or more paragraphs whose word counts do not exceed the threshold.
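A minimal sketch of the paragraph splitting is given below. It assumes that a chapter arrives as a single string with newline-separated paragraphs and that length is measured in characters; both are illustrative choices, and a tokenizer-based length could be substituted. The name `split_paragraphs` is hypothetical.

```python
def split_paragraphs(chapter_text, max_len=504):
    """Split a chapter into paragraphs no longer than max_len.

    Paragraphs are taken to be newline-separated (assumption). A paragraph
    longer than max_len characters is cut into consecutive pieces of at
    most max_len characters each.
    """
    pieces = []
    for paragraph in chapter_text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip blank lines
        for start in range(0, len(paragraph), max_len):
            pieces.append(paragraph[start:start + max_len])
    return pieces
```

Cutting at a fixed offset can split a sentence in half; a production version would more likely cut at the nearest sentence boundary below the limit.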
In a second aspect, because books/texts of different types or fields differ greatly in style, the training information may also include the type information of the training text. The type information may be the book category assigned when the book is published, such as literature and art or suspense fiction.
In a third aspect, the training information may also include the behavioral features of readers of the training text. Readers generally like to record the highlights of a book by underlining or commenting, so the behavioral features may be the number of underlines or comments made by readers on each paragraph/chapter of the book, either within a period of time or since the book was published.
Information determination module 602 is further configured to determine a highlight segment of each of the multiple paragraphs according to the behavioral features.
In a specific implementation, the segment in each paragraph whose reader underline count or reader comment count exceeds a certain threshold may be, but need not be, taken as the highlight segment of that paragraph. The threshold can be determined by comprehensive analysis of factors such as the book's shelf time and its click/read counts or sales volume. If the reader underline count or reader comment count of a paragraph is 0, one possible approach is to set the highlight segment of that paragraph to "no answer".
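The selection rule can be sketched as follows. The input layout (a list of `(text, count)` pairs per paragraph), the choice of the most-marked segment when several exceed the threshold, and the use of "no answer" whenever nothing exceeds the threshold are assumptions made for illustration.

```python
NO_ANSWER = "no answer"

def highlight_of(segments, threshold):
    """Pick the highlight segment of one paragraph.

    segments: list of (text, count) pairs, where count is the reader
    underline or comment count of that segment (assumed layout).
    Returns NO_ANSWER when no segment exceeds the threshold, which
    covers the zero-count case described in the text.
    """
    hot = [(text, count) for text, count in segments if count > threshold]
    if not hot:
        return NO_ANSWER
    return max(hot, key=lambda item: item[1])[0]
```

The returned text (or the "no answer" marker) then serves as the training target for that paragraph.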
Model training module 603, configured to input the type information, the multiple paragraphs, and the highlight segments into the model to be trained for training, to obtain the summary coarse-extraction model.
In a specific implementation, the type information (denoted A) may first be serialized to obtain A = {a_1, a_2, ..., a_n}. Then, for each paragraph, the paragraph (denoted Q) is serialized to obtain Q = {q_1, q_2, ..., q_n}, and A and Q are concatenated using the special tokens that the model to be trained uses to separate different kinds of sequences, giving one group of model training data (denoted I). The model to be trained may be a BERT-based model; in a BERT model, the special tokens used to separate Q and A are CLS and SEP, which gives
I = {[CLS]; A [SEP]; Q [SEP]}    (7)
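The construction of I in formula (7) can be illustrated with plain token lists. This sketch does not call any tokenizer library; `build_input` is a hypothetical helper, and the segment ids (0 for the type-information part, 1 for the paragraph part) follow the usual BERT sentence-pair convention, which the description does not mention explicitly.

```python
def build_input(type_tokens, paragraph_tokens):
    """Assemble [CLS] A [SEP] Q [SEP] as in formula (7).

    type_tokens: serialized type information A (list of tokens).
    paragraph_tokens: serialized paragraph Q (list of tokens).
    Returns the token sequence and matching segment ids.
    """
    tokens = ["[CLS]"] + type_tokens + ["[SEP]"] + paragraph_tokens + ["[SEP]"]
    segment_ids = [0] * (len(type_tokens) + 2) + [1] * (len(paragraph_tokens) + 1)
    return tokens, segment_ids
```

Placing the type information in the first segment lets the model condition its span prediction on the book category in the same way a reading comprehension model conditions on a question.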
Of course, the model to be trained may also be a SummaRuNNer model, but that model then needs to be adjusted for training. For example, resampling or downsampling needs to be added to alleviate class imbalance.
In practical applications, the final output of the summary coarse-extraction model is the text segment corresponding to the contiguous, valid span with the highest start and end probabilities.
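Selecting such a span from per-token probabilities can be sketched as follows. Scoring a span by the product of its start and end probabilities and the optional `max_len` limit are assumptions; the description only requires that the span be contiguous and valid, meaning the end must not precede the start.

```python
def best_span(start_probs, end_probs, max_len=None):
    """Return ((start, end), score) for the best valid span.

    A span is valid when start <= end and, if max_len is given, it
    covers at most max_len tokens. The score is the product of the
    start and end probabilities (assumed scoring rule).
    """
    best, best_score = None, -1.0
    for i, p_start in enumerate(start_probs):
        last = len(end_probs) if max_len is None else min(len(end_probs), i + max_len)
        for j in range(i, last):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score
```

Note that simply taking the most probable start and the most probable end independently can yield an end before the start, which is why the search is restricted to valid pairs.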
It should be noted that some training data containing no highlight segment may deliberately be added during model training, to train the ability of the summary coarse-extraction model to judge whether a paragraph contains a highlight segment.
Text summary module 604, configured to determine the target content of the text to be processed according to the summary coarse-extraction model.
In a specific implementation, the text to be processed may be a book or a document/text of any other length. First, the text to be processed can be split by paragraph to obtain multiple paragraphs; if the total word count of a paragraph exceeds the preset threshold, that paragraph needs to be split a second time. Each resulting paragraph is then input into the summary coarse-extraction model to determine whether it contains a highlight segment and to output the corresponding highlight segment: if a paragraph contains no highlight, "no answer" is output; otherwise, the corresponding highlight segment is output. After the highlight segment of each paragraph has been obtained, the highlight segments may be, but need not be, concatenated to form the target content of the text to be processed.
Optionally, the apparatus in this embodiment of the present invention may further include a display module, configured to display recommendation information after the target content of the text to be processed is determined. The recommendation information includes the highlights of the text to be processed (for example, a book) and is used to recommend the book to a user.
Optionally, sample collection module 601 is further configured to obtain a second training text. The second training text is usually different from the first training text, and it may likewise be a long text whose length exceeds the preset threshold, such as a complete book.
Optionally, model training module 603 is further configured to determine training data for the second model to be trained according to the summary coarse-extraction model. The obtained training text may first be divided into multiple paragraphs whose length does not exceed a threshold (for example, 504 words), and the paragraphs are input in turn into the summary coarse-extraction model to obtain the highlight segment of each paragraph. The second model to be trained may be, but is not limited to, a pairwise BERT model, so the highlight segments determined by the summary coarse-extraction model can be combined in pairs to form the training data of the second model to be trained. Here, pairwise combination means combining highlight segments that belong to the same paragraph in pairs, with each pair serving as one group of training data. Alternatively, paragraphs need not be distinguished, and the segments can be paired arbitrarily.
Optionally, model training module 603 is further configured to input the training data into the second model to be trained for training, to obtain the summary fine-ranking model. The reader comment count or reader underline count of each of the two highlight segments in each group of training data may be determined first. Then, the classification label of the group of training data is determined according to the reader comment counts or reader underline counts. In this embodiment of the present invention, the training data is divided into three classes, with labels 1, 0, and -1 respectively. Taking training data A+B as an example, as shown in formulas (4)-(6): 1) if the reader comment count or reader underline count of highlight segment A is greater than that of highlight segment B, A is considered more excellent than B, so the classification label of A+B is set to 1; 2) if the count of highlight segment A is equal to that of highlight segment B, A and B are considered equally excellent, so the classification label of A+B is set to 0; 3) if the count of highlight segment A is less than that of highlight segment B, B is considered more excellent than A, so the classification label of A+B is set to -1. Then, each group of training data and its classification label are input into the second model to be trained for training, yielding the summary fine-ranking model.
Text summary module 604 is further configured to determine the target content of the text to be processed according to the summary coarse-extraction model and the summary fine-ranking model.
In a specific implementation, the text to be processed may first be input into the summary coarse-extraction model to obtain multiple candidate highlight segments. The excellence ranking of each candidate highlight segment is then determined according to the summary fine-ranking model: the candidates can be combined in pairs and input into the summary fine-ranking model, so that the relative excellence of every two candidates is obtained first and the ranking among all candidates is derived from these pairwise results. Next, the target highlight segments among the candidates are determined according to the excellence ranking; for example, the top N candidates can be taken as the target highlight segments. Finally, the target highlight segments are combined to form the target content of the text to be processed.
Because the training data of the summary fine-ranking model is constructed from reader comment counts or reader underline counts, the reader behavioral features must be left out of the model's input when the model is actually used, so that its output remains reliable. Therefore, in this embodiment of the present invention, the behavioral features are instead exploited through popularity tiering, which compensates for this limitation and improves the accuracy of the excellence ranking. As shown in Fig. 4, determining the excellence ranking of the candidate highlight segments mainly includes the following steps:
(1) Determine the reader underline count or reader comment count corresponding to each of the multiple chapters of the text to be processed, and determine ranking thresholds from these counts. The ranking thresholds may include a confidence threshold and tiering thresholds. The distribution of reader underline counts or reader comment counts across the chapters can be computed, for example the mean, maximum, and minimum of the counts over the chapters. The mean is then used as the confidence threshold, and the tiering thresholds are determined from the maximum and minimum. For example, if the maximum is 1000 and the minimum is 100, the tiering threshold of the first popularity tier may be set to 800, that of the second popularity tier to 500, and that of the third popularity tier to 100. Determining the confidence threshold and tiering thresholds heuristically in this way allows books with different shelf times and different sales popularity to be treated differently, which can improve the accuracy of the excellence ranking.
(2) Determine the excellence ranking according to the ranking thresholds and the summary fine-ranking model. The candidate highlight segments can be classified into credible and non-credible highlight segments according to the confidence threshold and each candidate's reader comment count or reader underline count. One possible implementation is: candidates whose count is greater than the confidence threshold are treated as credible highlight segments, and candidates whose count is not greater than the confidence threshold are treated as non-credible highlight segments. On this basis, on the one hand, for credible highlight segments, the popularity tier of each credible highlight segment can first be determined from the tiering thresholds: when the reader comment count or reader underline count of a credible highlight segment is higher than the tiering threshold of a given popularity tier, the segment is determined to belong to that tier. Popularity is divided into multiple tiers, such as level 1, level 2, ..., level n; the specific number of tiers can be chosen according to the application scenario and user requirements. The excellence ranking of the credible highlight segments is then determined from the popularity tiers and the summary fine-ranking model: a credible highlight segment in a higher popularity tier ranks above one in a lower tier, while the ranking among credible highlight segments in the same tier is determined by the summary fine-ranking model. Specifically, these segments can be input into the summary fine-ranking model pair by pair, so that the relative excellence of every two credible highlight segments is determined first and the ranking among these credible highlight segments is derived from it.
For example: the credible highlight segments are A, B, C, D, and E, where the first popularity tier contains A, B, and C and the second popularity tier contains D and E. A+B, A+C, and B+C are input in turn into the summary fine-ranking model, which finds that the excellence of A is higher than that of both B and C and that the excellence of B is lower than that of C, so the excellence ranks of A, B, and C are 1, 3, and 2 respectively. Similarly, the excellence ranks of D and E are found to be 2 and 1. The overall ranking of A, B, C, D, and E from high to low is therefore A, C, B, E, D.
On the other hand, for non-credible highlight segments, the predicted excellence rank of a non-credible highlight segment among the at least one credible highlight segment of each popularity tier can be determined first. For example, if the credible highlight segments of the first popularity tier are A, B, and C, then a non-credible highlight segment G is assumed to belong to the first popularity tier and takes part in the within-tier ranking of A, B, and C. If the summary fine-ranking model ranks A, B, C, and G from high to low as B, A, G, C, the predicted excellence rank of G in the first popularity tier is 3. The excellence ranking of the non-credible highlight segment is then determined from its predicted ranks: the average of the predicted ranks over the multiple popularity tiers can be computed first, and the excellence ranking of the non-credible highlight segment is determined from this average rank.
It should be noted that, as shown in Fig. 4, if no candidate highlight segment can be reliably tiered, that is, if there are no credible highlight segments, the summary fine-ranking model is used directly to determine the excellence ranking among the non-credible highlight segments.
In this embodiment of the present invention, a training text is first obtained and its training information is determined; the training information includes multiple paragraphs of the training text, type information of the training text, and behavioral features of readers of the training text. A highlight segment of each of the multiple paragraphs can then be determined according to the behavioral features. Next, the type information, the multiple paragraphs, and the highlight segments are input into the first model to be trained to obtain the summary coarse-extraction model. Then, training data for the second model to be trained is determined according to the summary coarse-extraction model, and the training data is input into the second model to be trained for training to obtain the summary fine-ranking model. Finally, the target content of the text to be processed is determined according to the summary coarse-extraction model and the summary fine-ranking model. All supervision data, for both the supervised summary coarse-extraction model and the semi-supervised summary fine-ranking model, is constructed from user behavioral features in popular books, so the highlights of long texts such as books can be extracted automatically without any additional manually labeled data. In addition, the summary coarse-extraction model innovatively predicts the highlight segment within a paragraph using a machine reading comprehension approach, which helps ensure the accuracy of highlight segment/content extraction.
Refer to Fig. 7, which is a schematic structural diagram of a device for extracting target content according to an embodiment of the present invention. As shown in the figure, the device may include at least one processor 701, at least one communication interface 702, at least one memory 703, and at least one communication bus 704.
The processor 701 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present invention. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
The communication bus 704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in Fig. 7, but this does not mean that there is only one bus or only one type of bus. The communication bus 704 is used to implement connection and communication among these components. In this embodiment of the present invention, the communication interface 702 of the device is used for signaling or data communication with other node devices.
The memory 703 may include random access memory, such as Nonvolatile Random Access Memory (NVRAM), Phase Change RAM (PRAM), and Magnetoresistive RAM (MRAM), and may also include other non-volatile memory, for example at least one magnetic disk storage device, Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory devices such as NOR flash memory or NAND flash memory, and semiconductor devices such as a Solid State Disk (SSD). Optionally, the memory 703 may also be at least one storage device located remotely from the aforementioned processor 701. A set of program code is stored in the memory 703, and the processor 701 executes the program in the memory 703 to perform the following operations:
obtaining a first training text, the first training text being a long text whose length exceeds a preset threshold;
determining training information of the first training text, the training information including multiple paragraphs in the first training text, type information of the first training text, and behavioral features of readers of the first training text;
determining, according to the behavioral features, a highlight segment of each of the multiple paragraphs;
inputting the type information, the multiple paragraphs, and the highlight segments into a first model to be trained for training, to obtain a summary coarse-ranking model; and
determining, according to the summary coarse-ranking model, the target content of a text to be processed.
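The step of deriving a highlight segment for each paragraph from reader behavior can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `Paragraph` and `SpanExample` structures, the representation of reader behavior as underlined character spans with counts, and the rule of taking the most-underlined span as the paragraph's highlight segment. The resulting examples have the shape of reading-comprehension training data (a context paragraph plus an answer span), in line with the machine-reading-comprehension formulation mentioned for the coarse-ranking model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Paragraph:
    text: str
    # Hypothetical behavior record: (start, end, underline_count), end exclusive.
    underlined_spans: List[Tuple[int, int, int]]


@dataclass
class SpanExample:
    text_type: str   # type information of the book, e.g. "novel"
    paragraph: str   # context paragraph
    start: int       # highlight span start (character offset)
    end: int         # highlight span end (exclusive)


def pick_highlight(paragraph: Paragraph) -> Optional[Tuple[int, int]]:
    """Assumed rule: the most-underlined span is the paragraph's highlight."""
    if not paragraph.underlined_spans:
        return None
    start, end, _count = max(paragraph.underlined_spans, key=lambda s: s[2])
    return start, end


def build_examples(text_type: str, paragraphs: List[Paragraph]) -> List[SpanExample]:
    """Turn paragraphs with reader underlines into span-prediction examples."""
    examples = []
    for p in paragraphs:
        span = pick_highlight(p)
        if span is None:
            continue  # paragraphs without any reader behavior yield no label
        examples.append(SpanExample(text_type, p.text, span[0], span[1]))
    return examples
```

Paragraphs that no reader has underlined are simply skipped here; how such paragraphs are treated in practice is a design choice the sketch does not settle.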
Optionally, the processor 701 is further configured to perform the following operations:
determining, according to the summary coarse-ranking model, training data for a second model to be trained;
inputting the training data into the second model to be trained for training, to obtain a summary fine-ranking model; and
determining the target content according to the summary coarse-ranking model and the summary fine-ranking model.
Optionally, the processor 701 is further configured to perform the following operations:
inputting the text to be processed into the summary coarse-ranking model to obtain multiple candidate highlight segments;
determining, according to the summary fine-ranking model, the highlight-degree ranking of each of the multiple candidate highlight segments; and
determining, according to the highlight-degree ranking, a target highlight segment among the multiple candidate highlight segments, the target content including the target highlight segment.
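The two-stage inference described above (coarse model proposes candidates, fine model ranks them, top-ranked segments become the target content) can be sketched as follows. The two models are represented by plain callables, which is an assumption for illustration: `coarse_model` stands in for the trained coarse-ranking model, and `prefers(a, b)` stands in for a pairwise fine-ranking model that says whether segment `a` is judged more of a highlight than segment `b`. Ranking by number of pairwise wins is likewise one simple aggregation choice, not necessarily the one used in the patent.

```python
from typing import Callable, List


def rank_by_pairwise_wins(
    candidates: List[str], prefers: Callable[[str, str], bool]
) -> List[str]:
    """Order candidates by how many other candidates they beat pairwise."""
    wins = [0] * len(candidates)
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i != j and prefers(a, b):
                wins[i] += 1
    # sorted() is stable, so ties keep the coarse model's original order.
    order = sorted(range(len(candidates)), key=lambda i: -wins[i])
    return [candidates[i] for i in order]


def extract_target_highlights(
    text: str,
    coarse_model: Callable[[str], List[str]],
    prefers: Callable[[str, str], bool],
    top_k: int = 1,
) -> List[str]:
    """Coarse stage proposes candidates; fine stage ranks them; keep top_k."""
    candidates = coarse_model(text)
    ranked = rank_by_pairwise_wins(candidates, prefers)
    return ranked[:top_k]


# Toy stand-ins (hypothetical): split on periods, prefer longer segments.
toy_coarse = lambda t: [s.strip() for s in t.split(".") if s.strip()]
toy_prefers = lambda a, b: len(a) > len(b)
```

With the toy stand-ins, `extract_target_highlights("Short. A much longer sentence here. Medium one.", toy_coarse, toy_prefers, top_k=2)` keeps the two longest sentences.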
Optionally, the behavioural characteristic includes reader comment's number or reader's scribing line number;
Processor 701 is also used to perform the following operations step:
The second training text is obtained, second training text is the long text that text size is more than the preset threshold;
According to the thick climbing form type of abstract, multiple wonderfuls in second training text are determined
The multiple wonderful progress combination of two is obtained into the training data;
The reader for two wonderfuls for being included according to the training data crosses number or reader comment's number, determine described in The tag along sort of training data;
The training data and tag along sort input described second are trained to training pattern, obtain described pluck Want essence row's model.
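The pairwise construction of training data for the fine-ranking model can be sketched in a few lines. The input format (a list of `(segment, underline_count)` tuples), the label convention (1 if the first segment has more underlines, 0 otherwise), and the decision to skip ties are assumptions for illustration; the text above only states that segments are combined in pairs and labeled from their underline or comment counts.

```python
from itertools import combinations
from typing import List, Tuple


def build_pairwise_training_data(
    highlights: List[Tuple[str, int]]
) -> List[Tuple[str, str, int]]:
    """Combine highlight segments in pairs and label each pair.

    highlights: (segment_text, reader_underline_count) for each segment
    proposed by the coarse-ranking model on the second training text.
    Returns (segment_a, segment_b, label) with label 1 when segment_a has
    more underlines than segment_b, else 0.
    """
    pairs = []
    for (seg_a, n_a), (seg_b, n_b) in combinations(highlights, 2):
        if n_a == n_b:
            continue  # assumption: ties carry no preference signal
        pairs.append((seg_a, seg_b, 1 if n_a > n_b else 0))
    return pairs
```

For example, the input `[("a", 10), ("b", 3), ("c", 10)]` yields two labeled pairs, `("a", "b", 1)` and `("b", "c", 0)`; the tied pair `("a", "c")` is dropped.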
Optionally, the text to be processed includes multiple chapters;
the processor 701 is further configured to perform the following operations:
determining the number of reader underlines or the number of reader comments corresponding to each of the multiple chapters;
determining ranking thresholds according to the number of reader underlines or the number of reader comments corresponding to each chapter, the ranking thresholds including a confidence threshold and a grading threshold; and
determining the highlight-degree ranking according to the ranking thresholds and the summary fine-ranking model.
Optionally, the processor 701 is further configured to perform the following operations:
classifying the multiple candidate highlight segments according to the confidence threshold and the number of reader comments or reader underlines of each candidate highlight segment, to obtain credible highlight segments and non-credible highlight segments;
determining the popularity level of each credible highlight segment according to the grading threshold, and determining the highlight-degree ranking of the credible highlight segments according to the popularity level and the summary fine-ranking model; and
determining a predicted highlight-degree ranking of each non-credible highlight segment among the at least one credible highlight segment corresponding to each popularity level, and determining the highlight-degree ranking of the non-credible highlight segment according to the predicted highlight-degree ranking.
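The split into credible and non-credible candidates and the assignment of popularity levels can be sketched as below. The thresholds are taken as given inputs; the comparison direction (`>=` counts as credible) and the use of a sorted list of grading thresholds that maps a count to an integer level are assumptions for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

Candidate = Tuple[str, int]  # (segment_text, reader_underline_or_comment_count)


def split_by_confidence(
    candidates: List[Candidate], confidence_threshold: int
) -> Tuple[List[Candidate], List[Candidate]]:
    """Candidates with enough reader behavior are treated as credible."""
    credible = [c for c in candidates if c[1] >= confidence_threshold]
    non_credible = [c for c in candidates if c[1] < confidence_threshold]
    return credible, non_credible


def popularity_level(count: int, grading_thresholds: List[int]) -> int:
    """Map a count to a level: 0 below the first threshold, 1 above it, ...

    grading_thresholds must be sorted in ascending order.
    """
    return bisect_right(grading_thresholds, count)
```

With a confidence threshold of 10 and grading thresholds `[50, 100]`, a segment with 40 underlines is credible at level 0, and one with 150 underlines is credible at level 2.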
Optionally, the processor 701 is further configured to perform the following operations:
determining the average of the predicted highlight-degree rankings corresponding to the multiple popularity levels; and
determining the highlight-degree ranking of the non-credible highlight segment according to the average ranking.
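One way to read the averaging step is sketched below: a non-credible segment is compared against the credible segments of each popularity level, its predicted rank within that level is one plus the number of credible segments preferred over it, and the per-level ranks are averaged. The `prefers(a, b)` callable is a hypothetical stand-in for the pairwise fine-ranking model, and the rank definition itself is an assumption rather than the patent's stated formula.

```python
from typing import Callable, Dict, List


def predicted_rank(
    segment: str, credible_in_level: List[str], prefers: Callable[[str, str], bool]
) -> int:
    """Rank of the segment if inserted among one level's credible segments."""
    return 1 + sum(1 for c in credible_in_level if prefers(c, segment))


def average_predicted_rank(
    segment: str,
    credible_by_level: Dict[int, List[str]],
    prefers: Callable[[str, str], bool],
) -> float:
    """Average the predicted rank over all non-empty popularity levels."""
    ranks = [
        predicted_rank(segment, segs, prefers)
        for segs in credible_by_level.values()
        if segs
    ]
    if not ranks:
        # No credible segments at all: the caller should rank the
        # non-credible segments with the fine-ranking model directly.
        raise ValueError("no credible segments available")
    return sum(ranks) / len(ranks)


def rank_non_credible(
    segments: List[str],
    credible_by_level: Dict[int, List[str]],
    prefers: Callable[[str, str], bool],
) -> List[str]:
    """Smaller average predicted rank means a better position."""
    return sorted(
        segments, key=lambda s: average_predicted_rank(s, credible_by_level, prefers)
    )
```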
Optionally, the processor 701 is further configured to perform the following operations:
determining that credible highlight segments with a higher popularity level rank higher in highlight degree than credible highlight segments with a lower popularity level; and
determining, according to the summary fine-ranking model, the highlight-degree ranking among credible highlight segments with the same popularity level.
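The ordering rule for credible segments (higher popularity level first, fine-ranking model as the tie-breaker within a level) can be sketched as follows. As before, `prefers(a, b)` is a hypothetical stand-in for the pairwise fine-ranking model, and counting pairwise wins within a level is an assumed way of turning pairwise decisions into an order.

```python
from typing import Callable, List, Tuple


def rank_credible(
    credible: List[Tuple[str, int]], prefers: Callable[[str, str], bool]
) -> List[str]:
    """credible: (segment_text, popularity_level) pairs.

    Higher levels always come first; inside a level, segments are ordered
    by how many same-level segments they beat according to `prefers`.
    """
    ranked: List[str] = []
    for level in sorted({lvl for _, lvl in credible}, reverse=True):
        group = [seg for seg, lvl in credible if lvl == level]
        wins = [
            sum(1 for j, other in enumerate(group) if j != i and prefers(seg, other))
            for i, seg in enumerate(group)
        ]
        order = sorted(range(len(group)), key=lambda i: -wins[i])
        ranked.extend(group[i] for i in order)
    return ranked
```

Because the level is applied before the model, a segment from a lower level can never overtake one from a higher level, whatever the model prefers.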
Optionally, the processor 701 is further configured to perform the following operation:
displaying recommendation information, the recommendation information including the target content and being used to recommend the text to be processed to a user.
Further, the processor may also cooperate with the memory and the communication interface to perform the operations of the target content extraction apparatus in the foregoing embodiments of the invention.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a Solid State Disk (SSD)), or the like.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for extracting target content, characterized in that the method comprises:
obtaining a first training text, the first training text being a long text whose length exceeds a preset threshold;
determining training information of the first training text, the training information including multiple paragraphs in the first training text, type information of the first training text, and behavioral features of readers of the first training text;
determining, according to the behavioral features, a highlight segment of each of the multiple paragraphs;
inputting the type information, the multiple paragraphs, and the highlight segments into a first model to be trained for training, to obtain a summary coarse-ranking model; and
determining, according to the summary coarse-ranking model, the target content of a text to be processed.
2. The method according to claim 1, characterized in that determining the target content of the text to be processed according to the summary coarse-ranking model comprises:
determining, according to the summary coarse-ranking model, training data for a second model to be trained;
inputting the training data into the second model to be trained for training, to obtain a summary fine-ranking model; and
determining the target content according to the summary coarse-ranking model and the summary fine-ranking model.
3. The method according to claim 2, characterized in that determining the target content according to the summary coarse-ranking model and the summary fine-ranking model comprises:
inputting the text to be processed into the summary coarse-ranking model to obtain multiple candidate highlight segments;
determining, according to the summary fine-ranking model, the highlight-degree ranking of each of the multiple candidate highlight segments; and
determining, according to the highlight-degree ranking, a target highlight segment among the multiple candidate highlight segments, the target content including the target highlight segment.
4. The method according to claim 2, characterized in that the behavioral features include the number of reader comments or the number of reader underlines;
determining the training data for the second model to be trained according to the summary coarse-ranking model comprises:
obtaining a second training text, the second training text being a long text whose length exceeds the preset threshold;
determining, according to the summary coarse-ranking model, multiple highlight segments in the second training text; and
combining the multiple highlight segments in pairs to obtain the training data;
and inputting the training data into the second model to be trained for training to obtain the summary fine-ranking model comprises:
determining a classification label of the training data according to the number of reader underlines or the number of reader comments of the two highlight segments included in the training data; and
inputting the training data and the classification label into the second model to be trained for training, to obtain the summary fine-ranking model.
5. The method according to claim 3, characterized in that the text to be processed includes multiple chapters;
determining the highlight-degree ranking of each of the multiple candidate highlight segments according to the summary fine-ranking model comprises:
determining the number of reader underlines or the number of reader comments corresponding to each of the multiple chapters;
determining ranking thresholds according to the number of reader underlines or the number of reader comments corresponding to each chapter, the ranking thresholds including a confidence threshold and a grading threshold; and
determining the highlight-degree ranking according to the ranking thresholds and the summary fine-ranking model.
6. The method according to claim 5, characterized in that determining the highlight-degree ranking according to the ranking thresholds and the summary fine-ranking model comprises:
classifying the multiple candidate highlight segments according to the confidence threshold and the number of reader comments or reader underlines of each candidate highlight segment, to obtain credible highlight segments and non-credible highlight segments;
determining the popularity level of each credible highlight segment according to the grading threshold, and determining the highlight-degree ranking of the credible highlight segments according to the popularity level and the summary fine-ranking model; and
determining a predicted highlight-degree ranking of each non-credible highlight segment among the at least one credible highlight segment corresponding to each popularity level, and determining the highlight-degree ranking of the non-credible highlight segment according to the predicted highlight-degree ranking.
7. The method according to claim 6, characterized in that determining the highlight-degree ranking of the non-credible highlight segment according to the predicted highlight-degree ranking comprises:
determining the average of the predicted highlight-degree rankings corresponding to the multiple popularity levels; and
determining the highlight-degree ranking of the non-credible highlight segment according to the average ranking.
8. The method according to claim 6, characterized in that determining the highlight-degree ranking of the credible highlight segments according to the popularity level and the summary fine-ranking model comprises:
determining that credible highlight segments with a higher popularity level rank higher in highlight degree than credible highlight segments with a lower popularity level; and
determining, according to the summary fine-ranking model, the highlight-degree ranking among credible highlight segments with the same popularity level.
9. The method according to claim 1, characterized in that, after determining the target content of the text to be processed according to the summary coarse-ranking model, the method further comprises:
displaying recommendation information, the recommendation information including the target content and being used to recommend the text to be processed to a user.
10. An apparatus for extracting target content, characterized in that the apparatus comprises:
a sample collection module, configured to obtain a first training text, the first training text being a long text whose length exceeds a preset threshold;
an information determination module, configured to determine training information of the first training text, the training information including multiple paragraphs in the first training text, type information of the first training text, and behavioral features of readers of the first training text;
the information determination module being further configured to determine, according to the behavioral features, a highlight segment of each of the multiple paragraphs;
a model training module, configured to input the type information and the highlight segments into a first model to be trained for training, to obtain a summary coarse-ranking model; and
a text summarization module, configured to determine the target content of a text to be processed according to the summary coarse-ranking model.
CN201910716302.1A 2019-07-31 2019-07-31 Target content extraction method and related equipment Active CN110427482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910716302.1A CN110427482B (en) 2019-07-31 2019-07-31 Target content extraction method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910716302.1A CN110427482B (en) 2019-07-31 2019-07-31 Target content extraction method and related equipment

Publications (2)

Publication Number Publication Date
CN110427482A true CN110427482A (en) 2019-11-08
CN110427482B CN110427482B (en) 2024-07-23

Family

ID=68414062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910716302.1A Active CN110427482B (en) 2019-07-31 2019-07-31 Target content extraction method and related equipment

Country Status (1)

Country Link
CN (1) CN110427482B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
CN113035310A (en) * 2019-12-25 2021-06-25 医渡云(北京)技术有限公司 Deep learning-based medical RCT report analysis method and device
CN113035310B (en) * 2019-12-25 2024-01-09 医渡云(北京)技术有限公司 Medical RCT report analysis method and device based on deep learning
WO2021135469A1 (en) * 2020-06-17 2021-07-08 平安科技(深圳)有限公司 Machine learning-based information extraction method, apparatus, computer device, and medium
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model
CN112749544B (en) * 2020-12-28 2024-04-30 思必驰科技股份有限公司 Training method and system of paragraph segmentation model
CN112800465A (en) * 2021-02-09 2021-05-14 第四范式(北京)技术有限公司 Method and device for processing text data to be labeled, electronic equipment and medium

Also Published As

Publication number Publication date
CN110427482B (en) 2024-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant