CN108121697A - Method, apparatus, device and computer storage medium for text rewriting - Google Patents
- Publication number
- CN108121697A (application CN201711138896.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- generalization
- template
- content
- component
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a text rewriting method, comprising: obtaining text content to be rewritten; determining the generalizable components of the text content to obtain a generalization template of the text content; and matching a rewrite template corresponding to the generalization template and rewriting the text content based on the rewrite template. By generalizing the text content into a generalization template, matching the corresponding rewrite template, and rewriting the text content according to the matched rewrite template, the invention rewrites the text content at the level of the whole sentence and improves the rewriting effect.
Description
【Technical field】
The present invention relates to natural language processing technology, and in particular to a method, apparatus, device and computer storage medium for text rewriting.
【Background】
Rewriting technology is in wide demand. For example, a search engine needs to rewrite user queries in order to expand recall; some text resources need rewriting to improve their diversity; and some article titles need rewriting to make them better. Most existing rewriting techniques are built on the statistical machine translation framework. Although these techniques are highly controllable and accurate, they make only local changes, so the rewritten sentence differs little from the original, keeps the same sentence pattern as the original, and cannot satisfy certain specific rewriting needs. A text rewriting method that improves the rewriting effect is therefore urgently needed.
【Summary of the invention】
In view of this, the present invention provides a method, apparatus, device and computer storage medium for text rewriting, for improving the rewriting effect on text content.
The technical solution adopted by the present invention to solve the technical problem is to provide a text rewriting method, the method comprising: obtaining text content to be rewritten; determining the generalizable components of the text content to obtain a generalization template of the text content; and matching a rewrite template corresponding to the generalization template and rewriting the text content based on the rewrite template.
According to a preferred embodiment of the present invention, determining the generalizable components of the text content comprises: performing word segmentation on the text content to obtain a word segmentation result; parsing the word segmentation result to obtain the part of speech of each word in the text content; and determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
According to a preferred embodiment of the present invention, the preset part-of-speech generalization requirement is: generalizing at least one of the nouns, numerals and time words in the text content.
According to a preferred embodiment of the present invention, obtaining the generalization template of the text content comprises: generalizing the text content based on the determined generalizable components to obtain generalization results; and obtaining the generalization template of the text content from the generalization results.
According to a preferred embodiment of the present invention, the rewrite template corresponding to a generalization template is generated in advance as follows: obtaining a parallel corpus of text pairs; determining the generalizable components of each text in a text pair based on the preset part-of-speech generalization requirement, and generalizing each text based on the determined generalizable components; and taking the generalization result of one text in the pair as the generalization template and the generalization result of the other text as the corresponding rewrite template.
According to a preferred embodiment of the present invention, the generalization comprises: generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
According to a preferred embodiment of the present invention, the parallel corpus of text pairs is obtained as follows: obtaining a text corpus; determining the alignment score of any text pair in the text corpus; and taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
According to a preferred embodiment of the present invention, determining the alignment score of any text pair in the text corpus comprises: performing word segmentation on each text to obtain the word segmentation result of each text; marking the deletable components in the word segmentation results using a preset deletion dictionary; and determining the alignment probability of the unmarked components between the two word segmentation results of the text pair, and determining the alignment score of the text pair from the alignment probability.
According to a preferred embodiment of the present invention, before matching the rewrite template corresponding to the generalization template, the method further comprises: performing synonym expansion on the components of the generalization template that have not been generalized; or compressing the specific structures contained in the generalization template using a preset compressible-structure dictionary.
According to a preferred embodiment of the present invention, the method further comprises: scoring the matched rewrite templates using an evaluation model; and, according to the scoring result, rewriting the text content using a rewrite template that meets a preset requirement.
According to a preferred embodiment of the present invention, the evaluation model is trained in advance as follows: obtaining training samples, each comprising a template pair consisting of a generalization template and its corresponding rewrite template, together with a pre-labelled score for the rewrite template; and training a logistic regression model with the matching features of the template pair as input and the labelled score as output, to obtain the evaluation model.
According to a preferred embodiment of the present invention, the matching features of the template pair comprise at least one of: slot alignment information, slot word-vector similarity, slot proper-noun similarity, slot literal similarity, slot-boundary language model value, text alignment degree, template alignment count and click prediction score.
The technical solution adopted by the present invention to solve the technical problem is also to provide a text rewriting apparatus, the apparatus comprising: an acquiring unit for obtaining text content to be rewritten; a generalization unit for determining the generalizable components of the text content and obtaining a generalization template of the text content; and a rewriting unit for matching a rewrite template corresponding to the generalization template and rewriting the text content based on the rewrite template.
According to a preferred embodiment of the present invention, when determining the generalizable components of the text content, the generalization unit specifically performs: performing word segmentation on the text content to obtain a word segmentation result; parsing the word segmentation result to obtain the part of speech of each word in the text content; and determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
According to a preferred embodiment of the present invention, the preset part-of-speech generalization requirement is: generalizing at least one of the nouns, numerals and time words in the text content.
According to a preferred embodiment of the present invention, when obtaining the generalization template of the text content, the generalization unit specifically performs: generalizing the text content based on the determined generalizable components to obtain generalization results; and obtaining the generalization template of the text content from the generalization results.
According to a preferred embodiment of the present invention, the apparatus further comprises a generation unit which, when generating in advance the rewrite template corresponding to a generalization template, specifically performs: obtaining a parallel corpus of text pairs; determining the generalizable components of each text in a text pair based on the preset part-of-speech generalization requirement, and generalizing each text based on the determined generalizable components; and taking the generalization result of one text in the pair as the generalization template and the generalization result of the other text as the corresponding rewrite template.
According to a preferred embodiment of the present invention, when generalizing, the generalization unit or the generation unit specifically performs: generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
According to a preferred embodiment of the present invention, when obtaining the parallel corpus of text pairs, the generation unit specifically performs: obtaining a text corpus; determining the alignment score of any text pair in the text corpus; and taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
According to a preferred embodiment of the present invention, when determining the alignment score of any text pair in the text corpus, the generation unit specifically performs: performing word segmentation on each text to obtain the word segmentation result of each text; marking the deletable components in the word segmentation results using a preset deletion dictionary; and determining the alignment probability of the unmarked components between the two word segmentation results of the text pair, and determining the alignment score of the text pair from the alignment probability.
According to a preferred embodiment of the present invention, before matching the rewrite template corresponding to the generalization template, the rewriting unit further performs: performing synonym expansion on the components of the generalization template that have not been generalized; or compressing the specific structures contained in the generalization template using a preset compressible-structure dictionary.
According to a preferred embodiment of the present invention, the rewriting unit is further configured to perform: scoring the matched rewrite templates using an evaluation model; and, according to the scoring result, rewriting the text content using a rewrite template that meets a preset requirement.
According to a preferred embodiment of the present invention, the apparatus further comprises a training unit which, when training the evaluation model in advance, specifically performs: obtaining training samples, each comprising a template pair consisting of a generalization template and its corresponding rewrite template, together with a pre-labelled score for the rewrite template; and training a logistic regression model with the matching features of the template pair as input and the labelled score as output, to obtain the evaluation model.
As can be seen from the above technical solutions, the present invention first generalizes the text based on the preset part-of-speech generalization requirement to obtain the generalization template of the text, then matches a rewrite template corresponding to the generalization template, and rewrites the text based on the matched rewrite template, thereby improving the text rewriting effect.
【Brief description of the drawings】
Fig. 1 is a flowchart of the text rewriting method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of the text rewriting apparatus provided by an embodiment of the present invention;
Fig. 3 is a block diagram of the computer system/server provided by an embodiment of the present invention.
【Detailed description】
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of the text rewriting method provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises:
In 101, text content to be rewritten is obtained.
In this step, the text content to be rewritten may be a title to be rewritten or a search keyword to be rewritten.
In 102, the generalizable components of the text content are determined and the generalization template of the text content is obtained.
In this step, the text content obtained in step 101 is generalized to obtain its generalization template. To generalize the text content, its generalizable components are determined first, and the text content is then generalized based on the determined generalizable components.
Specifically, the generalizable components of the text content may be determined as follows: first, word segmentation is performed on the text content to obtain its word segmentation result; the word segmentation result is then parsed to obtain the part of speech of each word in the text content; finally, the generalizable components of the text content are determined based on the preset part-of-speech generalization requirement. The preset part-of-speech generalization requirement is: generalize at least one of nouns, numerals and time words. The generalizable components of the text content therefore include at least one of nouns, numerals and time words. It should also be understood that the generalized nouns may include place names, personal names, various specific terms and common nouns.
Once the generalizable components have been determined, the text content can be generalized to obtain its generalization result. Generalizing a generalizable component means replacing it with the part-of-speech slot it corresponds to. For example, "Qingdao" is a place noun, so it is generalized to "[place]"; "October" is a time word, so it is generalized to "[time]".
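The determination and slot replacement described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be already segmented and part-of-speech tagged by some external tool, the tag names ("place", "number", "noun", and so on) are hypothetical, and the English tokens stand in for the segmented Chinese example.

```python
# Hypothetical tag names; a real segmenter / part-of-speech tagger supplies its own.
GENERALIZABLE_TAGS = {"noun", "place", "name", "number", "time"}

def find_generalizable(tagged_tokens, tags=GENERALIZABLE_TAGS):
    """Return the positions of tokens whose tag is in the preset generalization list."""
    return [i for i, (_, tag) in enumerate(tagged_tokens) if tag in tags]

def generalize(tagged_tokens, positions):
    """Replace the tokens at `positions` with their part-of-speech slots."""
    chosen = set(positions)
    return [f"[{tag}]" if i in chosen else word
            for i, (word, tag) in enumerate(tagged_tokens)]

def detokenize(parts):
    """Join with spaces, attaching the possessive marker to the preceding token."""
    return " ".join(parts).replace(" 's", "'s")

# English stand-in for the segmented and tagged example text.
tagged = [("about", "prep"), ("Qingdao", "place"), ("'s", "particle"),
          ("10", "number"), ("secrets", "noun")]

positions = find_generalizable(tagged)
template = detokenize(generalize(tagged, positions))
print(positions)  # [1, 3, 4]
print(template)   # about [place]'s [number] [noun]
```

Keeping the choice of positions separate from the replacement makes it possible to generalize only some of the candidate components, which the later enumeration of templates relies on.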
After the generalization result of the text content is obtained, the components contained in the generalization result are permuted and combined to obtain the generalization templates of the text content. Because a generalization result may contain many components, the generalization results need to be permuted and combined to obtain all the generalization templates corresponding to the text content. For example, if the generalization result of a piece of text content is "about [place]'s [number] [noun]", then after permutation and combination the resulting generalization templates may include "about [place]'s [number] [noun]", "[number] about [place]'s [noun]", "[number] [noun] about [place]" and so on.
The process of generalizing the text content into generalization templates is illustrated with an example. Suppose the text content to be rewritten is "about Qingdao's 10 secrets". Word segmentation is first performed on the text content, giving the segmentation result "about", "Qingdao", "'s", "10" and "secrets". Part-of-speech parsing is then performed on the segmentation result to determine the part of speech of each word; for example, "Qingdao" is a place and "10" is a numeral. Next, "Qingdao" and "secrets", which are nouns, and "10", which is a numeral, are generalized. The generalization result may be "about [place]'s [number] [noun]", or "about [place]'s [number] secrets", or "about Qingdao's [number] [noun]", and so on. Finally, all generalization results are permuted and combined to obtain the generalization templates of the text content, which may include "[number] about [place]'s [noun]", "[number] secrets about [place]", "[number] about Qingdao's [noun]" and so on.
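One part of this enumeration, namely choosing which generalizable components to replace with slots, can be sketched with `itertools.combinations`. The sketch is an assumption about how the "combination" step might be implemented; it does not reorder components, and the tag names and English tokens are hypothetical stand-ins.

```python
from itertools import combinations

def detokenize(parts):
    """Join with spaces, attaching the possessive marker to the preceding token."""
    return " ".join(parts).replace(" 's", "'s")

def enumerate_templates(tagged_tokens, generalizable_tags):
    """Generalize every non-empty subset of the generalizable components."""
    positions = [i for i, (_, tag) in enumerate(tagged_tokens)
                 if tag in generalizable_tags]
    templates = []
    # Largest subsets first, so the fully generalized template comes first.
    for size in range(len(positions), 0, -1):
        for subset in combinations(positions, size):
            chosen = set(subset)
            templates.append(detokenize(
                f"[{tag}]" if i in chosen else word
                for i, (word, tag) in enumerate(tagged_tokens)))
    return templates

tagged = [("about", "prep"), ("Qingdao", "place"), ("'s", "particle"),
          ("10", "number"), ("secrets", "noun")]
templates = enumerate_templates(tagged, {"place", "number", "noun"})
for t in templates:
    print(t)
```

With three generalizable components this yields seven templates, from the fully generalized "about [place]'s [number] [noun]" down to templates that generalize a single component.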
In 103, a rewrite template corresponding to the generalization template is matched, and the text content is rewritten based on the rewrite template.
In this step, the rewrite template corresponding to the generalization template obtained in step 102 is matched, and the text content is rewritten based on the matched rewrite template to obtain the rewriting result of the text content. Each generalization template may correspond to at least one rewrite template, so the matching rewrite templates can be determined from the obtained generalization template.
Specifically, the rewrite templates corresponding to generalization templates are generated in advance as follows:
(1) Obtain a parallel corpus of text pairs.
In this step, the parallel corpus consists of text pairs that are related both semantically and syntactically; that is, the two texts of a pair belonging to the parallel corpus are semantically related and syntactically related.
Before the parallel corpus of text pairs is obtained, a text corpus must first be obtained. The text corpus may be a query-query corpus for rewriting search keywords, a title-title corpus for rewriting titles, or a query-title corpus. This embodiment takes title rewriting as an example, so the text corpus obtained is a title-title corpus.
The title-title corpus consists of the titles of all search results for a given search keyword. It may therefore be obtained as follows: based on display logs, obtain all search results returned for a search keyword, and arbitrarily select a pair of titles from the titles of those search results as a title-title corpus entry.
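A minimal sketch of collecting title-title candidates is given below. It assumes the display log has already been reduced to `(query, title)` records, which is a hypothetical format, and it enumerates every pair of titles per query rather than selecting one pair arbitrarily.

```python
from collections import defaultdict
from itertools import combinations

def title_pairs(display_log):
    """Group titles by search keyword and pair up the titles shown for the same keyword."""
    by_query = defaultdict(list)
    for query, title in display_log:
        if title not in by_query[query]:  # skip repeated impressions of one title
            by_query[query].append(title)
    pairs = []
    for titles in by_query.values():
        pairs.extend(combinations(titles, 2))
    return pairs

# Hypothetical log records: (search keyword, title of a displayed result).
log = [
    ("kyoto secrets", "title A"),
    ("kyoto secrets", "title B"),
    ("kyoto secrets", "title A"),
    ("kyoto secrets", "title C"),
    ("qingdao food", "title D"),
]
pairs = title_pairs(log)
print(pairs)  # [('title A', 'title B'), ('title A', 'title C'), ('title B', 'title C')]
```

A keyword with a single displayed title contributes no pair, since there is nothing to align it with.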
In the text corpus obtained in the previous step, the texts of each entry all correspond to the same search keyword, so the texts of a pair can be preliminarily assumed to be semantically related. After the text corpus is obtained, the syntactic relatedness of the texts is therefore determined by computing the alignment score between the texts of each entry.
Specifically, the alignment score between the texts of each entry may be determined as follows:
1) First, perform word segmentation on each text to obtain the word segmentation result of each text.
2) Using the preset deletion dictionary, mark the deletable components in the word segmentation result.
The pre-built deletion dictionary records many meaningless, deletable words or phrases, such as "you don't know", "do you know" and "exposé"; these words or phrases do not affect the semantics or the information content of the whole sentence. The deletion dictionary can be built by counting the deletable components in the text corpus already obtained and taking the words or phrases whose deletion frequency exceeds a certain threshold as its entries.
In this step, after the word segmentation result of each text is obtained, the deletion dictionary is consulted to check whether the result contains deletable components; if so, the deletable components contained in the word segmentation result of each text are marked.
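The dictionary construction and the marking step might look like the sketch below. How the observations of deleted components are collected is not specified in the text, so the `observed_deletions` list is a hypothetical input, and the English phrases are stand-ins for the Chinese entries.

```python
from collections import Counter

def build_deletion_dict(observed_deletions, min_count):
    """Keep the words or phrases whose deletion count exceeds `min_count`."""
    counts = Counter(observed_deletions)
    return {component for component, n in counts.items() if n > min_count}

def mark_deletable(segments, deletion_dict):
    """Pair each segment with a flag saying whether it is a deletable component."""
    return [(segment, segment in deletion_dict) for segment in segments]

# Hypothetical observations of components that were dropped between similar titles.
observed_deletions = (["you know them all"] * 3 + ["do you know"] * 2 + ["Kyoto"])
deletion_dict = build_deletion_dict(observed_deletions, min_count=1)

segments = ["regarding", "Kyoto", "'s", "50", "secrets", "you know them all"]
marked = mark_deletable(segments, deletion_dict)
print(marked[-1])  # ('you know them all', True)
```

Here "Kyoto" was observed as deleted only once, so it stays out of the dictionary and is not marked.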
3) Determine the alignment probability of the unmarked components between the two word segmentation results of the text pair, and use the alignment probability to determine the alignment score of the text pair.
The alignment probability of the unmarked components between the two word segmentation results of a text pair consists of the probability that the components of text one occur in text two and the probability that the components of text two occur in text one. The alignment score of the text pair is determined from these alignment probabilities.
For example, suppose text one and text two of a text pair each contain 5 components. If all 5 components of text one occur in text two, the alignment probability of text one with text two is 1; if 4 of the 5 components of text one occur in text two, the alignment probability of text one with text two is 0.8. Likewise, if all 5 components of text two occur in text one, the alignment probability of text two with text one is 1; if 3 of the 5 components of text two occur in text one, the alignment probability of text two with text one is 0.6.
The alignment score of the text pair is determined from the alignment probabilities between the two word segmentation results. For example, if the alignment probability of text one with text two is 1 and that of text two with text one is 0.8, the alignment score of the text pair may be (1, 0.8); alternatively, the two alignment probabilities may be averaged, giving an alignment score of 0.9. When the alignment score of a text pair meets the preset requirement, the text pair can be used as parallel corpus. A preset threshold may be used to decide whether a text pair meets the preset requirement. If the alignment score is (1, 0.8), both alignment probabilities must exceed the preset threshold for the text pair to meet the requirement; if the alignment score is 0.9, the text pair meets the requirement when that score exceeds the preset threshold.
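The two alignment probabilities and both forms of the threshold check can be sketched as follows. The sketch treats a component as aligned when the same token occurs in the other text, which is one simple reading of the description; the threshold values and the sample tokens are hypothetical.

```python
def alignment_probability(source, target):
    """Fraction of the components of `source` that also occur in `target`."""
    if not source:
        return 0.0
    target_set = set(target)
    return sum(1 for component in source if component in target_set) / len(source)

def alignment_score(text_one, text_two, deletable=frozenset()):
    """Return (P(one -> two), P(two -> one)) over components not marked deletable."""
    one = [c for c in text_one if c not in deletable]
    two = [c for c in text_two if c not in deletable]
    return alignment_probability(one, two), alignment_probability(two, one)

def is_parallel(score, threshold, average=False):
    """Apply the preset threshold to the averaged score or to both probabilities."""
    forward, backward = score
    if average:
        return (forward + backward) / 2 > threshold
    return forward > threshold and backward > threshold

text_one = ["Kyoto", "travel", "50", "secrets"]
text_two = ["Kyoto", "travel", "50", "secrets", "list", "you know them all"]
score = alignment_score(text_one, text_two, deletable={"you know them all"})
print(score)  # (1.0, 0.8)
print(is_parallel(score, threshold=0.85))                # False: 0.8 is below 0.85
print(is_parallel(score, threshold=0.85, average=True))  # True: the average is 0.9
```

The example shows why the two forms of the requirement are not interchangeable: the same pair passes the averaged check and fails the stricter two-sided check.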
(2) Determine the generalizable components of each text in the text pair based on the preset part-of-speech generalization requirement, and generalize each text based on the determined generalizable components.
The generalizable components of each text in the text pair are determined based on the preset part-of-speech generalization requirement. In this embodiment, the preset part-of-speech generalization requirement is to generalize at least one of nouns, numerals and time words; that is, at least one of the nouns, numerals and time words contained in each text of the pair is taken as a generalizable component. Once the generalizable components of each text are determined, each text is generalized, which means replacing its generalizable components with their corresponding part-of-speech slots. It should also be understood that, because a generalization result may contain many components, the components contained in the generalization result of each text also need to be permuted and combined to obtain all possible generalization results.
(3) Take the generalization result of one text in the pair as the generalization template, and the generalization result of the other text as the corresponding rewrite template.
Generalizing each text yields its generalization result, which represents the sentence structure of that text. The generalization result of one text is taken as the generalization template and that of the other text as the rewrite template; that is, the generalization template and the rewrite template obtained from one text pair correspond to each other.
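Steps (2) and (3) can be sketched as building a lookup table from generalization templates to rewrite templates. The table layout, the tag names and the English stand-in tokens for a title pair about Kyoto are all assumptions made for illustration; for brevity the sketch generalizes every generalizable component and omits the enumeration of partial generalizations.

```python
def detokenize(parts):
    """Join with spaces, then attach the possessive marker and commas to the left."""
    return " ".join(parts).replace(" 's", "'s").replace(" ,", ",")

def to_template(tagged_tokens, generalizable_tags):
    """Replace every generalizable component with its part-of-speech slot."""
    return detokenize(f"[{tag}]" if tag in generalizable_tags else word
                      for word, tag in tagged_tokens)

def build_template_table(parallel_pairs, generalizable_tags):
    """Map each generalization template to the rewrite templates observed with it."""
    table = {}
    for source, target in parallel_pairs:
        rewrites = table.setdefault(to_template(source, generalizable_tags), [])
        rewrite = to_template(target, generalizable_tags)
        if rewrite not in rewrites:
            rewrites.append(rewrite)
    return table

TAGS = {"place", "number", "noun"}  # hypothetical tag names
source = [("about", "prep"), ("Kyoto", "place"), ("'s", "particle"),
          ("50", "number"), ("secrets", "noun")]
target = [("regarding", "prep"), ("Kyoto", "place"), ("'s", "particle"),
          ("50", "number"), ("secrets", "noun"), (",", "punct"),
          ("you know them all", "phrase")]

table = build_template_table([(source, target)], TAGS)
print(table)
```

Storing a list per key reflects the statement that one generalization template may correspond to more than one rewrite template.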
The above process is illustrated with an example. Suppose the title-title corpus entry obtained is "about Kyoto's 50 secrets" and "regarding Kyoto's 50 secrets, you know them all". The deletion dictionary is used to mark the deletable components of the two texts; for example, "you know them all" is marked as a deletable component. The alignment score of the text pair is then obtained: if all words other than the deletable component are aligned between the two texts, both alignment probabilities of the text pair are 1, so the text pair can be used as parallel corpus. Once the text pair is confirmed as parallel corpus, the generalizable components of the two texts are determined based on the preset part-of-speech generalization requirement. If "Kyoto" [place], "50" [number] and "secrets" [noun] are the generalizable components, the generalization results of the two texts are "about [place]'s [number] [noun]" and "regarding [place]'s [number] [noun], you know them all (deletable)". The generalization result "about [place]'s [number] [noun]" can then be taken as the generalization template, and "regarding [place]'s [number] [noun], you know them all" as the corresponding rewrite template.
In this step, before the rewrite template corresponding to the generalization template is matched, a template expansion strategy may also be applied to widen the range of rewrite templates that the generalization template can match.
Optionally, in one specific implementation of this embodiment, synonym expansion may be performed on the components of the generalization template that have not been generalized. Specifically, the non-generalized components of the generalization template are rewritten synonymously; that is, their content is replaced using synonyms, aliases and the like. For example, if the generalization template is "who is the missus of [name]" and the synonym of "missus" is "wife", the generalization template can be rewritten as "who is the wife of [name]"; if the generalization template is "[number] [noun] of programmers" and an alias of "programmer" is "code farmer", the generalization template can be rewritten as "[number] [noun] of code farmers".
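Synonym expansion of the non-generalized components can be sketched as below. The synonym table is a hypothetical stand-in for whatever synonym or alias resource is available, and slots are recognized simply by their leading bracket.

```python
def expand_with_synonyms(template_tokens, synonyms):
    """Return the template plus every variant obtained by swapping literal tokens."""
    variants = [[]]
    for token in template_tokens:
        if token.startswith("["):
            options = [token]  # part-of-speech slots are never replaced
        else:
            options = [token] + synonyms.get(token, [])
        variants = [prefix + [option] for prefix in variants for option in options]
    return [" ".join(variant) for variant in variants]

# Hypothetical synonym / alias table.
SYNONYMS = {"missus": ["wife"]}

expanded = expand_with_synonyms(["who", "is", "the", "missus", "of", "[name]"], SYNONYMS)
print(expanded)  # ['who is the missus of [name]', 'who is the wife of [name]']
```

Each expanded variant can then be looked up among the generalization templates, which widens the set of rewrite templates that can be matched.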
The specific structures in the generalization template may also be compressed based on a preset compressible-structure dictionary. The compressible-structure dictionary contains the structures that can be compressed and their corresponding compression results; for example, an "attributive + noun" structure can be compressed to "noun", and a "numeral + noun" structure can be compressed to "noun". For example, suppose the text content is "Beijing's 10 delicacies" and its generalization template is "Beijing's [number] [noun]". Because "10 delicacies" has the "numeral + noun" structure, it is compressed to "[noun 1]", and the generalization template of the text content becomes "Beijing's [noun 1]". It should be understood that, when templates are expanded in this way, the compressed structure needs to be restored when the text is rewritten; that is, "[noun 1]" is restored to "10 delicacies".
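Compression and the later restoration might be sketched as follows. The rule table with a single "numeral + noun" entry and the parallel `slots` / `words` lists are hypothetical representations chosen for illustration.

```python
# Hypothetical compressible-structure dictionary: slot pattern -> compressed slot.
COMPRESSIBLE = {("[number]", "[noun]"): "[noun 1]"}

def compress(slots, words, rules):
    """Collapse matching slot sequences and remember the words they covered.

    `slots[i]` is the slot or literal standing for `words[i]`.
    """
    out, restore, i = [], {}, 0
    while i < len(slots):
        for pattern, merged in rules.items():
            n = len(pattern)
            if tuple(slots[i:i + n]) == pattern:
                out.append(merged)
                restore[merged] = " ".join(words[i:i + n])
                i += n
                break
        else:  # no rule matched at position i
            out.append(slots[i])
            i += 1
    return out, restore

def restore_text(text, restore):
    """Put the original words back in place of each compressed slot."""
    for slot, original in restore.items():
        text = text.replace(slot, original)
    return text

slots = ["Beijing", "'s", "[number]", "[noun]"]
words = ["Beijing", "'s", "10", "delicacies"]
compressed, restore = compress(slots, words, COMPRESSIBLE)
template = " ".join(compressed).replace(" 's", "'s")
print(template)                         # Beijing's [noun 1]
print(restore_text(template, restore))  # Beijing's 10 delicacies
```

Returning the restore map together with the compressed template keeps the information needed to undo the compression after rewriting.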
After the rewrite template corresponding to the generalization template is matched, the text content is rewritten based on the matched rewrite template; that is, the generalized components present in the rewrite template are restored to the corresponding words of the text content. For example, suppose the text content to be rewritten is "about Qingdao's 10 secrets", its generalization template is "about [place]'s [number] secrets", and the rewrite template corresponding to that generalization template is "regarding [place]'s [number] secrets, you know them all". The generalized component "[place]" corresponds to "Qingdao" and "[number]" corresponds to "10", so the final rewriting result is "regarding Qingdao's 10 secrets, you know them all".
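The matching and slot restoration can be sketched as a table lookup followed by slot filling. The table contents, tag names and English tokens are hypothetical, and the sketch assumes each slot type occurs at most once in a text.

```python
def rewrite(tagged_tokens, generalizable_tags, template_table):
    """Generalize the text, look up rewrite templates, and fill their slots."""
    fillers, parts = {}, []
    for word, tag in tagged_tokens:
        if tag in generalizable_tags:
            slot = f"[{tag}]"
            fillers[slot] = word  # remember which word each slot stands for
            parts.append(slot)
        else:
            parts.append(word)
    key = " ".join(parts).replace(" 's", "'s")
    results = []
    for template in template_table.get(key, []):
        for slot, word in fillers.items():
            template = template.replace(slot, word)
        results.append(template)
    return results

# Hypothetical table produced in advance from the parallel corpus.
TABLE = {
    "about [place]'s [number] secrets":
        ["regarding [place]'s [number] secrets, you know them all"],
}
tagged = [("about", "prep"), ("Qingdao", "place"), ("'s", "particle"),
          ("10", "number"), ("secrets", "noun")]

print(rewrite(tagged, {"place", "number"}, TABLE))
# ['regarding Qingdao's 10 secrets, you know them all']
```

When no rewrite template is registered for the generalization template, the function returns an empty list, leaving the caller free to try another generalization of the same text.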
It should also be understood that, when matching the rewriting template corresponding to the generalized template, multiple rewriting templates may be found. The multiple rewriting templates can therefore be scored, and the rewriting template used to rewrite the text content is determined according to the scoring result. The rewriting templates can be scored using an evaluation model obtained by training in advance.
Specifically, the evaluation model is trained in advance as follows: training samples are obtained, where each training sample includes a template pair consisting of a generalized template and its corresponding rewriting template, together with a score annotated in advance for the rewriting template; matching features of the template pair are extracted; and, with the extracted matching features as input and the annotated score of the rewriting template as output, a logistic regression model is trained to obtain the evaluation model.
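The training procedure can be sketched with a small logistic regression fitted by stochastic gradient descent. This is a minimal sketch under stated assumptions: the feature vectors and scores are made-up toy values, the annotated scores are assumed to lie in [0, 1], and the names `train_logistic_regression` and `score` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, scores, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on cross-entropy loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, scores):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y  # gradient of the loss with respect to the logit
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def score(weights, bias, x):
    """Evaluation-model score of one template pair's feature vector."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Toy matching features (assumed): [slot alignment probability, slot similarity].
features = [[1.0, 0.9], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]]
scores = [1.0, 1.0, 0.0, 0.0]
weights, bias = train_logistic_regression(features, scores)
```

In practice a library implementation of logistic regression would normally be used; the hand-written loop only makes the input/output relationship explicit.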
The matching features extracted from a template pair consisting of a generalized template and its corresponding rewriting template include: slot alignment information, including the forward slot alignment probability, the reverse alignment probability, the number of alignments, etc.; slot word-vector similarity, i.e. the cosine similarity of the slot word vectors; slot proper-name similarity, i.e. using a category-specific dictionary to judge whether the slots belong to the same category; slot literal similarity, i.e. the similarity computed after segmenting each slot down to the word level; slot-boundary language model value, i.e. the fluency at the boundaries after slot replacement; text alignment degree, i.e. determining whether all unaligned components appear in the body text; template alignment count, i.e. a count over the template pair, which reflects the confidence of the template; and estimated click score, i.e. the score obtained by using a click prediction model to estimate clicks on the template.
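Of these features, the slot word-vector similarity has a standard closed form. The following sketch computes the cosine similarity of two slot vectors; the function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)
```

Identical directions give a similarity of 1 and orthogonal vectors give 0, so the value can be fed directly into the evaluation model as one feature.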
After the evaluation model is used to score the rewriting templates, the rewriting template that meets a preset requirement is taken as the final rewriting template according to the score of each rewriting template. If the scores of the rewriting templates differ, the rewriting template with the highest score is taken as the final rewriting template; if several rewriting templates share the highest score, any one of them is selected as the final rewriting template. After the final rewriting template is determined, the text content is rewritten using that rewriting template, with the generalizable components restored from the text content, to obtain the rewriting result of the text content.
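The selection rule can be sketched as follows. The function `select_final_template` is a hypothetical name, and breaking ties at random is only one way of "selecting any one" of the top-scoring templates.

```python
import random


def select_final_template(scored_templates, rng=random):
    """scored_templates: list of (template, score); return one top-scoring template."""
    best = max(score for _, score in scored_templates)
    candidates = [t for t, score in scored_templates if score == best]
    return rng.choice(candidates)  # any one of the tied templates


chosen = select_final_template([("template A", 0.2), ("template B", 0.9)])
tied = select_final_template([("a", 0.9), ("b", 0.9), ("c", 0.1)])
```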
Fig. 2 is a structural diagram of a text rewriting apparatus provided by an embodiment of the present invention. As shown in Fig. 2, the apparatus includes: an acquiring unit 21, a generalization unit 22, a generation unit 23, a rewriting unit 24 and a training unit 25.
The acquiring unit 21 is configured to obtain the text content to be rewritten.
The text content to be rewritten that is obtained by the acquiring unit 21 may be a title that needs to be rewritten, or a search keyword that needs to be rewritten.
The generalization unit 22 is configured to determine the generalizable components of the text content and obtain the generalized template of the text content.
After the generalization unit 22 generalizes the text content obtained by the acquiring unit 21, the generalized template of the text content is obtained. When generalizing the text content, the generalization unit 22 first determines the generalizable components of the text content and then generalizes the text content based on the determined generalizable components.
Specifically, the generalization unit 22 may determine the generalizable components of the text content as follows: first, word segmentation is performed on the text content to obtain its segmentation result; then the segmentation result is parsed to obtain the part of speech of each word in the text content; finally, the generalizable components of the text content are determined based on a preset part-of-speech generalization requirement. The preset part-of-speech generalization requirement is that at least one of nouns, numerals and time words is generalized. The generalizable components of the text content therefore include at least one of nouns, numerals and time words. It can further be understood that the nouns to be generalized may include place names, personal names, various specific terms and common nouns.
After determining the generalizable components of the text content, the generalization unit 22 generalizes the text content to obtain its generalization result. Generalizing the generalizable components means replacing each generalizable component in the text content with the part-of-speech slot corresponding to that component; for example, "Qingdao" is a place noun and is generalized to "[place]", and "October" is a time word and is generalized to "[time]".
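The replacement of generalizable components by part-of-speech slots can be sketched as follows. This is an illustration under stated assumptions: the toy lexicon `SLOT_OF` stands in for a real word segmenter and part-of-speech tagger, the input is assumed to be already segmented, and the function name `generalize` is hypothetical.

```python
# Hypothetical toy lexicon standing in for a word segmenter plus part-of-speech tagger.
SLOT_OF = {"Qingdao": "[place]", "October": "[time]", "10": "[number]"}


def generalize(tokens):
    """Replace each generalizable token with its slot; return (template, slot_values)."""
    out, slot_values = [], {}
    for tok in tokens:
        slot = SLOT_OF.get(tok)
        if slot is None:
            out.append(tok)  # not generalizable: keep the word itself
        else:
            out.append(slot)
            slot_values[slot] = tok  # remembered so the slot can be restored later
    return " ".join(out), slot_values


template, slot_values = generalize(["10", "secrets", "about", "Qingdao"])
```

The returned `slot_values` mapping is what a later restoration step would use to put the original words back into a rewriting template.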
After obtaining the generalization result of the text content, the generalization unit 22 may also permute and combine the components contained in the generalization result to obtain the generalized templates of the text content. This is because the generalization result may contain many components, so the generalization result needs to be permuted and combined to obtain all generalized templates corresponding to the text content. For example, if the generalization result of some text content consists of the components "related to [place]", "[number]" and "[noun]", then after permuting and combining the generalization result, the resulting generalized templates may include "related to [place] [number] [noun]", "[number] related to [place] [noun]", "[number] [noun] related to [place]", etc.
The generation unit 23 is configured to generate in advance the rewriting templates corresponding to generalized templates.
When generating in advance the rewriting templates corresponding to generalized templates, the generation unit 23 may proceed as follows:
(1) Obtain a parallel corpus of text pairs.
The parallel corpus of text pairs obtained by the generation unit 23 consists of text pairs that are related both semantically and syntactically; that is, the texts in a text pair belonging to the parallel corpus are semantically related and syntactically related to each other.
Before obtaining the parallel corpus of text pairs, the generation unit 23 first needs to obtain a text corpus. The text corpus may be a query-query corpus for rewriting search keywords, a title-title corpus for rewriting titles, or a query-title corpus. This embodiment is described using title rewriting as an example, so the obtained text corpus is a title-title corpus.
The title-title corpus obtained by the generation unit 23 is drawn from the titles of all search results corresponding to a given search keyword. The generation unit 23 may therefore obtain the title-title corpus as follows: based on a display log, all search results obtained for one search keyword are acquired, and any pair of titles is selected from the titles corresponding to those search results as a title-title corpus entry.
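Forming title-title pairs from the results of one search keyword can be sketched as follows. The structure of the display log (a mapping from keyword to result titles) and the function name `title_pairs` are assumptions made for illustration; the sketch enumerates every pair, whereas the embodiment only requires that pairs be chosen arbitrarily.

```python
from itertools import combinations


def title_pairs(display_log):
    """display_log: dict mapping a search keyword to the titles of its search results."""
    pairs = []
    for keyword, titles in display_log.items():
        # Titles returned for the same keyword are assumed to be semantically related.
        pairs.extend(combinations(titles, 2))
    return pairs


log = {"qingdao": ["title 1", "title 2", "title 3"]}
pairs = title_pairs(log)
```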
In the text corpus obtained in the previous step, the texts contained in each corpus entry all correspond to the same search keyword, so it can be preliminarily assumed that the texts in a text pair have a certain semantic relevance. Therefore, after obtaining the text corpus, the generation unit 23 determines the syntactic relevance between the texts by determining the alignment score between the texts contained in each corpus entry.
Specifically, the generation unit 23 may determine the alignment score of a text pair as follows:
1) Word segmentation is first performed on each text to obtain the segmentation result of each text.
2) A preset deletion dictionary is used to mark the deletable components in the segmentation results.
The pre-established deletion dictionary records many meaningless, deletable words or phrases, such as "you don't know", "do you know" and "exposition"; these words or phrases do not affect the semantics or information content of the sentence as a whole. When the deletion dictionary is established, deletable components in the obtained text corpus can be counted, and the words or phrases whose deletion frequency is higher than a certain threshold are used as entries of the deletion dictionary.
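Building the deletion dictionary from deletion counts can be sketched as follows. How deletable occurrences are detected in the corpus is not specified here, so the input `deleted_phrases` (one entry per observed deletion) is an assumption, as are the names `build_deletion_dictionary` and `min_count`.

```python
from collections import Counter


def build_deletion_dictionary(deleted_phrases, min_count):
    """Keep phrases whose deletion count is strictly higher than min_count."""
    counts = Counter(deleted_phrases)
    return {phrase for phrase, n in counts.items() if n > min_count}


observed = ["do you know", "do you know", "do you know", "Qingdao"]
deletion_dictionary = build_deletion_dictionary(observed, min_count=2)
```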
After obtaining the segmentation result of each text, the generation unit 23 looks up the deletion dictionary to determine whether the segmentation result of each text contains deletable components, and if so, marks the deletable components contained in the segmentation result of each text.
3) The alignment probabilities of the unmarked components between the two segmentation results of the text pair are determined, and the alignment score of the text pair is determined using the alignment probabilities.
The alignment probabilities of the unmarked components between the two segmentation results of a text pair are the probability that the components contained in text one of the pair occur in text two, and the probability that the components contained in text two occur in text one. The alignment score of the text pair is determined from the resulting alignment probabilities.
For example, suppose text one of a text pair contains 5 components and text two also contains 5 components. If all 5 components of text one occur in text two, the alignment probability of text one with text two is 1; if 4 of the 5 components of text one occur in text two, the alignment probability of text one with text two is 0.8. Similarly, if all 5 components of text two occur in text one, the alignment probability of text two with text one is 1; if 3 of the 5 components of text two occur in text one, the alignment probability of text two with text one is 0.6.
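The arithmetic of this example can be reproduced with a short sketch. The function name `alignment_probability` and the `deletable` parameter are hypothetical; components are represented as plain strings, and components marked as deletable are skipped on both sides.

```python
def alignment_probability(source, target, deletable=frozenset()):
    """Fraction of the unmarked components of source that also occur in target."""
    src = [c for c in source if c not in deletable]
    tgt = {c for c in target if c not in deletable}
    if not src:
        return 0.0
    return sum(1 for c in src if c in tgt) / len(src)


text_one = ["a", "b", "c", "d", "e"]
p_four_of_five = alignment_probability(text_one, ["a", "b", "c", "d", "x"])
p_three_of_five = alignment_probability(["a", "b", "c", "x", "y"], text_one)
```

Here `p_four_of_five` is 4/5 and `p_three_of_five` is 3/5, matching the 0.8 and 0.6 cases of the example.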
The generation unit 23 determines the alignment score of the text pair from the alignment probabilities between the two segmentation results of the text pair. For example, if the alignment probability of text one with text two is 1 and the alignment probability of text two with text one is 0.8, the alignment score of the text pair may be (1, 0.8); alternatively, the two alignment probabilities may be averaged, in which case the alignment score of the text pair is 0.9. When the alignment score of a text pair meets a preset requirement, the generation unit 23 takes the text pair as part of the parallel corpus of text pairs. The generation unit 23 may use a preset threshold to determine which text pairs meet the preset requirement. If the alignment score of the text pair is (1, 0.8), both alignment probabilities in the alignment score must exceed the preset threshold before the text pair is determined to meet the preset requirement; if the alignment score of the text pair is 0.9, the text pair is determined to meet the preset requirement when the alignment score exceeds the preset threshold.
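The two threshold rules can be compared in a short sketch. The function name `is_parallel`, the `average` flag and the threshold value 0.85 are illustrative assumptions; the embodiment does not fix a particular threshold.

```python
def is_parallel(p_forward, p_reverse, threshold, average=False):
    """Decide whether a text pair qualifies for the parallel corpus."""
    if average:
        # Single averaged score compared with the threshold.
        return (p_forward + p_reverse) / 2 > threshold
    # Pair-valued score: both alignment probabilities must exceed the threshold.
    return p_forward > threshold and p_reverse > threshold


strict = is_parallel(1.0, 0.8, threshold=0.85)
averaged = is_parallel(1.0, 0.8, threshold=0.85, average=True)
```

With this threshold the pair (1, 0.8) is rejected by the pair-valued rule but accepted by the averaged rule, which shows that the two scoring forms are not interchangeable.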
(2) Determine the generalizable components of each text of the text pair based on the preset part-of-speech generalization requirement, and generalize each text based on the determined generalizable components.
Based on the preset part-of-speech generalization requirement, the generation unit 23 determines the generalizable components of each text of the text pair. In this embodiment, the preset part-of-speech generalization requirement is that at least one of nouns, numerals and time words is generalized; that is, at least one of the nouns, numerals and time words contained in each text of the text pair is taken as a generalizable component. After determining the generalizable components of each text, the generation unit 23 generalizes each text. Generalizing a text means replacing each generalizable component in the text with its corresponding part-of-speech slot. It can further be understood that, because the generalization result may contain many components, the generation unit 23 may also permute and combine the components contained in the generalization result of each text to obtain all possible generalization results.
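The permutation and combination of components can be sketched with the standard library. The split of a generalization result into the three components below, and the function name `permute_components`, are assumptions made for illustration.

```python
from itertools import permutations


def permute_components(components):
    """Return every ordering of the components, joined into a template string."""
    return [" ".join(order) for order in permutations(components)]


templates = permute_components(["related to [place]", "[number]", "[noun]"])
```

Three components yield six orderings; a real system would presumably filter out orderings that never occur in the corpus, which this sketch does not attempt.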
(3) Take the generalization result of one text of the pair as the generalized template, and the generalization result of the other text as the corresponding rewriting template.
After generalizing each text, the generation unit 23 obtains the generalization result of each text; the generalization result can be used to represent the sentence structure of the text. The generation unit 23 takes the generalization result of one text of the pair as the generalized template and the generalization result of the other text as the rewriting template, so that the generalized template and the rewriting template obtained from one text pair correspond to each other.
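Collecting these correspondences into a lookup table can be sketched as follows. The input is assumed to be a list of already generalized text pairs, and the name `build_template_table` is hypothetical.

```python
from collections import defaultdict


def build_template_table(generalized_pairs):
    """Map each generalized template to the rewriting templates paired with it."""
    table = defaultdict(list)
    for generalized, rewriting in generalized_pairs:
        if rewriting not in table[generalized]:  # skip duplicate pairs
            table[generalized].append(rewriting)
    return dict(table)


table = build_template_table([
    ("[number] secrets related to [place]", "[number] secrets about [place], you know them all"),
    ("[number] secrets related to [place]", "[place]: [number] secrets"),
    ("[number] secrets related to [place]", "[place]: [number] secrets"),
])
```

One generalized template can map to several rewriting templates, which is why a later scoring step is needed to choose among them.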
The rewriting unit 24 is configured to match the rewriting template corresponding to the generalized template and to rewrite the text content based on the rewriting template.
Based on the generalized template obtained by the generalization unit 22, the rewriting unit 24 uses the rewriting templates generated in advance by the generation unit 23 to find the rewriting template matching the generalized template, and then rewrites the text content based on the matched rewriting template to obtain the rewriting result of the text content. Each generalized template may correspond to at least one rewriting template, so the matching rewriting template can be determined from the obtained generalized template.
Before matching the rewriting template corresponding to the generalized template, the rewriting unit 24 may further apply a template expansion strategy to widen the range of rewriting templates that the generalized template can match.
Optionally, in one specific implementation of this embodiment, the rewriting unit 24 may perform synonym expansion on the components of the generalized template that were not generalized. Specifically, the rewriting unit 24 performs a synonymous rewrite of the non-generalized components of the generalized template, i.e. replaces components contained in the generalized template using synonyms, aliases and the like. For example, if the generalized template is "who is the wife of [name]" and a synonym of "wife" is "spouse", the rewriting unit 24 can rewrite the generalized template as "who is the spouse of [name]"; if the generalized template is "[number] [noun] of programmer" and an alias of "programmer" is "code farmer", the rewriting unit 24 can rewrite the generalized template as "[number] [noun] of code farmer".
Alternatively, the rewriting unit 24 may compress specific structures in the generalized template based on a preset compressible-structure dictionary. The compressible-structure dictionary contains the structures that can be compressed and their corresponding compression results; for example, an "attributive + noun" structure can be compressed to "noun", and a "number + noun" structure can be compressed to "noun". For example, if the text content is "Beijing's 10 delicacies" and its generalized template is "Beijing's [number] [noun]", then "10 delicacies" has the "number + noun" structure and the rewriting unit 24 compresses it to "[noun 1]", so the generalized template of the text content becomes "Beijing's [noun 1]". It can be understood that when template expansion is performed in this way, the compressed structure must be restored when the text is rewritten, i.e. "[noun 1]" is restored to "10 delicacies".
After the rewriting template corresponding to the generalized template is matched, the text content is rewritten based on the matched rewriting template, i.e. the generalized components in the rewriting template are restored to the corresponding words of the text content. For example, if the text content to be rewritten is "10 secrets related to Qingdao", its generalized template is "[number] secrets related to [place]", and the rewriting template corresponding to that generalized template is "[number] secrets about [place], you know them all", where the generalized component "[place]" corresponds to "Qingdao" and "[number]" corresponds to "10", then the final rewriting result is "10 secrets about Qingdao, you know them all".
It should also be understood that multiple rewriting templates may be obtained when rewriting the text content, so the rewriting unit 24 may also score the obtained rewriting templates and determine the final rewriting template according to the scoring result. After the rewriting unit 24 scores the rewriting templates using the evaluation model, the rewriting template that meets the preset requirement is taken as the final rewriting template according to the score of each rewriting template. If the scores of the rewriting templates differ, the rewriting template with the highest score is taken as the final rewriting template; if several rewriting templates share the highest score, any one of them is selected as the final rewriting template. The rewriting unit 24 rewrites the text content using the determined final rewriting template to obtain the rewriting result of the text content.
The training unit 25 is configured to train the evaluation model in advance.
The evaluation model used by the rewriting unit 24 to score the rewriting templates is obtained through training by the training unit 25. Specifically, the training unit 25 trains the evaluation model in advance as follows:
Training samples are obtained, where the training samples obtained by the training unit 25 include template pairs, each consisting of a generalized template and its corresponding rewriting template, and the scores annotated in advance for the rewriting templates. After extracting the matching features of the template pairs, the training unit 25 trains a logistic regression model with the extracted matching features as input and the annotated scores of the rewriting templates as output, thereby obtaining the evaluation model.
The matching features that the training unit 25 extracts from a template pair consisting of a generalized template and its corresponding rewriting template include: slot alignment information, including the forward slot alignment probability, the reverse alignment probability, the number of alignments, etc.; slot word-vector similarity, i.e. the cosine similarity of the slot word vectors; slot proper-name similarity, i.e. using a category-specific dictionary to judge whether the slots belong to the same category; slot literal similarity, i.e. the similarity computed after segmenting each slot down to the word level; slot-boundary language model value, i.e. the fluency at the boundaries after slot replacement; text alignment degree, i.e. determining whether all unaligned components appear in the body text; template alignment count, i.e. a count over the template pair, which reflects the confidence of the template; and estimated click score, i.e. the score obtained by using a click prediction model to estimate clicks on the template.
Fig. 3 shows a block diagram of an exemplary computer system/server 012 suitable for implementing embodiments of the present invention. The computer system/server 012 shown in Fig. 3 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 3, the computer system/server 012 takes the form of a general-purpose computing device. The components of the computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 connecting the different system components (including the system memory 028 and the processing unit 016).
The bus 018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The computer system/server 012 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer system/server 012, including volatile and non-volatile media, and removable and non-removable media.
The system memory 028 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 034 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 3, commonly referred to as a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media) may be provided. In these cases, each drive may be connected to the bus 018 through one or more data media interfaces. The memory 028 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 040 having a set of (at least one) program modules 042 may be stored, for example, in the memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 042 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer system/server 012 may also communicate with one or more external devices 014 (such as a keyboard, a pointing device, a display 024, etc.); in the present invention, the computer system/server 012 communicates with an external radar device. It may also communicate with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (such as a network card, a modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 022. In addition, the computer system/server 012 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 020. As shown in the figure, the network adapter 020 communicates with the other modules of the computer system/server 012 through the bus 018. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
The processing unit 016 executes various functional applications and data processing by running programs stored in the system memory 028, for example implementing a text rewriting method, which may include:
Obtaining text content to be rewritten;
Determining the generalizable components of the text content to obtain the generalized template of the text content;
Matching the rewriting template corresponding to the generalized template, and rewriting the text content based on the rewriting template.
The above computer program may be provided in a computer storage medium, i.e. the computer storage medium is encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above embodiments of the present invention. For example, the method flow performed by the one or more processors may include:
Obtaining text content to be rewritten;
Determining the generalizable components of the text content to obtain the generalized template of the text content;
Matching the rewriting template corresponding to the generalized template, and rewriting the text content based on the rewriting template.
With the passage of time and the development of technology, the meaning of "medium" has become increasingly broad, and the distribution of computer programs is no longer limited to tangible media; programs may also be downloaded directly from a network, etc. Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above. Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, a generalized template is obtained by generalizing the text content, a corresponding rewriting template is then matched according to the obtained generalized template, and the text content is rewritten according to the matched rewriting template. The rewriting can add or delete parts of the sentence, and the rewriting result differs substantially from the original, so that the whole sentence of the text content is rewritten and the user perceives the rewritten text more clearly.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (25)
1. A text rewriting method, characterized in that the method comprises:
Obtaining text content to be rewritten;
Determining the generalizable components of the text content to obtain the generalized template of the text content;
Matching the rewriting template corresponding to the generalized template, and rewriting the text content based on the rewriting template.
2. The method according to claim 1, characterized in that determining the generalizable components of the text content comprises:
Performing word segmentation on the text content to obtain the segmentation result of the text content;
Parsing the segmentation result to obtain the part of speech of each word in the text content;
Determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
3. The method according to claim 2, characterized in that the preset part-of-speech generalization requirement is: generalizing at least one of the nouns, numerals, and time words in the text content.
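Claims 2 and 3 describe segmentation, part-of-speech tagging, and selection by part of speech. The sketch below shows only the selection logic; the tag names, the toy tag table `TOY_POS`, and the helper names are assumptions made for illustration, and whitespace splitting again replaces a real segmenter.

```python
# Parts of speech that the preset requirement allows to be generalized
# (nouns, numerals and time words, following claim 3).
GENERALIZABLE_POS = {"noun", "numeral", "time"}

# Toy tag table standing in for a real part-of-speech tagger (assumption).
TOY_POS = {
    "book": "verb", "three": "numeral", "flights": "noun",
    "to": "prep", "Shanghai": "noun", "today": "time",
}


def segment(text):
    """Stand-in for word segmentation: split on whitespace."""
    return text.split()


def tag(tokens):
    """Attach a part of speech to every token; unknown words get 'other'."""
    return [(token, TOY_POS.get(token, "other")) for token in tokens]


def generalizable_components(text, allowed=GENERALIZABLE_POS):
    """Return (position, word, part of speech) for each generalizable word."""
    return [
        (index, token, pos)
        for index, (token, pos) in enumerate(tag(segment(text)))
        if pos in allowed
    ]
```

For "book three flights to Shanghai today" the function keeps the numeral, both nouns, and the time word, and skips the verb and the preposition.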
4. The method according to claim 1, characterized in that obtaining the generalized template of the text content comprises:
generalizing the text content based on the determined generalizable components to obtain generalization results;
obtaining the generalized template of the text content from the generalization results.
5. The method according to claim 1, characterized in that the rewriting template corresponding to the generalized template is generated in advance as follows:
obtaining a parallel corpus of text pairs;
determining, based on the preset part-of-speech generalization requirement, the generalizable components of each text in a text pair, and generalizing each text based on the determined generalizable components;
taking the generalization result of one text in the pair as a generalized template, and the generalization result of the other text as its corresponding rewriting template.
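The offline template generation of claim 5 can be sketched as below. It assumes, beyond what the claim states, that only words appearing in both texts of a pair are turned into slots, so that the slots of the rewriting template can later be filled from the input; `build_template_table`, `to_template`, and the tag names are hypothetical.

```python
def to_template(tokens, shared, pos_of):
    """Replace every shared generalizable word with its part-of-speech slot."""
    return " ".join(f"[{pos_of[t]}]" if t in shared else t for t in tokens)


def build_template_table(parallel_pairs, pos_of,
                         allowed=frozenset({"NOUN", "NUM", "TIME"})):
    """Map each generalized template to the set of its rewriting templates."""
    table = {}
    for source, target in parallel_pairs:
        src, tgt = source.split(), target.split()
        # Assumption: generalize only words that occur in both texts.
        shared = {t for t in src if pos_of.get(t) in allowed and t in tgt}
        if not shared:
            continue
        generalized = to_template(src, shared, pos_of)
        rewriting = to_template(tgt, shared, pos_of)
        table.setdefault(generalized, set()).add(rewriting)
    return table


pairs = [("how is the weather in Beijing tomorrow",
          "tomorrow weather forecast for Beijing")]
pos_of = {"Beijing": "NOUN", "tomorrow": "TIME"}
table = build_template_table(pairs, pos_of)
```

Storing a set of rewriting templates per generalized template leaves room for several candidate rewrites of the same pattern.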
6. The method according to claim 4 or 5, characterized in that the generalizing comprises:
generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
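One plausible reading of the "permutation and combination" in claim 6 is that every non-empty subset of the generalizable components yields one generalization result. The sketch below follows that reading, which is an assumption; the input is a list of already tagged tokens, and the tag names are illustrative.

```python
from itertools import combinations


def generalization_results(tagged, allowed=frozenset({"NOUN", "NUM", "TIME"})):
    """Return one template per non-empty subset of generalizable positions."""
    positions = [i for i, (_, pos) in enumerate(tagged) if pos in allowed]
    results = []
    for size in range(1, len(positions) + 1):
        for chosen in combinations(positions, size):
            chosen = set(chosen)
            results.append(" ".join(
                f"[{pos}]" if i in chosen else token
                for i, (token, pos) in enumerate(tagged)
            ))
    return results


tagged = [("weather", "NOUN"), ("in", "ADP"),
          ("Beijing", "NOUN"), ("tomorrow", "TIME")]
results = generalization_results(tagged)
```

With three generalizable words the function returns 2^3 - 1 = 7 templates, from "[NOUN] in Beijing tomorrow" up to "[NOUN] in [NOUN] [TIME]", so a later lookup can match templates of differing specificity.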
7. The method according to claim 5, characterized in that the parallel corpus of text pairs is obtained as follows:
obtaining a text corpus;
determining an alignment score between any text pair in the text corpus;
taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
8. The method according to claim 6, characterized in that determining the alignment score between any text pair in the text corpus comprises:
performing word segmentation on each text to obtain a word segmentation result of each text;
marking the deletable components in the word segmentation results using a preset deletion dictionary;
determining the alignment probabilities of the unmarked components between the two word segmentation results of the text pair, and using the alignment probabilities to determine the alignment score of the text pair.
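The corpus filtering of claims 7 and 8 can be illustrated with a small sketch. The claims do not say how alignment probabilities are computed or combined, so everything specific here is an assumption: the deletion dictionary `DELETABLE`, the word-pair probability table `ALIGN_PROB`, the averaging of best per-word probabilities in both directions, and the threshold value.

```python
from itertools import combinations

# Hypothetical deletion dictionary: words ignored when aligning.
DELETABLE = {"please", "the", "a"}

# Hypothetical word alignment probabilities, e.g. from an alignment model.
ALIGN_PROB = {("buy", "purchase"): 0.8, ("phone", "mobile"): 0.7}


def word_prob(a, b):
    """Alignment probability of two words; identical words align fully."""
    if a == b:
        return 1.0
    return ALIGN_PROB.get((a, b), ALIGN_PROB.get((b, a), 0.0))


def alignment_score(text_a, text_b):
    """Score a text pair from the words not marked as deletable."""
    kept_a = [t for t in text_a.split() if t not in DELETABLE]
    kept_b = [t for t in text_b.split() if t not in DELETABLE]
    if not kept_a or not kept_b:
        return 0.0
    forward = sum(max(word_prob(a, b) for b in kept_b) for a in kept_a) / len(kept_a)
    backward = sum(max(word_prob(b, a) for a in kept_a) for b in kept_b) / len(kept_b)
    return min(forward, backward)  # both directions must align well


def select_parallel_pairs(corpus, threshold=0.7):
    """Keep the text pairs whose alignment score reaches the threshold."""
    return [(a, b) for a, b in combinations(corpus, 2)
            if alignment_score(a, b) >= threshold]
```

Taking the minimum of the two directions is a design choice in this sketch: it prevents a short text from scoring well against a long text that merely contains it.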
9. The method according to claim 1, characterized in that, before matching the rewriting template corresponding to the generalized template, the method further comprises:
performing synonym expansion on the components of the generalized template that have not been generalized; or
compressing specific structures contained in the generalized template using a preset compressible-structure dictionary.
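The two pre-matching steps of claim 9 both aim to raise the chance that a generalized template finds a match. A minimal sketch, assuming a hypothetical synonym dictionary `SYNONYMS` and a hypothetical compressible-structure dictionary `COMPRESSIBLE` that maps a phrase to a shorter form:

```python
from itertools import product

# Hypothetical synonym dictionary for words that were not generalized.
SYNONYMS = {"buy": ["purchase"], "cheap": ["inexpensive", "affordable"]}

# Hypothetical compressible-structure dictionary: phrase -> shorter form.
COMPRESSIBLE = {"and also": "and"}


def is_slot(token):
    """Slots look like [NOUN]; they are never expanded."""
    return token.startswith("[") and token.endswith("]")


def expand_synonyms(template):
    """Return the template plus every variant built from synonyms."""
    options = [
        [token] if is_slot(token) else [token] + SYNONYMS.get(token, [])
        for token in template.split()
    ]
    return [" ".join(choice) for choice in product(*options)]


def compress(template):
    """Replace each compressible structure with its shorter form."""
    for phrase, short in COMPRESSIBLE.items():
        template = template.replace(phrase, short)
    return " ".join(template.split())  # normalize spacing
```

Each variant returned by `expand_synonyms` can then be looked up in the template table in turn.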
10. The method according to claim 1, characterized in that the method further comprises:
scoring the matched rewriting templates using an evaluation model;
rewriting the text content, according to the scoring results, using a rewriting template that meets a preset requirement.
11. The method according to claim 9, characterized in that the evaluation model is trained in advance as follows:
obtaining training samples, each training sample comprising a template pair, consisting of a generalized template and its corresponding rewriting template, and a pre-annotated score of the rewriting template;
training a logistic regression model with the matching features of the template pair as input and the annotated score as output, to obtain the evaluation model.
12. The method according to claim 11, characterized in that the matching features of the template pair comprise at least one of: slot alignment information, slot word-vector similarity, slot proper-noun similarity, slot literal similarity, slot boundary language model score, body-text alignment degree, template alignment count, and predicted click score.
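Claims 10 to 12 describe scoring matched rewriting templates with a logistic regression model trained on matching features. The sketch below trains such a model with plain stochastic gradient descent; it is an illustration only. The three features used (slot alignment, slot word-vector similarity, slot literal similarity), the 0/1 labels, the feature values, the learning rate, the epoch count, and the 0.5 threshold are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(model, features):
    """Score a template pair in (0, 1) from its matching features."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by stochastic gradient descent on log loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict((weights, bias), x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def select_templates(model, candidates, threshold=0.5):
    """Return candidate templates scoring at least threshold, best first."""
    scored = [(predict(model, feats), tpl) for tpl, feats in candidates]
    return [tpl for score, tpl in sorted(scored, reverse=True) if score >= threshold]


# Made-up training data: [slot alignment, vector similarity, literal similarity].
SAMPLES = [[1.0, 0.9, 0.8], [1.0, 0.8, 0.6], [0.0, 0.2, 0.1], [0.0, 0.3, 0.2]]
LABELS = [1, 1, 0, 0]
model = train(SAMPLES, LABELS)
```

A library implementation of logistic regression would normally replace the hand-written training loop; it is written out here only to keep the example self-contained.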
13. A text rewriting apparatus, characterized in that the apparatus comprises:
an acquiring unit, configured to obtain text content to be rewritten;
a generalization unit, configured to determine generalizable components of the text content and obtain a generalized template of the text content;
a rewriting unit, configured to match a rewriting template corresponding to the generalized template and to rewrite the text content based on the rewriting template.
14. The apparatus according to claim 13, characterized in that, when determining the generalizable components of the text content, the generalization unit specifically performs:
performing word segmentation on the text content to obtain a word segmentation result of the text content;
parsing the word segmentation result to obtain the part of speech of each word in the text content;
determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
15. The apparatus according to claim 14, characterized in that the preset part-of-speech generalization requirement is: generalizing at least one of the nouns, numerals, and time words in the text content.
16. The apparatus according to claim 13, characterized in that, when obtaining the generalized template of the text content, the generalization unit specifically performs:
generalizing the text content based on the determined generalizable components to obtain generalization results;
obtaining the generalized template of the text content from the generalization results.
17. The apparatus according to claim 13, characterized in that the apparatus further comprises a generation unit which, when generating in advance the rewriting template corresponding to the generalized template, specifically performs:
obtaining a parallel corpus of text pairs;
determining, based on the preset part-of-speech generalization requirement, the generalizable components of each text in a text pair, and generalizing each text based on the determined generalizable components;
taking the generalization result of one text in the pair as a generalized template, and the generalization result of the other text as its corresponding rewriting template.
18. The apparatus according to claim 16 or 17, characterized in that, when performing generalization, the generalization unit or the generation unit specifically performs:
generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
19. The apparatus according to claim 17, characterized in that, when obtaining the parallel corpus of text pairs, the generation unit specifically performs:
obtaining a text corpus;
determining an alignment score between any text pair in the text corpus;
taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
20. The apparatus according to claim 19, characterized in that, when determining the alignment score between any text pair in the text corpus, the generation unit specifically performs:
performing word segmentation on each text to obtain a word segmentation result of each text;
marking the deletable components in the word segmentation results using a preset deletion dictionary;
determining the alignment probabilities of the unmarked components between the two word segmentation results of the text pair, and using the alignment probabilities to determine the alignment score of the text pair.
21. The apparatus according to claim 13, characterized in that, before matching the rewriting template corresponding to the generalized template, the rewriting unit further performs:
performing synonym expansion on the components of the generalized template that have not been generalized; or
compressing specific structures contained in the generalized template using a preset compressible-structure dictionary.
22. The apparatus according to claim 13, characterized in that the rewriting unit is further configured to perform:
scoring the matched rewriting templates using an evaluation model;
rewriting the text content, according to the scoring results, using a rewriting template that meets a preset requirement.
23. The apparatus according to claim 22, characterized in that the apparatus further comprises a training unit which, when training the evaluation model in advance, specifically performs:
obtaining training samples, each training sample comprising a template pair, consisting of a generalized template and its corresponding rewriting template, and a pre-annotated score of the rewriting template;
training a logistic regression model with the matching features of the template pair as input and the annotated score as output, to obtain the evaluation model.
24. A device, characterized in that the device comprises:
one or more processors;
a storage apparatus, configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-12.
25. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the method according to any one of claims 1-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711138896.XA CN108121697B (en) | 2017-11-16 | 2017-11-16 | Method, device and equipment for text rewriting and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711138896.XA CN108121697B (en) | 2017-11-16 | 2017-11-16 | Method, device and equipment for text rewriting and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108121697A (en) | 2018-06-05 |
CN108121697B (en) | 2022-02-25 |
Family ID: 62228457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711138896.XA Active CN108121697B (en) | 2017-11-16 | 2017-11-16 | Method, device and equipment for text rewriting and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108121697B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5442546A (en) * | 1991-11-29 | 1995-08-15 | Hitachi, Ltd. | System and method for automatically generating translation templates from a pair of bilingual sentences |
CN101346716A (en) * | 2005-12-22 | 2009-01-14 | 国际商业机器公司 | A method and system for editing text with a find and replace function leveraging derivations of the find and replace input |
CN101470700A (en) * | 2007-12-28 | 2009-07-01 | 日电(中国)有限公司 | Text template generator, text generation equipment, text checking equipment and method thereof |
CN101639826A (en) * | 2009-09-01 | 2010-02-03 | 西北大学 | Text hidden method based on Chinese sentence pattern template transformation |
CN103020040A (en) * | 2011-09-27 | 2013-04-03 | 富士通株式会社 | Rewriting processing method and equipment of source languages, and machine translation system |
CN103186509A (en) * | 2011-12-29 | 2013-07-03 | 北京百度网讯科技有限公司 | Wildcard character class template generalization method and device and general template generalization method and system |
CN103678270A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Semantic unit extracting method and semantic unit extracting device |
CN106610972A (en) * | 2015-10-21 | 2017-05-03 | 阿里巴巴集团控股有限公司 | Query rewriting method and apparatus |
JP2017129994A (en) * | 2016-01-19 | 2017-07-27 | 日本電信電話株式会社 | Sentence rewriting device, method, and program |
CN106650943A (en) * | 2016-10-28 | 2017-05-10 | 北京百度网讯科技有限公司 | Auxiliary writing method and apparatus based on artificial intelligence |
Non-Patent Citations (4)
Title |
---|
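刘圆圆: "Template-based rewriting of sentences with several special structures" (基于模板的对几种特殊结构句子的语句改写), Modern Electronics Technique (《现代电子技术》) * |
林燕芬: "Research on template-based rewriting of Chinese complex sentences" (基于模板的汉语复句改写方法研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |
桑亚辉: "Research on automatic rewriting of Chinese sentences based on a template method" (基于模板方法的汉语语句自动改写研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |
谢碧清: "Research on Chinese sentence pattern rewriting algorithms" (中文句式改写算法研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |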
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241286A (en) * | 2018-09-21 | 2019-01-18 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating text |
CN109739968A (en) * | 2018-12-29 | 2019-05-10 | 北京猎户星空科技有限公司 | A kind of data processing method and device |
CN109766537A (en) * | 2019-01-16 | 2019-05-17 | 北京未名复众科技有限公司 | Study abroad document methodology of composition, device and electronic equipment |
CN110309280A (en) * | 2019-05-27 | 2019-10-08 | 重庆小雨点小额贷款有限公司 | A kind of corpus expansion method and relevant device |
CN110309280B (en) * | 2019-05-27 | 2021-11-09 | 重庆小雨点小额贷款有限公司 | Corpus expansion method and related equipment |
CN111666775A (en) * | 2020-05-21 | 2020-09-15 | 平安科技(深圳)有限公司 | Text processing method, device, equipment and storage medium |
CN111666775B (en) * | 2020-05-21 | 2023-08-22 | 平安科技(深圳)有限公司 | Text processing method, device, equipment and storage medium |
CN113822034A (en) * | 2021-06-07 | 2021-12-21 | 腾讯科技(深圳)有限公司 | Method and device for repeating text, computer equipment and storage medium |
CN113822034B (en) * | 2021-06-07 | 2024-04-19 | 腾讯科技(深圳)有限公司 | Method, device, computer equipment and storage medium for replying text |
CN113935306A (en) * | 2021-09-14 | 2022-01-14 | 有米科技股份有限公司 | Method and device for processing advertising pattern template |
CN115713071A (en) * | 2022-11-11 | 2023-02-24 | 北京百度网讯科技有限公司 | Training method of neural network for processing text and method for processing text |
CN115713071B (en) * | 2022-11-11 | 2024-06-18 | 北京百度网讯科技有限公司 | Training method for neural network for processing text and method for processing text |
Also Published As
Publication number | Publication date |
---|---|
CN108121697B (en) | 2022-02-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108121697A (en) | Method, device and equipment for text rewriting and computer storage medium | |
CN107204184B (en) | Audio recognition method and system | |
CN107908635B (en) | Method and device for establishing text classification model and text classification | |
CN109657054B (en) | Abstract generation method, device, server and storage medium | |
US10102191B2 (en) | Propagation of changes in master content to variant content | |
US20200210468A1 (en) | Document recommendation method and device based on semantic tag | |
WO2021135469A1 (en) | Machine learning-based information extraction method, apparatus, computer device, and medium | |
JP5744228B2 (en) | Method and apparatus for blocking harmful information on the Internet | |
WO2020233269A1 (en) | Method and apparatus for reconstructing 3d model from 2d image, device and storage medium | |
CN109493977A (en) | Text data processing method, device, electronic equipment and computer-readable medium | |
CN111597351A (en) | Visual document map construction method | |
CN109325201A (en) | Generation method, device, equipment and the storage medium of entity relationship data | |
US20120047172A1 (en) | Parallel document mining | |
CN105210055B (en) | According to the hyphenation device across languages phrase table | |
CN110569335B (en) | Triple verification method and device based on artificial intelligence and storage medium | |
CN110276023A (en) | POI changes event discovery method, apparatus, calculates equipment and medium | |
CN111144120A (en) | Training sentence acquisition method and device, storage medium and electronic equipment | |
US20180101521A1 (en) | Avoiding sentiment model overfitting in a machine language model | |
CN111259262A (en) | Information retrieval method, device, equipment and medium | |
CN111008309A (en) | Query method and device | |
CN110457683A (en) | Model optimization method, apparatus, computer equipment and storage medium | |
CN114595686A (en) | Knowledge extraction method, and training method and device of knowledge extraction model | |
US11074402B1 (en) | Linguistically consistent document annotation | |
CN108268602A (en) | Method, apparatus, device and computer storage medium for analyzing text topic points | |
JP2022093317A (en) | Computer-implemented method, system and computer program product (recognition and restructuring of previously presented information) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||