CN108121697A - Method, apparatus, device and computer storage medium for text rewriting - Google Patents

Publication number
CN108121697A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711138896.XA
Other languages
Chinese (zh)
Other versions
CN108121697B (en)
Inventor
袁德璋
付志宏
周古月
何径舟
张小彬
陈笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201711138896.XA
Publication of CN108121697A
Application granted
Publication of CN108121697B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a text rewriting method, comprising: obtaining text content to be rewritten; determining the generalizable components of the text content to obtain a generalized template of the text content; and matching a rewriting template corresponding to the generalized template and rewriting the text content based on the rewriting template. By generalizing the text content into a generalized template, matching the corresponding rewriting template, and rewriting the text content according to the matched template, the invention rewrites the text content as a whole sentence and thereby improves the rewriting result.

Description

Method, apparatus, device and computer storage medium for text rewriting
【Technical Field】
The present invention relates to natural language processing technology, and in particular to a method, apparatus, device and computer storage medium for text rewriting.
【Background Art】
Rewriting technology has many applications. For example, a search engine needs to rewrite user queries in order to expand recall; some textual resources need rewriting to improve their diversity; and some article titles need rewriting to make them more appealing. Most existing rewriting techniques are built on the statistical machine translation framework. Although this approach is highly controllable and accurate, it only rewrites locally: the differences it introduces are small, the sentence pattern of the rewritten sentence is identical to that of the original, and some specific rewriting needs cannot be met. A text rewriting method that improves the rewriting result is therefore urgently needed.
【Summary of the Invention】
In view of this, the present invention provides a method, apparatus, device and computer storage medium for text rewriting, which are used to improve the rewriting result for text content.
The technical solution adopted by the present invention to solve the technical problem is to provide a text rewriting method, the method comprising: obtaining text content to be rewritten; determining the generalizable components of the text content to obtain a generalized template of the text content; and matching a rewriting template corresponding to the generalized template and rewriting the text content based on the rewriting template.
According to a preferred embodiment of the present invention, determining the generalizable components of the text content comprises: performing word segmentation on the text content to obtain its segmentation result; parsing the segmentation result to obtain the part of speech of each word in the text content; and determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
According to a preferred embodiment of the present invention, the preset part-of-speech generalization requirement is: generalize at least one of the nouns, numerals and time words in the text content.
According to a preferred embodiment of the present invention, obtaining the generalized template of the text content comprises: generalizing the text content based on the determined generalizable components to obtain the generalization results; and obtaining the generalized template of the text content from the generalization results.
According to a preferred embodiment of the present invention, the rewriting template corresponding to a generalized template is generated in advance as follows: obtaining a parallel corpus of text pairs; determining, based on the preset part-of-speech generalization requirement, the generalizable components of each text in a text pair, and generalizing each text based on the determined components; and taking the generalization result of one text of the pair as the generalized template and the generalization result of the other text as the corresponding rewriting template.
According to a preferred embodiment of the present invention, the generalization comprises: generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
According to a preferred embodiment of the present invention, the parallel corpus of text pairs is obtained as follows: obtaining a text corpus; determining the alignment score between any text pair in the text corpus; and taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
According to a preferred embodiment of the present invention, determining the alignment score between any text pair in the text corpus comprises: performing word segmentation on each text to obtain its segmentation result; marking the deletable components in the segmentation results using a preset deletion dictionary; and determining the alignment probability of the unmarked components between the two segmentation results of the text pair, and determining the alignment score of the text pair from the alignment probability.
According to a preferred embodiment of the present invention, before matching the rewriting template corresponding to the generalized template, the method further comprises: performing synonym expansion on the components of the generalized template that have not been generalized; or compressing specific structures contained in the generalized template using a preset compressible-structure dictionary.
According to a preferred embodiment of the present invention, the method further comprises: scoring the matched rewriting templates using an evaluation model; and, according to the scoring result, rewriting the text content using the rewriting templates that meet a preset requirement.
According to a preferred embodiment of the present invention, the evaluation model is trained in advance as follows: obtaining training samples, each comprising a template pair consisting of a generalized template and its corresponding rewriting template, together with a pre-annotated score for the rewriting template; and training a logistic regression model with the matching features of the template pair as input and the annotated score as output, to obtain the evaluation model.
According to a preferred embodiment of the present invention, the matching features of the template pair include at least one of: slot alignment information, slot word-vector similarity, slot proper-name similarity, slot literal similarity, slot-boundary language model value, text alignment degree, template alignment count, and estimated click score.
The technical solution adopted by the present invention to solve the technical problem is also to provide a text rewriting apparatus, the apparatus comprising: an acquiring unit, configured to obtain text content to be rewritten; a generalization unit, configured to determine the generalizable components of the text content and obtain a generalized template of the text content; and a rewriting unit, configured to match a rewriting template corresponding to the generalized template and rewrite the text content based on the rewriting template.
According to a preferred embodiment of the present invention, when determining the generalizable components of the text content, the generalization unit specifically performs: performing word segmentation on the text content to obtain its segmentation result; parsing the segmentation result to obtain the part of speech of each word in the text content; and determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
According to a preferred embodiment of the present invention, the preset part-of-speech generalization requirement is: generalize at least one of the nouns, numerals and time words in the text content.
According to a preferred embodiment of the present invention, when obtaining the generalized template of the text content, the generalization unit specifically performs: generalizing the text content based on the determined generalizable components to obtain the generalization results; and obtaining the generalized template of the text content from the generalization results.
According to a preferred embodiment of the present invention, the apparatus further comprises a generation unit which, when generating in advance the rewriting template corresponding to a generalized template, specifically performs: obtaining a parallel corpus of text pairs; determining, based on the preset part-of-speech generalization requirement, the generalizable components of each text in a text pair, and generalizing each text based on the determined components; and taking the generalization result of one text of the pair as the generalized template and the generalization result of the other text as the corresponding rewriting template.
According to a preferred embodiment of the present invention, when performing generalization, the generalization unit or the generation unit specifically performs: generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
According to a preferred embodiment of the present invention, when obtaining the parallel corpus of text pairs, the generation unit specifically performs: obtaining a text corpus; determining the alignment score between any text pair in the text corpus; and taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
According to a preferred embodiment of the present invention, when determining the alignment score between any text pair in the text corpus, the generation unit specifically performs: performing word segmentation on each text to obtain its segmentation result; marking the deletable components in the segmentation results using a preset deletion dictionary; and determining the alignment probability of the unmarked components between the two segmentation results of the text pair, and determining the alignment score of the text pair from the alignment probability.
According to a preferred embodiment of the present invention, before matching the rewriting template corresponding to the generalized template, the rewriting unit further performs: performing synonym expansion on the components of the generalized template that have not been generalized; or compressing specific structures contained in the generalized template using a preset compressible-structure dictionary.
According to a preferred embodiment of the present invention, the rewriting unit is further configured to perform: scoring the matched rewriting templates using an evaluation model; and, according to the scoring result, rewriting the text content using the rewriting templates that meet a preset requirement.
According to a preferred embodiment of the present invention, the apparatus further comprises a training unit which, when training the evaluation model in advance, specifically performs: obtaining training samples, each comprising a template pair consisting of a generalized template and its corresponding rewriting template, together with a pre-annotated score for the rewriting template; and training a logistic regression model with the matching features of the template pair as input and the annotated score as output, to obtain the evaluation model.
As can be seen from the above technical solutions, the present invention first generalizes the text based on the preset part-of-speech generalization requirement to obtain its generalized template, then matches the rewriting template corresponding to the generalized template and rewrites the text based on the matched rewriting template, thereby improving the text rewriting result.
【Brief Description of the Drawings】
Fig. 1 is a flowchart of a text rewriting method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of a text rewriting apparatus provided by an embodiment of the present invention;
Fig. 3 is a block diagram of a computer system/server provided by an embodiment of the present invention.
【Detailed Description】
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are also intended to include plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein can be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" can be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a text rewriting method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes:
In 101, the text content to be rewritten is obtained.
In this step, the text content to be rewritten may be a title to be rewritten or a search keyword to be rewritten.
In 102, the generalizable components of the text content are determined, and the generalized template of the text content is obtained.
In this step, the text content obtained in step 101 is generalized to obtain its generalized template. To generalize the text content, its generalizable components are determined first, and the text content is then generalized based on those components.
Specifically, the generalizable components may be determined as follows: first perform word segmentation on the text content to obtain its segmentation result; then parse the segmentation result to obtain the part of speech of each word in the text content; finally, determine the generalizable components based on a preset part-of-speech generalization requirement. The preset requirement is: generalize at least one of nouns, numerals and time words. The generalizable components of the text content therefore include at least one of nouns, numerals and time words. It should also be understood that the generalized nouns may include place names, person names, various specific terms, common nouns and the like.
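The selection of generalizable components described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tiny lexicon `TOY_POS`, the label names and the pre-segmented English tokens are all assumptions that stand in for a real word segmenter and part-of-speech tagger.

```python
# Hypothetical toy lexicon; a real system would use a segmenter and POS tagger.
TOY_POS = {
    "10": "number",
    "secrets": "noun",
    "related to": "other",
    "Qingdao": "place",
    "October": "time",
}

# Preset requirement: nouns (including place names), numerals and time words.
GENERALIZABLE_POS = {"noun", "place", "number", "time"}


def tag(tokens):
    """Attach a part-of-speech label to each pre-segmented token."""
    return [(tok, TOY_POS.get(tok, "other")) for tok in tokens]


def generalizable_components(tokens):
    """Return (index, token, pos) for every token the requirement allows."""
    return [
        (i, tok, pos)
        for i, (tok, pos) in enumerate(tag(tokens))
        if pos in GENERALIZABLE_POS
    ]


tokens = ["10", "secrets", "related to", "Qingdao"]
components = generalizable_components(tokens)
```

Here `components` holds the numeral, the common noun and the place name, while "related to" is left as literal text.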
Once the generalizable components are determined, the text content can be generalized to obtain its generalization result. Generalizing a component means replacing it with the part-of-speech slot that corresponds to it. For example, "Qingdao" is a place noun and is generalized to "[place]"; "October" is a time word and is generalized to "[time]".
After the generalization result is obtained, the components it contains are permuted and combined to obtain the generalized templates of the text content. This is necessary because a generalization result may contain many components, so permutation and combination is needed to obtain all the generalized templates that correspond to the text content. For example, if the generalization result of some text content is "[number] [noun] related to [place]", the generalized templates obtained after permutation and combination may include "[number] [noun] related to [place]", "[number] related to [place] [noun]", "[number] [noun] related [place]", etc.
The process of generalizing text content into generalized templates is illustrated as follows. If the text content to be rewritten is "10 secrets related to Qingdao", word segmentation is performed first, yielding a segmentation result that includes "related to", "Qingdao", "10" and "secrets". Part-of-speech parsing is then applied to the segmentation result to determine the part of speech of each word, e.g. "Qingdao" is a place and "10" is a numeral. Next, "Qingdao" and "secrets", which are nouns, and "10", which is a numeral, are generalized; the generalization results can be "[number] [noun] related to [place]", "[number] secrets related to [place]", "[number] [noun] related to Qingdao", etc. Finally, all generalization results are permuted and combined to obtain the generalized templates of the text content, which may include "[number] [noun] related to [place]", "[number] secrets related to [place]", "[number] [noun] related to Qingdao", etc.
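One way to read "permutation and combination" in the example above is as enumerating every non-empty subset of the generalizable components and generalizing only that subset. The sketch below follows that reading, which is an assumption; the function name and the English tokens are illustrative only.

```python
from itertools import combinations


def generalized_templates(tagged, generalizable):
    """Build one template per non-empty subset of generalizable positions."""
    positions = [i for i, (_, pos) in enumerate(tagged) if pos in generalizable]
    templates = []
    for size in range(1, len(positions) + 1):
        for subset in combinations(positions, size):
            chosen = set(subset)
            # Tokens in the chosen subset become slots; the rest stay literal.
            templates.append(" ".join(
                f"[{pos}]" if i in chosen else tok
                for i, (tok, pos) in enumerate(tagged)
            ))
    return templates


tagged = [("10", "number"), ("secrets", "noun"),
          ("related to", "other"), ("Qingdao", "place")]
templates = generalized_templates(tagged, {"number", "noun", "place"})
```

With three generalizable components this yields seven templates, including the fully generalized "[number] [noun] related to [place]" and the partially generalized "[number] secrets related to [place]".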
In 103, a rewriting template corresponding to the generalized template is matched, and the text content is rewritten based on the rewriting template.
In this step, the rewriting template corresponding to the generalized template obtained in step 102 is matched, and the text content is rewritten based on the matched rewriting template to obtain the rewriting result. Each generalized template may correspond to at least one rewriting template, so the matching rewriting template can be determined from the generalized template.
Specifically, the rewriting templates corresponding to generalized templates are generated in advance as follows:
(1) Obtain a parallel corpus of text pairs.
In this step, the parallel corpus consists of text pairs that are related both semantically and syntactically; that is, the two texts of a pair belonging to the parallel corpus are semantically related and syntactically related.
Before the parallel corpus of text pairs is obtained, a text corpus must first be obtained. The text corpus may be a query-query corpus for rewriting search keywords, a title-title corpus for rewriting titles, or a query-title corpus. This embodiment takes title rewriting as an example, so the text corpus obtained is a title-title corpus.
The title-title corpus consists of the titles of all search results for a given search keyword. It may therefore be obtained as follows: based on the display log, obtain all search results returned for one search keyword, and select any two of the titles of those search results as one title-title corpus entry.
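The construction of title-title pairs from a display log can be sketched as below. The log format, a list of (search keyword, result title) records, and the sample records are hypothetical; the source does not specify how the log is stored.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical display-log records: (search keyword, title of a shown result).
display_log = [
    ("qingdao secrets", "10 secrets related to Qingdao"),
    ("qingdao secrets", "10 secrets about Qingdao, do you know them all"),
    ("qingdao secrets", "Qingdao travel notes"),
    ("python tutorial", "Learn Python"),
]


def title_pairs(log):
    """Group titles by search keyword and pair every two titles of a group."""
    titles_by_query = defaultdict(list)
    for query, title in log:
        if title not in titles_by_query[query]:
            titles_by_query[query].append(title)
    pairs = []
    for titles in titles_by_query.values():
        pairs.extend(combinations(titles, 2))
    return pairs


pairs = title_pairs(display_log)
```

Every pair produced this way shares a search keyword, which is what lets the later steps treat the two titles as tentatively semantically related.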
In the text corpus obtained in the previous step, the texts of each corpus entry correspond to the same search keyword, so the texts of a pair can be tentatively assumed to be semantically related. After the text corpus is obtained, the syntactic correlation between the texts of each entry is therefore determined by computing the alignment score between them.
Specifically, the alignment score between the texts of each corpus entry may be determined as follows:
1) First perform word segmentation on each text to obtain its segmentation result.
2) Use a preset deletion dictionary to mark the deletable components in the segmentation results.
The pre-built deletion dictionary records many meaningless, deletable words or phrases, such as "you don't know", "do you know" and "exposé", which do not affect the semantics or information content of the whole sentence. The deletion dictionary can be built by counting the deletable components in the text corpus already obtained and taking the words or phrases whose deletion frequency exceeds a certain threshold as its entries.
In this step, after the segmentation result of each text is obtained, the deletion dictionary is used to check whether the segmentation result contains deletable components; if so, those components are marked.
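Building the deletion dictionary by frequency and then marking deletable components can be sketched as follows. How deletions are observed in the corpus is not detailed in the source, so the input list `observed_deletions` and the threshold value are assumptions made for illustration.

```python
from collections import Counter


def build_deletion_dictionary(observed_deletions, min_count):
    """Keep phrases whose observed deletion count exceeds the threshold."""
    counts = Counter(observed_deletions)
    return {phrase for phrase, count in counts.items() if count > min_count}


def mark_deletable(tokens, deletion_dictionary):
    """Pair each token with a flag saying whether it is deletable."""
    return [(tok, tok in deletion_dictionary) for tok in tokens]


# Hypothetical observations of phrases that were dropped between related titles.
observed_deletions = ["do you know"] * 3 + ["you don't know"] * 2 + ["Qingdao"]
deletion_dictionary = build_deletion_dictionary(observed_deletions, min_count=1)
marked = mark_deletable(["10", "secrets", "do you know"], deletion_dictionary)
```

A phrase seen only once, such as "Qingdao" here, stays out of the dictionary, so content words are not stripped by accident.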
3) Determine the alignment probability of the unmarked components between the two segmentation results of the text pair, and use the alignment probability to determine the alignment score of the text pair.
The alignment probability of the unmarked components between the two segmentation results is the probability that the components of text one appear in text two and the probability that the components of text two appear in text one. The alignment score of the text pair is determined from these alignment probabilities.
For example, suppose text one of a pair contains 5 components and text two also contains 5 components. If all 5 components of text one appear in text two, the alignment probability of text one to text two is 1; if 4 of the 5 appear in text two, it is 0.8. Similarly, if all 5 components of text two appear in text one, the alignment probability of text two to text one is 1; if 3 of the 5 appear in text one, it is 0.6.
The alignment score of the text pair is determined from the alignment probabilities between the two segmentation results. For example, if the alignment probability of text one to text two is 1 and that of text two to text one is 0.8, the alignment score of the pair may be (1, 0.8); alternatively, the two alignment probabilities may be averaged, giving an alignment score of 0.9. When the alignment score meets the preset requirement, the text pair is taken as part of the parallel corpus. A preset threshold may be used to decide whether a pair meets the requirement: if the alignment score is (1, 0.8), both alignment probabilities must exceed the preset threshold for the pair to qualify; if the alignment score is 0.9, the pair qualifies when that score exceeds the preset threshold.
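The alignment computation above can be sketched as follows, treating the alignment probability as the fraction of one text's unmarked components that also occur in the other text. The function names, the sample tokens and the threshold value 0.75 are assumptions for illustration.

```python
def alignment_probability(source, target):
    """Fraction of the components of `source` that also occur in `target`."""
    if not source:
        return 0.0
    target_set = set(target)
    return sum(1 for tok in source if tok in target_set) / len(source)


def alignment_score(tokens_one, tokens_two, deletable):
    """Return (one-to-two, two-to-one, average) over non-deletable components."""
    one = [t for t in tokens_one if t not in deletable]
    two = [t for t in tokens_two if t not in deletable]
    forward = alignment_probability(one, two)
    backward = alignment_probability(two, one)
    return forward, backward, (forward + backward) / 2


def is_parallel(forward, backward, threshold=0.75):
    """Pair-form check: both directions must exceed the threshold."""
    return forward > threshold and backward > threshold


text_one = ["10", "secrets", "about", "Qingdao"]
text_two = ["about", "Qingdao", "10", "secrets", "do you know them all"]
with_dictionary = alignment_score(text_one, text_two, {"do you know them all"})
without_dictionary = alignment_score(text_one, text_two, set())
```

With the deletable phrase marked, both directions align fully; without the deletion dictionary, the extra phrase in text two lowers the two-to-one probability to 4 out of 5.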
(2) Determine the generalizable components of each text in the text pair based on the preset part-of-speech generalization requirement, and generalize each text based on the determined components.
Based on the preset part-of-speech generalization requirement, the generalizable components of each text in the pair are determined. In this embodiment the preset requirement is to generalize at least one of nouns, numerals and time words; that is, at least one of the nouns, numerals and time words contained in each text of the pair is taken as a generalizable component. After the generalizable components of each text are determined, each text is generalized, i.e. each generalizable component is replaced with its corresponding part-of-speech slot. It should also be understood that, because a generalization result may contain many components, the components in the generalization result of each text must also be permuted and combined to obtain all possible generalization results.
(3) Take the generalization result of one text of the pair as the generalized template and the generalization result of the other text as the corresponding rewriting template.
Generalizing each text yields its generalization result, which represents the sentence structure of that text. The generalization result of one text is taken as the generalized template and that of the other text as the rewriting template; that is, the generalized template and the rewriting template obtained from one text pair correspond to each other.
The above process is illustrated as follows. Suppose the title-title corpus entry obtained is "50 secrets related to Kyoto" and "50 secrets about Kyoto, do you know them all". The deletion dictionary is used to mark the deletable components in the two texts, e.g. "do you know them all" is marked as deletable. The alignment score of the text pair is then obtained: if all words other than the deletable components align between the two texts, both alignment probabilities are 1, so the pair can be used as parallel corpus. Once the pair is determined to be parallel corpus, the generalizable components of the two texts are determined based on the preset part-of-speech generalization requirement. If "Kyoto [place]", "50 [number]" and "secrets [noun]" are generalizable components, the generalization results of the two texts are "[number] [noun] related to [place]" and "[number] [noun] about [place], do you know them all (deletable)". "[number] [noun] related to [place]" can then be used as the generalized template, and "[number] [noun] about [place], do you know them all" as the corresponding rewriting template.
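Turning one parallel text pair into a (generalized template, rewriting template) entry can be sketched as below. The slot assignment `slots` and the dictionary `template_table` are illustrative names, and only the fully generalized form is produced here rather than every combination.

```python
def generalize(tokens, slots):
    """Replace every token listed in `slots` with its part-of-speech slot."""
    return " ".join(f"[{slots[tok]}]" if tok in slots else tok for tok in tokens)


# Hypothetical slot assignment shared by both texts of the pair.
slots = {"Qingdao": "place", "10": "number", "secrets": "noun"}
text_one = ["10", "secrets", "related to", "Qingdao"]
text_two = ["10", "secrets", "about", "Qingdao", "do you know them all"]

# Map each generalized template to the rewriting templates learned for it.
template_table = {}
template_table.setdefault(generalize(text_one, slots), []).append(
    generalize(text_two, slots)
)
```

Keeping a list per generalized template reflects that one generalized template may correspond to several rewriting templates.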
In this step, before the rewriting template corresponding to the generalized template is matched, a template expansion strategy may additionally be applied to widen the range of rewriting templates that the generalized template can match.
Optionally, in one specific implementation of this embodiment, synonym expansion may be applied to the components of the generalized template that have not been generalized. Specifically, those components are rewritten synonymously, i.e. their content is replaced using synonyms, aliases and the like. For example, if the generalized template is "who is the missus of [person name]" and a synonym of "missus" is "wife", the generalized template can be rewritten as "who is the wife of [person name]"; if the generalized template is "[number] [noun] of programmers" and an alias of "programmer" is "code farmer", the generalized template can be rewritten as "[number] [noun] of code farmers".
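Synonym expansion of the non-generalized components can be sketched as follows. The synonym table and the one-substitution-per-variant policy are assumptions; the source only states that synonyms and aliases replace the components that were not generalized.

```python
# Hypothetical synonym and alias table.
SYNONYMS = {"programmers": ["code farmers"], "missus": ["wife"]}


def is_slot(token):
    """Slots are written in square brackets, e.g. "[number]"."""
    return token.startswith("[") and token.endswith("]")


def expand_synonyms(template_tokens, synonyms):
    """Return the template plus variants with one literal token swapped."""
    variants = [" ".join(template_tokens)]
    for i, tok in enumerate(template_tokens):
        if is_slot(tok):
            continue  # generalized components are never replaced
        for alternative in synonyms.get(tok, []):
            swapped = template_tokens[:i] + [alternative] + template_tokens[i + 1:]
            variants.append(" ".join(swapped))
    return variants


variants = expand_synonyms(["[number]", "[noun]", "of", "programmers"], SYNONYMS)
```

Each variant can then be looked up against the rewriting templates, so a template learned for one wording also serves its synonyms.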
Specific structures in the generalized template may also be compressed based on a preset compressible-structure dictionary. The dictionary contains the structures that can be compressed and their compression results; for example, the structure "attributive + noun" can be compressed to "noun", and the structure "number + noun" can be compressed to "noun". For instance, if the text content is "Beijing's 10 delicacies" and its generalized template is "Beijing's [number] [noun]", in which "10 delicacies" has the structure "number + noun", that structure is compressed to "[noun 1]" and the generalized template becomes "Beijing's [noun 1]". It is understood that when templates are expanded in this way, the compressed structure must be restored when the text is rewritten, i.e. "[noun 1]" is restored to "10 delicacies".
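Structure compression and the later restoration can be sketched as below. For brevity the sketch works on tagged tokens rather than on an already generalized template, and it only handles two-element structures; both simplifications, like the names `COMPRESSIBLE`, `compress` and `restore_slots`, are assumptions.

```python
# Hypothetical compressible-structure dictionary: tag pair -> compressed tag.
COMPRESSIBLE = {("number", "noun"): "noun", ("attributive", "noun"): "noun"}


def compress(tokens, tags, patterns=COMPRESSIBLE):
    """Collapse each matching tag pair into one numbered slot."""
    compressed, mapping, i, counter = [], {}, 0, 0
    while i < len(tokens):
        pair = tuple(tags[i:i + 2])
        if pair in patterns:
            counter += 1
            slot = f"[{patterns[pair]} {counter}]"
            mapping[slot] = tokens[i:i + 2]  # remembered for restoration
            compressed.append(slot)
            i += 2
        else:
            compressed.append(tokens[i])
            i += 1
    return compressed, mapping


def restore_slots(tokens, mapping):
    """Expand numbered slots back into the words they replaced."""
    restored = []
    for tok in tokens:
        restored.extend(mapping.get(tok, [tok]))
    return restored


tokens = ["Beijing's", "10", "delicacies"]
tags = ["other", "number", "noun"]
compressed, mapping = compress(tokens, tags)
```

Keeping `mapping` alongside the compressed template is what allows "[noun 1]" to be turned back into "10 delicacies" after rewriting.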
After the rewriting template corresponding to the generalized template has been matched, the text content is rewritten based on the matched rewriting template, and the generalized components in the rewriting template are restored to the corresponding words of the text content. For example, if the text content to be rewritten is "10 secrets concerning Qingdao", its generalized template is "[number] secrets concerning [place]", and the corresponding rewriting template is "[number] secrets about [place], do you know them all", where the generalized component "[place]" corresponds to "Qingdao" and "[number]" corresponds to "10", then the final rewriting result is "10 secrets about Qingdao, do you know them all".
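The restoration step in the Qingdao example can be sketched as below. This is only one possible realization: turning the generalized template into a regular expression is an assumption of this sketch, as are the function names, and a template that repeats the same slot label would need distinct labels.

```python
import re

SLOT = re.compile(r"\[[^\]]+\]")  # matches slots such as "[number]" or "[place]"

def extract_slots(generalized_template, text):
    """Match text against the template and return {slot: original word}."""
    slots = SLOT.findall(generalized_template)
    literals = SLOT.split(generalized_template)
    pattern = "".join(
        re.escape(literal) + ("(.+?)" if i < len(slots) else "")
        for i, literal in enumerate(literals)
    )
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    return dict(zip(slots, match.groups()))

def fill(rewriting_template, slot_values):
    """Restore each slot of the rewriting template to its original word."""
    return SLOT.sub(lambda m: slot_values[m.group(0)], rewriting_template)

values = extract_slots("[number] secrets concerning [place]",
                       "10 secrets concerning Qingdao")
result = fill("[number] secrets about [place], do you know them all", values)
print(result)
```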
It should also be understood that several rewriting templates may be found when matching the rewriting template corresponding to the generalized template. The rewriting templates can therefore be scored, and the rewriting template used to rewrite the text content is determined according to the scoring result. The rewriting templates can be scored using an evaluation model obtained by training in advance.
Specifically, the evaluation model is trained in advance as follows: training samples are obtained, each of which includes a template pair, consisting of a generalized template and its corresponding rewriting template, and a score labeled in advance for the rewriting template; the matching features of the template pair are extracted; and a logistic regression model is trained with the extracted matching features as input and the labeled score of the rewriting template as output, yielding the evaluation model.
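A minimal sketch of training such a logistic regression model is shown below, written in plain Python with stochastic gradient descent. The two features, the toy training data, the binary labels and the hyperparameters are all assumptions for illustration; the embodiment does not specify them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to the logit
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def score(model, x):
    """Score a template pair from its matching features."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Toy samples: [slot alignment probability, slot word-vector similarity] (assumed).
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]  # labeled score of the rewriting template (assumed binary here)
model = train_logistic_regression(X, y)
print(score(model, [0.9, 0.9]), score(model, [0.1, 0.1]))
```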
The matching features extracted for a template pair, consisting of a generalized template and its corresponding rewriting template, include: slot alignment information, including the forward slot alignment probability, the reverse alignment probability, the alignment count and so on; slot word-vector similarity, i.e. the cosine similarity of the slot word vectors; slot proper-noun similarity, which uses a category-specific dictionary to judge whether the slots belong to the same category; slot literal similarity, i.e. the similarity computed after each slot is split down to the character level; the slot-boundary language model value, i.e. the fluency at the boundary after slot replacement; the text alignment degree, which determines whether all unaligned components occur in the body text; the template alignment count, a count over the template that reflects its confidence; and the click estimation score, obtained by using a click prediction model to estimate clicks on the template.
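Two of the features listed above can be sketched as follows. The cosine similarity follows its standard definition; the character-level measure is shown as a Jaccard overlap, which is an assumption of this sketch because the embodiment does not name the measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two slot word vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def literal_similarity(slot_a, slot_b):
    """Character-level Jaccard overlap (an assumed choice of measure)."""
    chars_a, chars_b = set(slot_a), set(slot_b)
    if not chars_a and not chars_b:
        return 1.0
    return len(chars_a & chars_b) / len(chars_a | chars_b)

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
print(literal_similarity("Kyoto", "Tokyo"))
```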
After the rewriting templates have been scored using the evaluation model, the rewriting template that meets a preset requirement is taken as the final rewriting template according to the score of each rewriting template. If the scores of the rewriting templates all differ, the rewriting template with the highest score is taken as the final rewriting template; if several rewriting templates share the highest score, any one of them is selected as the final rewriting template. Once the final rewriting template has been determined, the text content is rewritten using it: the generalizable components of the text content are restored, and the rewriting result of the text content is obtained.
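The selection rule can be sketched as below. Taking the first of several top-scoring templates is just one way to "select any one"; the function name and sample scores are hypothetical.

```python
def choose_final_template(scored_templates):
    """Pick the highest-scoring rewriting template.

    scored_templates: non-empty list of (template, score) pairs.
    When several templates share the top score, the first one is returned.
    """
    best_score = max(score for _, score in scored_templates)
    top = [template for template, score in scored_templates if score == best_score]
    return top[0]

final = choose_final_template([
    ("[number] secrets about [place], do you know them all", 0.91),
    ("[place]: [number] secrets", 0.64),
])
print(final)
```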
Fig. 2 is a structural diagram of a text rewriting apparatus provided by an embodiment of the present invention. As shown in Fig. 2, the apparatus includes: an acquiring unit 21, a generalization unit 22, a generation unit 23, a rewriting unit 24 and a training unit 25.
Acquiring unit 21 is configured to obtain the text content to be rewritten.
The text content to be rewritten obtained by acquiring unit 21 may be a title that needs to be rewritten, or a search keyword that needs to be rewritten.
Generalization unit 22 is configured to determine the generalizable components of the text content and obtain the generalized template of the text content.
Generalization unit 22 generalizes the text content obtained by acquiring unit 21 to obtain the generalized template of the text content. When generalizing the text content, generalization unit 22 first determines the generalizable components of the text content and then generalizes the text content based on the determined generalizable components.
Specifically, generalization unit 22 can determine the generalizable components of the text content in the following way: first, word segmentation is performed on the text content to obtain its word segmentation result; the word segmentation result is then parsed to obtain the part of speech of each word in the text content; finally, the generalizable components of the text content are determined based on the preset part-of-speech generalization requirement. The preset part-of-speech generalization requirement is that at least one of nouns, numerals and time words is generalized. The generalizable components of the text content therefore include at least one of nouns, numerals and time words. It should also be understood that the nouns to be generalized can include place names, person names, various proper nouns, common nouns and so on.
After determining the generalizable components of the text content, generalization unit 22 generalizes the text content to obtain its generalized result. Generalizing the generalizable components of the text content means replacing each generalizable component with the part-of-speech slot that corresponds to it; for example, "Qingdao" is a place noun and is generalized to "[place]", and "October" is a time word and is generalized to "[time]".
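The replacement of generalizable components by part-of-speech slots can be sketched as follows, assuming that word segmentation and part-of-speech tagging have already produced (word, tag) pairs. The tag names and the mapping `SLOT_BY_POS` are hypothetical.

```python
# Hypothetical mapping from part-of-speech tags to slot labels.
SLOT_BY_POS = {
    "place": "[place]",
    "person": "[name]",
    "number": "[number]",
    "time": "[time]",
    "noun": "[noun]",
}

def generalize(tagged_tokens):
    """Replace each generalizable word with its part-of-speech slot.

    tagged_tokens: list of (word, pos) pairs from a segmenter and tagger.
    Returns the generalized tokens and the (slot, word) pairs needed to
    restore the original words later.
    """
    template, values = [], []
    for word, pos in tagged_tokens:
        slot = SLOT_BY_POS.get(pos)
        if slot is None:
            template.append(word)  # not generalizable, keep the word itself
        else:
            template.append(slot)
            values.append((slot, word))
    return template, values

tagged = [("10", "number"), ("secrets", "noun"),
          ("concerning", "prep"), ("Qingdao", "place")]
template, values = generalize(tagged)
print(" ".join(template))
```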
After obtaining the generalized result of the text content, generalization unit 22 can also permute and combine the components contained in the generalized result to obtain the generalized templates of the text content. Because the generalized result of a text content may contain many components, the generalized result needs to be permuted and combined in order to obtain all the generalized templates corresponding to the text content. For example, if the generalized result of a text content is "[number] [noun] concerning [place]", the generalized templates obtained after permuting and combining this result can include "[number] [noun] concerning [place]", "[number] concerning [place] [noun]", "concerning [place] [number] [noun]" and so on.
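One simple reading of "permutation and combination" is to enumerate every ordering of the components, as in the sketch below. How the generalized result is split into components is an assumption here; the embodiment does not state it.

```python
from itertools import permutations

def candidate_templates(components):
    """Return every ordering of the components of a generalized result."""
    return [" ".join(order) for order in permutations(components)]

# Assumed split of "[number] [noun] concerning [place]" into three components.
templates = candidate_templates(["concerning [place]", "[number]", "[noun]"])
for t in templates:
    print(t)
```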
Generation unit 23 is configured to generate in advance the rewriting template corresponding to the generalized template.
Generation unit 23 can generate the rewriting template corresponding to the generalized template in advance in the following way:
(1) Obtain parallel corpus of text pairs.
The parallel corpus of text pairs obtained by generation unit 23 consists of text pairs that are related both semantically and syntactically; that is, the texts in a text pair belonging to the parallel corpus are semantically related and syntactically related.
Before obtaining the parallel corpus of text pairs, generation unit 23 first needs to obtain a text corpus. The text corpus can be a query-query corpus used for rewriting search keywords, a title-title corpus used for rewriting titles, or a query-title corpus. The present embodiment takes title rewriting as an example, so the text corpus obtained is a title-title corpus.
The title-title corpus obtained by generation unit 23 consists of the titles of all search results for a given search keyword. Generation unit 23 can therefore obtain the title-title corpus in the following way: based on the display log, all search results returned for a search keyword are obtained, and a pair of titles is selected arbitrarily from the titles of those search results as a title-title corpus pair.
Because the texts in each corpus pair obtained in the previous step all correspond to the same search keyword, it can be preliminarily assumed that the texts in a pair have a certain semantic relevance. After obtaining the text corpus, generation unit 23 therefore determines the syntactic relevance between the texts by determining the alignment score between the texts contained in each corpus pair.
Specifically, generation unit 23 can determine the alignment score of a text pair in the following way:
1) Word segmentation is first performed on each text to obtain the word segmentation result of each text.
2) A preset deletion dictionary is used to mark the deletable components in the word segmentation results.
The pre-established deletion dictionary records many meaningless, deletable words or phrases, such as "that you don't know", "do you know" and "the big reveal"; these words or phrases do not affect the semantics or information content of the whole sentence. When the deletion dictionary is established, the deletable components in the text corpus already obtained can be counted, and the words or phrases whose deletion frequency is higher than a certain threshold are taken as the entries of the deletion dictionary.
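Building the deletion dictionary from deletion counts might be sketched as follows. The input is assumed to be a list of phrases that were observed as deleted in the corpus; how those observations are collected, and the threshold value, are not specified by the embodiment.

```python
from collections import Counter

def build_deletion_dictionary(observed_deletions, threshold):
    """Keep the phrases whose deletion count is higher than the threshold."""
    counts = Counter(observed_deletions)
    return {phrase for phrase, count in counts.items() if count > threshold}

# Toy observations (assumed data).
observed = ["do you know"] * 3 + ["the big reveal"] * 2 + ["in Beijing"]
deletion_dictionary = build_deletion_dictionary(observed, threshold=1)
print(sorted(deletion_dictionary))
```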
After obtaining the word segmentation result of each text, generation unit 23 looks up the deletion dictionary to check whether the word segmentation result of each text contains deletable components, and if so, marks the deletable components contained in the word segmentation result of each text.
3) The alignment probabilities of the unmarked components between the two word segmentation results of the text pair are determined, and the alignment score of the text pair is determined from the alignment probabilities.
The alignment probabilities of the unmarked components between the two word segmentation results of a text pair are the probability that the components of text one occur in text two and the probability that the components of text two occur in text one. The alignment score of the text pair is determined from the alignment probabilities so obtained.
For example, suppose text one of a text pair contains 5 components and text two also contains 5 components. If all 5 components of text one occur in text two, the alignment probability of text one with text two is 1; if 4 of the 5 components of text one occur in text two, the alignment probability of text one with text two is 0.8. Likewise, if all 5 components of text two occur in text one, the alignment probability of text two with text one is 1; if 3 of the 5 components of text two occur in text one, the alignment probability of text two with text one is 0.6.
Generation unit 23 determines the alignment score of the text pair from the alignment probabilities between the two word segmentation results of the pair. For example, if the alignment probability of text one with text two is 1 and the alignment probability of text two with text one is 0.8, the alignment score of the text pair can be (1, 0.8); alternatively, the two alignment probabilities can be averaged, giving an alignment score of 0.9. When the alignment score of a text pair meets a preset requirement, generation unit 23 can take the text pair as parallel corpus. Generation unit 23 can use a preset threshold to determine which text pairs meet the preset requirement. If the alignment score of a text pair is (1, 0.8), both alignment probabilities in the score must exceed the preset threshold for the text pair to meet the preset requirement; if the alignment score of the text pair is 0.9, the text pair meets the preset requirement when that score exceeds the preset threshold.
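The alignment computation described above can be sketched as follows. Treating "occurs in" as exact membership of a component in the other text is an assumption of this sketch, as are the function names and the threshold value.

```python
def alignment_probability(source, target):
    """Fraction of the components of source that also occur in target."""
    if not source:
        return 0.0
    target_set = set(target)
    return sum(1 for component in source if component in target_set) / len(source)

def alignment_score(tokens_one, tokens_two, deletable):
    """Return (forward, backward, mean) over the unmarked components."""
    one = [t for t in tokens_one if t not in deletable]
    two = [t for t in tokens_two if t not in deletable]
    forward = alignment_probability(one, two)
    backward = alignment_probability(two, one)
    return forward, backward, (forward + backward) / 2

def is_parallel(tokens_one, tokens_two, deletable, threshold):
    """Both directions must exceed the threshold (the tuple form of the score)."""
    forward, backward, _ = alignment_score(tokens_one, tokens_two, deletable)
    return forward > threshold and backward > threshold

# Five components each: all of text one occur in text two, four of text two in text one.
text_one = ["a", "a", "b", "c", "d"]
text_two = ["a", "b", "c", "d", "x"]
print(alignment_score(text_one, text_two, deletable=set()))
```

With the toy tokens above, the forward probability is 1, the backward probability is 0.8 and the averaged score is 0.9, mirroring the numbers in the example.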
(2) Determine the generalizable components of each text of the text pair based on the preset part-of-speech generalization requirement, and generalize each text based on the determined generalizable components.
Generation unit 23 determines the generalizable components of each text of the text pair based on the preset part-of-speech generalization requirement. In the present embodiment, the preset part-of-speech generalization requirement is that at least one of nouns, numerals and time words is generalized, so at least one of the nouns, numerals and time words contained in each text of the text pair is taken as a generalizable component. After determining the generalizable components of each text, generation unit 23 generalizes each text, that is, it replaces each generalizable component of each text with its corresponding part-of-speech slot. It should also be understood that, because a generalized result may contain many components, generation unit 23 can also permute and combine the components contained in the generalized result of each text to obtain all possible generalized results.
(3) Take the generalized result of one text of the pair as the generalized template and the generalized result of the other text as the corresponding rewriting template.
After generalizing each text, generation unit 23 obtains the generalized result of each text; the generalized result can be used to represent the sentence structure of the text. Generation unit 23 takes the generalized result of one text as the generalized template and the generalized result of the other text as the rewriting template, so that the generalized template and the rewriting template obtained from one text pair correspond to each other.
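Storing the resulting template pairs so that a generalized template can later be matched to its rewriting templates could be sketched as below. The table structure and the function name are hypothetical; the embodiment only states that the two generalized results correspond to each other.

```python
from collections import defaultdict

def build_template_table(generalized_pairs):
    """Map each generalized template to its rewriting templates.

    generalized_pairs: iterable of (generalized_template, rewriting_template),
    one per parallel text pair after generalization.
    """
    table = defaultdict(list)
    for generalized, rewriting in generalized_pairs:
        if rewriting not in table[generalized]:  # skip exact duplicates
            table[generalized].append(rewriting)
    return table

pairs = [
    ("[number] [noun] concerning [place]",
     "[number] [noun] about [place], do you know them all"),
    ("[number] [noun] concerning [place]",
     "[number] [noun] about [place], do you know them all"),
]
table = build_template_table(pairs)
print(table.get("[number] [noun] concerning [place]", []))
```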
Rewriting unit 24 is configured to match the rewriting template corresponding to the generalized template, and to rewrite the text content based on the rewriting template.
Based on the generalized template obtained by generalization unit 22, rewriting unit 24 matches the generalized template against the rewriting templates generated in advance by generation unit 23, and then rewrites the text content based on the matched rewriting template to obtain the rewriting result of the text content. Each generalized template can correspond to at least one rewriting template, so the matching rewriting template can be determined from the generalized template obtained.
Before matching the rewriting template corresponding to the generalized template, rewriting unit 24 can additionally apply a template expansion strategy to widen the range of rewriting templates that the generalized template can match.
Optionally, in one specific implementation of the present embodiment, rewriting unit 24 can apply synonym expansion to the components of the generalized template that have not been generalized. Specifically, rewriting unit 24 rewrites the non-generalized components of the generalized template synonymously, that is, it replaces their content using synonyms, aliases and the like. For example, if the generalized template is "who is [name]'s missus" and a synonym of "missus" is "wife", rewriting unit 24 can rewrite the generalized template as "who is [name]'s wife"; if the generalized template is "[number] [noun] of programmers" and an alias of "programmer" is "code farmer", rewriting unit 24 can rewrite the generalized template as "[number] [noun] of code farmers".
Rewriting unit 24 can also compress a specific structure in the generalized template based on a preset compressible-structure dictionary. The compressible-structure dictionary contains the structures that can be compressed together with their compression results; for example, an "attribute + noun" structure can be compressed to "noun", and a "number + noun" structure can be compressed to "noun". For example, if the text content is "Beijing's 10 delicacies" and its generalized template is "Beijing's [number] [noun]", in which "10 delicacies" has the "number + noun" structure, rewriting unit 24 compresses that structure to "[noun 1]" and the generalized template of the text content becomes "Beijing's [noun 1]". It should be understood that when template expansion is performed in this way, the compressed structure must be restored when the text is rewritten, that is, "[noun 1]" is restored to "10 delicacies".
After the rewriting template corresponding to the generalized template has been matched, the text content is rewritten based on the matched rewriting template, and the generalized components in the rewriting template are restored to the corresponding words of the text content. For example, if the text content to be rewritten is "10 secrets concerning Qingdao", its generalized template is "[number] secrets concerning [place]", and the corresponding rewriting template is "[number] secrets about [place], do you know them all", where the generalized component "[place]" corresponds to "Qingdao" and "[number]" corresponds to "10", then the final rewriting result is "10 secrets about Qingdao, do you know them all".
It should also be understood that several rewriting templates may be obtained when the text content is rewritten, in which case rewriting unit 24 can score the rewriting templates obtained and determine the final rewriting template according to the scoring result. After scoring the rewriting templates using the evaluation model, rewriting unit 24 takes the rewriting template that meets a preset requirement as the final rewriting template according to the score of each rewriting template. If the scores of the rewriting templates all differ, the rewriting template with the highest score is taken as the final rewriting template; if several rewriting templates share the highest score, any one of them is selected as the final rewriting template. Rewriting unit 24 rewrites the text content using the final rewriting template so determined to obtain the rewriting result of the text content.
Training unit 25 is configured to obtain the evaluation model by training in advance.
The evaluation model that rewriting unit 24 uses to score the rewriting templates is obtained through training by training unit 25.
Specifically, training unit 25 trains the evaluation model in advance in the following way:
Training samples are obtained; each training sample obtained by training unit 25 includes a template pair, consisting of a generalized template and its corresponding rewriting template, and a score labeled in advance for the rewriting template. After extracting the matching features of the template pair, training unit 25 trains a logistic regression model with the extracted matching features as input and the labeled score of the rewriting template as output, yielding the evaluation model.
The matching features that training unit 25 extracts for a template pair, consisting of a generalized template and its corresponding rewriting template, include: slot alignment information, including the forward slot alignment probability, the reverse alignment probability, the alignment count and so on; slot word-vector similarity, i.e. the cosine similarity of the slot word vectors; slot proper-noun similarity, which uses a category-specific dictionary to judge whether the slots belong to the same category; slot literal similarity, i.e. the similarity computed after each slot is split down to the character level; the slot-boundary language model value, i.e. the fluency at the boundary after slot replacement; the text alignment degree, which determines whether all unaligned components occur in the body text; the template alignment count, a count over the template that reflects its confidence; and the click estimation score, obtained by using a click prediction model to estimate clicks on the template.
Fig. 3 shows a block diagram of an exemplary computer system/server 012 suitable for implementing embodiments of the present invention. The computer system/server 012 shown in Fig. 3 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 3, computer system/server 012 takes the form of a general-purpose computing device. The components of computer system/server 012 can include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 that connects the different system components (including system memory 028 and processing unit 016).
Bus 018 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by computer system/server 012, including volatile and non-volatile media and removable and non-removable media.
System memory 028 can include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. Computer system/server 012 can further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 034 can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 3, commonly called a "hard disk drive"). Although not shown in Fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical medium) can be provided. In these cases, each drive can be connected to bus 018 through one or more data media interfaces. Memory 028 can include at least one program product having a set of (for example, at least one) program modules that are configured to carry out the functions of the embodiments of the present invention.
A program/utility 040 having a set of (at least one) program modules 042 can be stored in, for example, memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. Program modules 042 generally carry out the functions and/or methods of the embodiments described in the present invention.
Computer system/server 012 can also communicate with one or more external devices 014 (such as a keyboard, a pointing device, a display 024, etc.); in the present invention, computer system/server 012 communicates with an external radar device. It can also communicate with one or more devices that enable a user to interact with computer system/server 012, and/or with any device (such as a network card, a modem, etc.) that enables computer system/server 012 to communicate with one or more other computing devices. Such communication can take place through input/output (I/O) interface 022. Computer system/server 012 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through network adapter 020. As shown in the figure, network adapter 020 communicates with the other modules of computer system/server 012 through bus 018. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems and so on.
Processing unit 016 runs the programs stored in system memory 028 to execute various functional applications and data processing, for example to implement a text rewriting method, which can include:
Obtaining text content to be rewritten;
Determining the generalizable components of the text content, and obtaining the generalized template of the text content;
Matching the rewriting template corresponding to the generalized template, and rewriting the text content based on the rewriting template.
The above computer program can be provided in a computer storage medium, that is, the computer storage medium is encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the method flows and/or apparatus operations shown in the above embodiments of the present invention. For example, the method flow performed by the one or more processors can include:
Obtaining text content to be rewritten;
Determining the generalizable components of the text content, and obtaining the generalized template of the text content;
Matching the rewriting template corresponding to the generalized template, and rewriting the text content based on the rewriting template.
With the passage of time and the development of technology, the meaning of "medium" has become increasingly broad: the distribution channel of a computer program is no longer limited to tangible media, and a program can also be downloaded directly from a network. Any combination of one or more computer-readable media may be used. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus or device.
A computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.
Program code contained on a computer-readable medium can be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF and so on, or any suitable combination of the above. Computer program code for carrying out the operations of the present invention can be written in one or more programming languages or a combination of them, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, a generalized template is obtained by generalizing the text content, the corresponding rewriting template is matched according to the generalized template obtained, and the text content is rewritten according to the matched rewriting template. The rewriting can add or delete some components, and the rewriting result differs substantially from the original, so that the whole sentence of the text content is rewritten and the user perceives the rewritten text more clearly.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they can be located in one place or distributed over multiple network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention can be integrated in one processing unit, each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (25)

1. A method for rewriting text, characterized in that the method comprises:
obtaining text content to be rewritten;
determining generalizable components of the text content to obtain a generalized template of the text content;
matching a rewriting template corresponding to the generalized template, and rewriting the text content based on the rewriting template.
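The three steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the toy lexicon `POS_LEXICON`, the template table `REWRITE_TEMPLATES`, the `[POS_n]` slot notation, and whitespace tokenization are all hypothetical stand-ins for the word segmentation, part-of-speech tagging, and template store that the patent leaves unspecified.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon standing in for word segmentation plus POS tagging.
POS_LEXICON: Dict[str, str] = {
    "Beijing": "NOUN",
    "2017": "NUM",
    "yesterday": "TIME",
}

# Hypothetical template table: generalized template -> rewriting template.
REWRITE_TEMPLATES: Dict[str, str] = {
    "how is the weather in [NOUN_1]": "what is [NOUN_1] weather like",
}


def generalize(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each generalizable word with a numbered part-of-speech slot."""
    counts: Dict[str, int] = {}
    slots: Dict[str, str] = {}
    out: List[str] = []
    for word in text.split():
        pos = POS_LEXICON.get(word)
        if pos is None:
            out.append(word)  # not generalizable, keep the literal word
            continue
        counts[pos] = counts.get(pos, 0) + 1
        slot = f"[{pos}_{counts[pos]}]"
        slots[slot] = word
        out.append(slot)
    return " ".join(out), slots


def rewrite(text: str) -> Optional[str]:
    """Generalize the text, look up a rewriting template, and refill the slots."""
    template, slots = generalize(text)
    target = REWRITE_TEMPLATES.get(template)
    if target is None:
        return None  # no rewriting template matches this generalized template
    for slot, word in slots.items():
        target = target.replace(slot, word)
    return target
```

With these toy tables, `rewrite("how is the weather in Beijing")` generalizes the input to `how is the weather in [NOUN_1]`, finds the matching rewriting template, and fills the slot back in; an input with no matching template yields `None`.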
2. The method according to claim 1, characterized in that determining the generalizable components of the text content comprises:
performing word segmentation on the text content to obtain a word segmentation result of the text content;
parsing the word segmentation result to obtain the part of speech of each word in the text content;
determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
3. The method according to claim 2, characterized in that the preset part-of-speech generalization requirement is: generalizing at least one of the nouns, numerals, and time words in the text content.
4. The method according to claim 1, characterized in that obtaining the generalized template of the text content comprises:
generalizing the text content based on the determined generalizable components to obtain generalization results;
obtaining the generalized template of the text content from the generalization results.
5. The method according to claim 1, characterized in that the rewriting template corresponding to the generalized template is generated in advance as follows:
obtaining a parallel corpus of text pairs;
determining, based on a preset part-of-speech generalization requirement, the generalizable components of each text in a text pair, and generalizing each text based on the determined generalizable components;
taking the generalization result of one text of the pair as a generalized template, and the generalization result of the other text as the rewriting template corresponding to it.
6. The method according to claim 4 or 5, characterized in that the generalizing comprises:
generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
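The combination step of claim 6 can be illustrated with a short sketch. It assumes one reading of "permuted and combined", namely that every non-empty subset of the generalizable components yields one generalization result; the function name `generalization_results`, the tag set, and the `[POS]` slot notation are hypothetical, and the input is assumed to be already segmented and tagged.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

# Assumed tag names for the nouns, numerals, and time words named in claim 3.
DEFAULT_SLOT_POS: FrozenSet[str] = frozenset({"NOUN", "NUM", "TIME"})


def generalization_results(
    tokens: List[Tuple[str, str]],
    slot_pos: FrozenSet[str] = DEFAULT_SLOT_POS,
) -> List[str]:
    """Return one template per non-empty subset of generalizable tokens.

    tokens is a list of (word, part_of_speech) pairs.
    """
    candidates = [i for i, (_, pos) in enumerate(tokens) if pos in slot_pos]
    results: List[str] = []
    for size in range(1, len(candidates) + 1):
        for chosen in combinations(candidates, size):
            chosen_set = set(chosen)
            parts = [
                f"[{pos}]" if i in chosen_set else word
                for i, (word, pos) in enumerate(tokens)
            ]
            results.append(" ".join(parts))
    return results
```

Enumerating subsets grows exponentially with the number of generalizable components, so a practical system would presumably cap the number of slots per text; the claims do not state such a limit.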
7. The method according to claim 5, characterized in that the parallel corpus of text pairs is obtained as follows:
obtaining a text corpus;
determining an alignment score between any text pair in the text corpus;
taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
8. The method according to claim 6, characterized in that determining the alignment score between any text pair in the text corpus comprises:
performing word segmentation on each text to obtain a word segmentation result of each text;
marking the deletable components in the word segmentation results using a preset deletion dictionary;
determining the alignment probability of the unmarked components between the two word segmentation results of the text pair, and determining the alignment score of the text pair from the alignment probability.
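The scoring steps of claim 8 can be sketched as follows. The patent does not say how the alignment probability is estimated or how it is turned into a score, so everything numeric here is an assumption: identical words align with probability 1.0, other pairs are looked up in a hypothetical table `ALIGN_PROB`, and the score is the smaller of the two directional averages. `DELETION_DICT` is a hypothetical deletion dictionary, and whitespace tokenization stands in for word segmentation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical deletion dictionary: words marked as deletable before alignment.
DELETION_DICT: Set[str] = {"please", "the", "a"}

# Hypothetical word-level alignment probabilities for non-identical words.
ALIGN_PROB: Dict[Tuple[str, str], float] = {
    ("buy", "purchase"): 0.8,
    ("purchase", "buy"): 0.8,
}


def content_words(text: str) -> List[str]:
    """Segment on whitespace and drop components found in the deletion dictionary."""
    return [w for w in text.lower().split() if w not in DELETION_DICT]


def word_prob(a: str, b: str) -> float:
    """Alignment probability of a word pair (assumed: 1.0 for identical words)."""
    return 1.0 if a == b else ALIGN_PROB.get((a, b), 0.0)


def directed_score(src: List[str], tgt: List[str]) -> float:
    """Average, over source words, of the best alignment probability in the target."""
    if not src or not tgt:
        return 0.0
    return sum(max(word_prob(s, t) for t in tgt) for s in src) / len(src)


def alignment_score(text_a: str, text_b: str) -> float:
    """Symmetric alignment score of a text pair over its unmarked components."""
    a, b = content_words(text_a), content_words(text_b)
    return min(directed_score(a, b), directed_score(b, a))
```

Text pairs whose score meets a chosen threshold would then be kept as parallel corpus entries, as in claim 7; the threshold itself is not given in the claims.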
9. The method according to claim 1, characterized in that, before matching the rewriting template corresponding to the generalized template, the method further comprises:
performing synonym expansion on the components of the generalized template that have not been generalized; or
compressing specific structures contained in the generalized template using a preset compressible-structure dictionary.
10. The method according to claim 1, characterized in that the method further comprises:
scoring the matched rewriting templates using an evaluation model;
according to the scoring result, rewriting the text content using a rewriting template that meets a preset requirement.
11. The method according to claim 9, characterized in that the evaluation model is trained in advance as follows:
obtaining training samples, each training sample comprising a template pair formed by a generalized template and its corresponding rewriting template, together with a score annotated in advance for the rewriting template;
training a logistic regression model with the matching features of the template pair as input and the annotated score as output, to obtain the evaluation model.
12. The method according to claim 11, characterized in that the matching features of the template pair comprise at least one of: slot alignment information, slot word-vector similarity, slot proper-noun similarity, slot literal similarity, slot-boundary language model value, body-text alignment degree, template alignment count, and estimated click score.
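The evaluation model of claims 10 to 12 can be sketched as a small logistic regression trained by batch gradient descent. This is an illustration under assumptions, not the patented training procedure: the annotated scores are assumed to be binary labels (1 for an acceptable rewriting template, 0 otherwise), the feature vectors are assumed to be precomputed, and the function names, learning rate, and epoch count are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(
    features: Sequence[Sequence[float]],
    labels: Sequence[float],
    lr: float = 0.5,
    epochs: int = 2000,
) -> List[float]:
    """Fit weights by batch gradient descent; the last weight is the bias."""
    n_features = len(features[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * v for w, v in zip(weights, x)) + weights[-1])
            err = pred - y  # gradient of the log loss with respect to the logit
            for j, v in enumerate(x):
                grads[j] += err * v
            grads[-1] += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grads)]
    return weights


def score_template(weights: Sequence[float], x: Sequence[float]) -> float:
    """Score one template pair from its matching features (value in 0..1)."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + weights[-1])
```

Each feature vector might hold, for example, a slot alignment flag, a slot word-vector similarity, and a slot literal similarity; matched rewriting templates whose score meets a chosen threshold would then be used for rewriting, as in claim 10.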
13. An apparatus for rewriting text, characterized in that the apparatus comprises:
an obtaining unit, configured to obtain text content to be rewritten;
a generalization unit, configured to determine generalizable components of the text content to obtain a generalized template of the text content;
a rewriting unit, configured to match a rewriting template corresponding to the generalized template, and to rewrite the text content based on the rewriting template.
14. The apparatus according to claim 13, characterized in that, when determining the generalizable components of the text content, the generalization unit specifically performs:
performing word segmentation on the text content to obtain a word segmentation result of the text content;
parsing the word segmentation result to obtain the part of speech of each word in the text content;
determining the generalizable components of the text content based on a preset part-of-speech generalization requirement.
15. The apparatus according to claim 14, characterized in that the preset part-of-speech generalization requirement is: generalizing at least one of the nouns, numerals, and time words in the text content.
16. The apparatus according to claim 13, characterized in that, when obtaining the generalized template of the text content, the generalization unit specifically performs:
generalizing the text content based on the determined generalizable components to obtain generalization results;
obtaining the generalized template of the text content from the generalization results.
17. The apparatus according to claim 13, characterized in that the apparatus further comprises a generation unit which, when generating in advance the rewriting template corresponding to the generalized template, specifically performs:
obtaining a parallel corpus of text pairs;
determining, based on a preset part-of-speech generalization requirement, the generalizable components of each text in a text pair, and generalizing each text based on the determined generalizable components;
taking the generalization result of one text of the pair as a generalized template, and the generalization result of the other text as the rewriting template corresponding to it.
18. The apparatus according to claim 16 or 17, characterized in that, when generalizing, the generalization unit or the generation unit specifically performs:
generalizing each generalizable component into its corresponding part-of-speech slot, wherein the generalizable components are permuted and combined during generalization to obtain the generalization results.
19. The apparatus according to claim 17, characterized in that, when obtaining the parallel corpus of text pairs, the generation unit specifically performs:
obtaining a text corpus;
determining an alignment score between any text pair in the text corpus;
taking the text pairs whose alignment score meets a preset requirement as the parallel corpus of text pairs.
20. The apparatus according to claim 19, characterized in that, when determining the alignment score between any text pair in the text corpus, the generation unit specifically performs:
performing word segmentation on each text to obtain a word segmentation result of each text;
marking the deletable components in the word segmentation results using a preset deletion dictionary;
determining the alignment probability of the unmarked components between the two word segmentation results of the text pair, and determining the alignment score of the text pair from the alignment probability.
21. The apparatus according to claim 13, characterized in that, before matching the rewriting template corresponding to the generalized template, the rewriting unit further performs:
performing synonym expansion on the components of the generalized template that have not been generalized; or
compressing specific structures contained in the generalized template using a preset compressible-structure dictionary.
22. The apparatus according to claim 13, characterized in that the rewriting unit is further configured to perform:
scoring the matched rewriting templates using an evaluation model;
according to the scoring result, rewriting the text content using a rewriting template that meets a preset requirement.
23. The apparatus according to claim 22, characterized in that the apparatus further comprises a training unit which, when training the evaluation model in advance, specifically performs:
obtaining training samples, each training sample comprising a template pair formed by a generalized template and its corresponding rewriting template, together with a score annotated in advance for the rewriting template;
training a logistic regression model with the matching features of the template pair as input and the annotated score as output, to obtain the evaluation model.
24. A device, characterized in that the device comprises:
one or more processors;
a storage apparatus, configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-12.
25. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the method according to any one of claims 1-12.
CN201711138896.XA 2017-11-16 2017-11-16 Method, device and equipment for text rewriting and computer storage medium Active CN108121697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711138896.XA CN108121697B (en) 2017-11-16 2017-11-16 Method, device and equipment for text rewriting and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711138896.XA CN108121697B (en) 2017-11-16 2017-11-16 Method, device and equipment for text rewriting and computer storage medium

Publications (2)

Publication Number Publication Date
CN108121697A true CN108121697A (en) 2018-06-05
CN108121697B CN108121697B (en) 2022-02-25

Family

ID=62228457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711138896.XA Active CN108121697B (en) 2017-11-16 2017-11-16 Method, device and equipment for text rewriting and computer storage medium

Country Status (1)

Country Link
CN (1) CN108121697B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5442546A (en) * 1991-11-29 1995-08-15 Hitachi, Ltd. System and method for automatically generating translation templates from a pair of bilingual sentences
CN101346716A (en) * 2005-12-22 2009-01-14 国际商业机器公司 A method and system for editing text with a find and replace function leveraging derivations of the find and replace input
CN101470700A (en) * 2007-12-28 2009-07-01 日电(中国)有限公司 Text template generator, text generation equipment, text checking equipment and method thereof
CN101639826A (en) * 2009-09-01 2010-02-03 西北大学 Text hidden method based on Chinese sentence pattern template transformation
CN103020040A (en) * 2011-09-27 2013-04-03 富士通株式会社 Rewriting processing method and equipment of source languages, and machine translation system
CN103186509A (en) * 2011-12-29 2013-07-03 北京百度网讯科技有限公司 Wildcard character class template generalization method and device and general template generalization method and system
CN103678270A (en) * 2012-08-31 2014-03-26 富士通株式会社 Semantic unit extracting method and semantic unit extracting device
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN106650943A (en) * 2016-10-28 2017-05-10 北京百度网讯科技有限公司 Auxiliary writing method and apparatus based on artificial intelligence
JP2017129994A (en) * 2016-01-19 2017-07-27 日本電信電話株式会社 Sentence rewriting device, method, and program

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
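刘圆圆: "Template-based rewriting of sentences with several special structures", Modern Electronics Technique *
林燕芬: "Research on template-based rewriting methods for Chinese complex sentences", China Master's Theses Full-text Database, Information Science and Technology *
桑亚辉: "Research on automatic rewriting of Chinese sentences based on the template method", China Master's Theses Full-text Database, Information Science and Technology *
谢碧清: "Research on algorithms for rewriting Chinese sentence patterns", China Master's Theses Full-text Database, Information Science and Technology *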

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241286A (en) * 2018-09-21 2019-01-18 百度在线网络技术(北京)有限公司 Method and apparatus for generating text
CN109739968A (en) * 2018-12-29 2019-05-10 北京猎户星空科技有限公司 A kind of data processing method and device
CN109766537A (en) * 2019-01-16 2019-05-17 北京未名复众科技有限公司 Study abroad document methodology of composition, device and electronic equipment
CN110309280A (en) * 2019-05-27 2019-10-08 重庆小雨点小额贷款有限公司 A kind of corpus expansion method and relevant device
CN110309280B (en) * 2019-05-27 2021-11-09 重庆小雨点小额贷款有限公司 Corpus expansion method and related equipment
CN111666775A (en) * 2020-05-21 2020-09-15 平安科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN111666775B (en) * 2020-05-21 2023-08-22 平安科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113822034A (en) * 2021-06-07 2021-12-21 腾讯科技(深圳)有限公司 Method and device for repeating text, computer equipment and storage medium
CN113822034B (en) * 2021-06-07 2024-04-19 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for replying text
CN113935306A (en) * 2021-09-14 2022-01-14 有米科技股份有限公司 Method and device for processing advertising pattern template
CN115713071A (en) * 2022-11-11 2023-02-24 北京百度网讯科技有限公司 Training method of neural network for processing text and method for processing text
CN115713071B (en) * 2022-11-11 2024-06-18 北京百度网讯科技有限公司 Training method for neural network for processing text and method for processing text

Also Published As

Publication number Publication date
CN108121697B (en) 2022-02-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant