CN107247706A

CN107247706A - Text punctuate method for establishing model, punctuate method, device and computer equipment

Info

Publication number: CN107247706A
Application number: CN201710458179.9A
Authority: CN
Inventors: 谢瑜; 张昊; 朱频频
Original assignee: BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co Ltd; Shanghai Zhizhen Intelligent Network Technology Co Ltd; China Electronics Standardization Institute
Current assignee: BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co Ltd; Shanghai Zhizhen Intelligent Network Technology Co Ltd; China Electronics Standardization Institute
Priority date: 2017-06-16
Filing date: 2017-06-16
Publication date: 2017-10-13
Anticipated expiration: 2037-06-16
Also published as: CN107247706B

Abstract

The present invention provides a kind of text punctuate method for establishing model, punctuate method, device and computer equipment, and the method that text punctuate model is set up includes：Participle is carried out to training corpus, the corresponding word of the training corpus is obtained；For institute's predicate addition characteristic information, the characteristic information includes pause information；Using condition random field algorithm, the characteristic information based on institute's predicate is trained to the corresponding word of the training corpus, to obtain the text punctuate model.The correspondence above method, the present invention also provides a kind of punctuate method, device and computer equipment.

Description

Text punctuate method for establishing model, punctuate method, device and computer equipment

Technical field

The present invention relates to technical field of intelligent interaction, more particularly to the method and device that a kind of text punctuate model is set up.

Background technology

At present, communication interaction is carried out by way of voice more and more universal, and voice content is passed through into writing text shape After formula is preserved, the text usually not punctuation mark of preservation, even without intermittent information, this text just to preserving This progress is read and understood there is provided obstacle.

The content of the invention

The present invention provides a kind of method that text punctuate model is set up, can be more accurately to the number without pause information According to being made pauses in reading unpunctuated ancient writings.

According to above-mentioned purpose, the present invention provides a kind of method that text punctuate model is set up, and methods described includes：To training Language material carries out participle, obtains the corresponding word of the training corpus；For institute's predicate addition characteristic information, the characteristic information includes stopping Information；Using condition random field algorithm, the characteristic information based on institute's predicate is trained to the corresponding word of the training corpus, To obtain the text punctuate model.

In one embodiment, methods described also includes：Test data is made pauses in reading unpunctuated ancient writings with the text punctuate model, obtained To punctuate result；Judge whether the accuracy rate of the punctuate result is more than or equal to accuracy rate threshold value；If it is not, then adjusting the bar The characteristic information frequency threshold value parameter and/or fitting parameter of part random field algorithm, until passing through characteristic information frequency threshold value parameter And/or the standard for the punctuate result that obtained text punctuate model made pauses in reading unpunctuated ancient writings to the training corpus is trained after fitting parameter adjustment True rate is more than or equal to accuracy rate threshold value, then obtained text punctuate model will be trained to break as final text after the adjustment Sentence model.

The correspondence above method, the present invention also provides the device that a kind of text punctuate model is set up, and described device includes：Participle Module, for carrying out participle to training corpus, obtains the corresponding word of the training corpus；Characteristic information add module, for for Institute's predicate addition characteristic information, the characteristic information includes pause information；Training module, for utilizing condition random field algorithm, Characteristic information based on institute's predicate is trained to the corresponding word of the training corpus, to obtain the text punctuate model.

In one embodiment, described device also includes：Test module, for using the text punctuate model to test number According to being made pauses in reading unpunctuated ancient writings, punctuate result is obtained；Accuracy rate judge module, for judge the accuracy rate of the punctuate result whether be more than or Equal to accuracy rate threshold value；If accuracy rate judge module 305 is judged as NO, described device also includes：Parameter adjustment module, is used In the characteristic information frequency threshold value parameter and/or fitting parameter that adjust the condition random field algorithm, until passing through characteristic information Train what obtained text punctuate model was made pauses in reading unpunctuated ancient writings to the training corpus after frequency threshold value parameter and/or fitting parameter adjustment The accuracy rate of punctuate result is more than or equal to accuracy rate threshold value, then will be trained after the adjustment obtained text punctuate model as Final text punctuate model.

The present invention applies to condition random field algorithm in the training of punctuate model, has fully used condition random field algorithm Advantage so that punctuate model punctuate accuracy rate it is higher.

Brief description of the drawings

Fig. 1 shows the flow chart for the method one side that a kind of text punctuate model of the invention is set up；

Fig. 2 shows the flow chart for the method other side that a kind of text punctuate model of the invention is set up；

Fig. 3 shows the flow chart for the method made pauses in reading unpunctuated ancient writings to speech data；

Fig. 4 shows the mechanism map for the device that a kind of text punctuate model of one aspect of the invention is set up.

Embodiment

In order to add pause information to the text without the mark that pauses, the present invention provides the side that a kind of text punctuate model is set up The data for including pause information are trained by method using condition random field algorithm, are obtained punctuate model, are obtained using training Punctuate model to without pause mark text make pauses in reading unpunctuated ancient writings.

In one embodiment, Fig. 1 is refer to, Fig. 1 shows the method one that a kind of text punctuate model of the invention is set up The flow chart of aspect, methods described includes：

Step 101：Participle is carried out to training corpus, the corresponding word of training corpus is obtained；

Step 102：Characteristic information is added for word, characteristic information includes pause information；

Step 103：Using condition random field algorithm, word-based characteristic information is instructed to the corresponding word of training corpus Practice, to obtain text punctuate model.

Training corpus is the text data at least with pause information, can have punctuation mark, the mark in training corpus Point symbol is the pause symbol of training corpus, can mark pause information based on punctuation mark and based on artificial verification, Can not have punctuation mark in training corpus, be based only upon artificial verification mark pause information.Namely need with having had There are the data of punctuate information, to carry out the training of punctuate model.The rule of punctuate information in training corpus is found out, and then sets up disconnected Sentence model.

, can also be for different field point because sentence and word have certain regularity used in different fields The corresponding punctuate model in the field is not trained, for example, can be respectively to telecommunications customer service field, military field, financial field, science and technology Different punctuate models are set up in field etc. respectively.

Because word is the base unit of text composition, and in the middle of a word be not in intermittent information, step 101 pair instruction Practice language material and carry out participle so that the text that need to be made pauses in reading unpunctuated ancient writings is converted into corresponding word.

Word has many attributes, i.e. feature, such as part of speech, semanteme, sentence element (such as subject, predicate, object). Whether these attributes of word generally have contact with the word at the pause of sentence, by consider word above-mentioned attribute and each The correlation of position between word, you can show which word should be located at the pause of sentence.

Such as one sentence " formally sign the amount of money with AM General a few days ago and assisted up to 3.06 hundred million dollars of buying by Shanghai GM View, vehicle and parts for purchasing the latter." " agreement " and " parts " at pause be all noun, that is to say, that have In the case of certain, the possibility for occurring pause information behind noun is very high.

Step 102 is performed, is word addition characteristic information, characteristic information includes pause information.Consider first, with training language The existing pause information of word of material, to carry out the training of punctuate model, that is to say, that find out the training with pause information The pause rule of the corresponding word of language material, punctuate model is set up using the pause rule.

It is preferred that, the pause information based on training corpus carries out the training of punctuate model.

It is preferred that, the part-of-speech information of positional information and institute predicate of the characteristic information also including institute's predicate.

It is preferred that, characteristic information also includes sentence element information.

The step of for institute's predicate addition characteristic information, further comprise：

For institute's predicate addition sentence element information.

Specifically, can think that institute's predicate adds sentence element information by carrying out syntactic analysis to institute's predicate.

In one embodiment, it is first by the pause information flag of the corresponding word of first training corpus before pause symbol Mark；It is the second mark by the pause information flag of other words.

For example, to " Shanghai GM formally signs the amount of money up to 3.06 hundred million dollars of procurement agreement with AM General a few days ago, is used for Purchase the vehicle and parts of the latter." participle and add pause information result it is as shown in table 1.

The participle of table 1 simultaneously adds pause information

Word	Pause information	Word	Pause information
				Shanghai GM	S	's	S
A few days ago	S	Buying	S
				Formally	S	Agreement	E
With	S	For	S
				The U.S.	S	Buying	S
General Motors	S	The latter	S
				Signature	S	's	S
The amount of money	S	Vehicle	S
				Reach	S	And	S
3.06 hundred million dollars	S	Parts	E

Wherein, S represents that the word is not at the pause of text, and E represents that the word is at the pause of sentence.

More preferably, while considering other characteristic informations of word to train punctuate model, it is contemplated that the spy of the word of utilization Reference breath is more, then the accuracy rate for training the punctuate model come is higher.

In one embodiment, it is the part-of-speech information of positional information and word of the characteristic information of word addition also including word.It is so-called The positional information of word, i.e., the correlation of the position of each above-mentioned word, in the example above, " a few days ago " just " on Sea it is general " and " formal " centre.Of course, it is possible to which the scope as needed to consider positional information, such as can also examine Word before considering " U.S. " is "AND", and word afterwards is " General Motors ", and second word before is " formal ", afterwards second Individual word is " signature ".The scope that positional information is included is wider, then the accuracy rate for training obtained punctuate model to be made pauses in reading unpunctuated ancient writings is got over It is high.

Each word in table 1 has contained the mutual positional information of each word, is selected as needed when training Select the scope of the positional information included.

After part-of-speech information has been added for word, the word segmentation result with pause information and part-of-speech information is as shown in table 2.

The participle of table 2 simultaneously adds pause information and part-of-speech information

Wherein, the part of speech implication that each English alphabetic combination is represented is as follows：

Noun n, time word t, place word s, noun of locality f, number m, measure word q, distinction word b, pronoun r, verb v, adjective A, descriptive word z, adverbial word d, preposition p, conjunction c, auxiliary word u, modal particle y, interjection e, onomatopoeia o, Chinese idiom i, idiom l, abbreviation J, enclitics h, ingredient k, morpheme g, non-morpheme word x, punctuation mark w are followed by, the angle applied from corpus is added proprietary Noun (name nr, place name ns, organization names nt, other proper noun nz).

More preferably, while adding sentence element information for word.Sentence element information include subject, predicate, object, attribute, The adverbial modifier, complement etc..

In one embodiment, think that word adds sentence element information by carrying out syntactic analysis to word.

In one embodiment, it is word addition semantic information, semantic information is determined by the implication of of word itself, and the implication can To be obtained by the textual representation of the word.Different character express might have identical implication, in one embodiment, will have Different literals are stated, but the word for having identical meanings is mapped to one word of identical.

Word-based attribute and the positional information of word are executable step 103, using condition random field algorithm, to training language Expect that corresponding word is trained, to obtain text punctuate model.It is that one word of a word is trained when carrying out model training 's.

When being trained for each word, in order to more efficiently select the scope of the attribute of word and the positional information of word, In one embodiment, using condition random field algorithm, institute corresponding with the feature templates is extracted according to default feature templates Predicate and its characteristic information, to be trained to the training corpus, obtain the text punctuate model, wherein, the spy of extraction Reference breath comprises at least pause information, and the relation that default feature templates are used for the current word for representing and training meets preset requirement Word and its characteristic information.

That is, feature templates define the word and the corresponding characteristic information of word that training is used.In one embodiment, it is special Levying the relation met with the current word of training of template representation includes any one or more of following information combination：The language of current word The combination of the pause information of adopted information and current word；The combination of the part-of-speech information of current word and the pause information of current word；Upper one The semantic information of individual word, the pause information of a upper word, the combination of the pause information of the semantic information of current word and current word；When The semantic information of preceding word, the pause information of current word, the combination of the pause information of the part-of-speech information of next word and next word； The pause information of the part-of-speech information of a upper word, a upper word, the part-of-speech information of a upper word, the pause letter of a upper word The combination of the pause information of breath, the part-of-speech information of current word and current word；The pause of the part-of-speech information of a upper word, a upper word Information, the part-of-speech information of current word, the pause information of current word, the pause information of the part-of-speech information of next word and next word Combination.

Wherein, institute's semantic information is expressed or can be semantic to be believed according to default after participle in itself for the text of current word Breath matching method matches are obtained, and are such as matched default term vector, are matched default thesaurus, match default near synonym storehouse To represent.Institute's semantic information can be by word, term vector, where it default thesaurus, default where it Near synonym storehouse etc. is represented.

Using features described above template, training the characteristic information for the word used not only includes pause information, in addition to institute's predicate Positional information, the part-of-speech information of institute's predicate etc..

That is, a kind of feature templates provide only to consider that the semantic information of current word is trained, another character modules Plate gauge considers the part-of-speech information training of current word surely, and also template is provided while considering the part-of-speech information of a upper word, current word The combination of part-of-speech information of part-of-speech information and next word be trained.Certainly, feature templates not only include above-mentioned row The content of act, various training areas all should be within the scope of the present invention.

The code representation that a kind of feature templates are given below (gives tacit consent to the pause for the word for including extracting in feature templates Information)：

U03:%x [0,0] # current words are semantic

U04:%x [0,1] # current word parts of speech

U05:The combination semantic with current word of the upper word justice of %x [- 1,0]/%x [0,0] #

U06:The semantic combination with next word part of speech of %x [0,0]/%x [1,1] # current words

U20:%x [- 2,1]/%x [- 1,1]/%x [0,1]

The part of speech of the upper word of #, the combination of the part of speech and current word part of speech of a upper word

U24:%x [- 1,1]/%x [0,1]/%x [1,1]

The combination of the part of speech of the part of speech of the upper word of #, the part of speech of current word and next word

Wherein, U03:%x [0,0] code represents to carry out the training of model, U04 with the semantic information of current word:%x The code of [0,1] represents the training of the part-of-speech information progress model with current word, and the implication of other codes is by that analogy.

When carrying out model training, a feature templates can be selected and be trained, multiple features can also be selected simultaneously Template is trained, and the number of the feature templates of utilization is more, it is considered to word and its corresponding characteristic information it is more, model instruction Experienced effect is better, and the punctuate accuracy rate for training the punctuate model come is also higher.

For the accuracy rate of the punctuate model that further improves training, in one embodiment, Fig. 2 is refer to, Fig. 2 is shown The flow chart for the method other side that a kind of text punctuate model of the present invention is set up, methods described includes：

Step 201：Participle is carried out to training corpus, the corresponding word of training corpus is obtained；

Step 202：Characteristic information is added for word, characteristic information includes pause information；

Step 203：Using condition random field algorithm, word-based characteristic information is instructed to the corresponding word of training corpus Practice, to obtain text punctuate model.

Step 204：Test data is made pauses in reading unpunctuated ancient writings with text punctuate model, punctuate result is obtained；

Step 205：Judge whether the accuracy rate of the punctuate result is more than or equal to accuracy rate threshold value；If it is not, then entering Step 206, if then entering step 208.

Step 206：Adjust the characteristic information frequency threshold value parameter and/or fitting parameter of the condition random field algorithm；

Step 207：Judge by training obtained text after characteristic information frequency threshold value parameter and/or fitting parameter adjustment The accuracy rate for the punctuate result that punctuate model is made pauses in reading unpunctuated ancient writings to the training corpus is more than or equal to accuracy rate threshold value, if it is not, then Return to step 206, if then entering step 208；

Step 208：Terminate.

After the punctuate model for obtaining being trained by training corpus, performance test data are come to obtained punctuate model Tested.Namely using the data of known pause information, go to test the performance of punctuate model.

For example, as it is known that " Shanghai GM formally signs the amount of money up to 3.06 hundred million dollars of procurement agreement with AM General a few days ago, uses In the vehicle and parts of buying the latter." punctuate information, will remove punctuate after the sentence, i.e., " Shanghai GM a few days ago formally with AM General signs the amount of money is used for the vehicle and parts for purchasing the latter up to 3.06 hundred million dollars of procurement agreement " as test number According to inputting obtained punctuate model, punctuate model can export the punctuate result to the test data, compare the punctuate result and The punctuate result known, you can judge the accuracy rate of the punctuate model.

Accuracy rate threshold value can be set to judge whether the accuracy rate of punctuate model meets the requirements, when accuracy rate is more than or waits When accuracy rate threshold value, that is, think that the accuracy rate of punctuate model is sufficiently high, this punctuate model can be used to carry out other The punctuate of text.

Certainly, test process can be multiple, and the comprehensive standard for weighing each test is integrated by statistical method True rate, and judge whether the synthesis accuracy rate meets requirement.

If testing obtained accuracy rate is not reaching to accuracy rate threshold value, the characteristic information of regularization condition random field algorithm Number threshold parameter and/or fitting parameter, are trained after by characteristic information frequency threshold value parameter and/or fitting parameter adjustment To the accuracy rate of punctuate result made pauses in reading unpunctuated ancient writings to training corpus of text punctuate model be more than or equal to accuracy rate threshold value, then will Obtained text punctuate model is trained after adjustment as final text punctuate model.

Wherein, characteristic information frequency threshold value parameter refers to the pass each other that the characteristic information of word in training corpus is constituted It is the frequency threshold value occurred.

For example, considering " combination of the part of speech of the part of speech of a upper word, the part of speech of current word and next word ", this is special When levying template and being trained the training of language material, the relation each other that a kind of characteristic information is constituted is " noun adverbial word noun ", The current word for referring to training is adverbial word, and word and latter one word are all the characteristic information of each this word of noun before it Relation.

Fitting parameter refers to the hyper parameter in condition random field algorithm, flat between can adjusting overfitting and not being fitted Weighing apparatus degree, fit parameter values are bigger, and the degree of condition random field algorithm fitting training data is higher.

According to experiment experience, in one embodiment, the feature of regularization condition random field algorithm in 1 to 5 number range Information frequency threshold value parameter, the fitting parameter of regularization condition random field algorithm in 1 to 3 number range.

In one embodiment, punctuate model is tested by the input of voice.Now, first have to enter tone testing data Row speech recognition, converts voice data into text data i.e. speech data text, then, is carried out using the speech data text The test of punctuate model.

More preferably, it is speech data text addition pause symbol, due to being test data, can be added by manual type Pause symbol, the pause can be punctuation mark.

There is punctuation mark, you can add pause information for speech data text, that is, the place for having punctuation mark is to stop The place paused.

Utilize the text with pause information, you can carry out the survey to punctuate model punctuate accuracy rate using preceding method Examination.

If carrying out the test of punctuate model using speech data, the processing procedure of early stage includes：To tone testing data Speech recognition is carried out, speech data text is obtained；Participle is carried out to speech data text, speech data text is obtained corresponding Word；Pause symbol is added for speech data text；It is speech data text addition pause information based on pause symbol.

In one embodiment, foregoing participle operation can be carried out with dictionary for word segmentation.Record big in dictionary for word segmentation The word of amount, the text for needing participle is compareed with each word in dictionary for word segmentation, deposited if being occurred in that in text in dictionary for word segmentation Word, then corresponding word is set as word.

In one embodiment, new word discovery is carried out to training corpus, obtained neologisms is added in dictionary for word segmentation.Work as discovery When there is the word not having in dictionary for word segmentation in training corpus, word can be distinguished by artificial mode or new word discovery method, and Word is added in dictionary for word segmentation.

In one embodiment, when making pauses in reading unpunctuated ancient writings with text punctuate model to test data, comprise the following steps：With Text punctuate model is made pauses in reading unpunctuated ancient writings to test data, obtains multiple primary punctuate results；With the language number made pauses in reading unpunctuated ancient writings by standard The n-gram language models obtained according to training calculate the probability that always forms a complete sentence of each primary punctuate result respectively, by the probability that always forms a complete sentence The corresponding primary punctuate result of highest is used as punctuate result.

Due to condition random field algorithm be based on probability, by its train come punctuate model can be accurate according to punctuate The size of true probability exports multiple punctuate results.

At the beginning of the n-gram language models that the language data training now made pauses in reading unpunctuated ancient writings again by standard is obtained calculate each respectively The probability that always forms a complete sentence of level punctuate result, the corresponding primary punctuate result of the probability highest that will always form a complete sentence is used as punctuate result.So Just by multistage screening, to determine final punctuate result, the accuracy rate of punctuate is improved.

N-gram language models based on it is a kind of it is assumed that i.e. the appearance of n-th word only it is related to above N-1 word, and and its Its any word is all uncorrelated, and the probability of whole sentence is exactly the product of each word probability of occurrence.These probability can be by directly from language N number of word is counted in material while the number of times occurred is obtained.

In one embodiment, the probability that forms a complete sentence that probability is primary each subordinate sentence of punctuate result that always forms a complete sentence of primary punctuate result Product.

Fig. 3 is refer to, Fig. 3 shows the flow chart for the method made pauses in reading unpunctuated ancient writings to speech data.

Step 301：Speech recognition is carried out to speech data, speech data text is generated；

Step 302：By speech data text input punctuate model, multiple primary punctuate results are obtained；

Step 303：By multiple primary punctuate result input n-gram language models, the total of each primary punctuate result is judged Form a complete sentence probability, and the probability highest that will always form a complete sentence primary punctuate result is used as final punctuate result.

Because punctuate model is obtained by condition random field Algorithm for Training, condition random field algorithm is based on probability In algorithm, step 302, by speech data text input punctuate model, you can obtain according to accuracy rate probability from high to low many Individual primary punctuate result.

The present invention also provides a kind of method of punctuate, obtains treat punctuate text first.

Specifically：Obtain and treat punctuate speech data；

Treat that punctuate speech data carries out speech recognition to described, recognition result is treated into punctuate text as described.

Then it will treat in the punctuate model that the foregoing any training of punctuate text input is obtained, you can complete to speech data Punctuate.

The present invention also provides a kind of computer equipment, including memory, processor and stores on a memory and can locate The computer program run on reason device, the method that the foregoing any text punctuate model of computing device is set up.

The present invention also provides be stored with a kind of computer-readable storage medium, the storage medium instruction, the instruction operation The method that any text punctuate model foregoing Shi Zhihang is set up.

Correspondence preceding method, the present invention also provides the device that a kind of text punctuate model is set up, refer to Fig. 4, Fig. 4 is shown The mechanism map for the device that a kind of text punctuate model of one aspect of the invention is set up.

Described device includes：Word-dividing mode 401, for carrying out participle to training corpus, obtains the corresponding word of training corpus； Characteristic information add module 402, for adding characteristic information for word, characteristic information includes pause information；Training module 403, is used In utilizing condition random field algorithm, word-based characteristic information is trained to the corresponding word of training corpus, is broken with obtaining text Sentence model.

Because word is the base unit of text composition, and be not in intermittent information in the middle of a word, word-dividing mode 401 Participle is carried out to training corpus so that the text that need to be made pauses in reading unpunctuated ancient writings is converted into corresponding word.

Characteristic information add module 402 is that word adds characteristic information, and characteristic information includes pause information.Consider first, transport With the existing pause information of the word of training corpus, to carry out the training of punctuate model, that is to say, that finding out has the letter that pauses The pause rule of the corresponding word of training corpus of breath, punctuate model is set up using the pause rule.

More preferably, described device also includes：Test module 404, for being carried out with text punctuate model to test data Punctuate, obtains punctuate result；Accuracy rate judge module 405, for judging whether the accuracy rate of punctuate result is more than or equal to standard True rate threshold value；Parameter adjustment module 406, if being judged as NO for accuracy rate judge module 405, is calculated for regularization condition random field The characteristic information frequency threshold value parameter and/or fitting parameter of method, join until by characteristic information frequency threshold value parameter and/or fitting The accuracy rate for counting the punctuate result for training obtained text punctuate model to make pauses in reading unpunctuated ancient writings training corpus after adjustment is more than or equal to Accuracy rate threshold value, then will train obtained text punctuate model to be used as final text punctuate model after adjustment.

Above-mentioned test data is likely to be tone testing data, in one embodiment, and the test data is surveyed for voice Data are tried, the test module also includes：Sound identification module, for carrying out speech recognition to the tone testing data, is obtained To speech data text；Punctuate module：The speech data text is made pauses in reading unpunctuated ancient writings with the text punctuate model, broken Sentence result.

The tone testing data of identification may not have pause information, in one embodiment, and the accuracy rate judge module is also Including：Pause symbol add module, for adding pause symbol for the speech data text；Pause information add module, is used In based on the pause symbol, being speech data text addition pause information；Computing module, for based on the voice number According to the pause information of text, the accuracy rate of the punctuate result is calculated；Judge module, for judging the accurate of the punctuate result Whether rate is more than or equal to accuracy rate threshold value.

If training corpus is speech data, in one embodiment, the training corpus is voice training language material, the dress Putting also includes：Sound identification module, for carrying out speech recognition to the tone testing data, obtains speech data text；Institute State word-dividing mode to be further used for, participle is carried out to the speech data text, the corresponding word of the speech data text is obtained； The characteristic information add module is further used for, and is the speech data text addition pause symbol；Based on the symbol that pauses Number, it is the speech data text addition pause information.

In one embodiment, the characteristic information also includes：The positional information of institute's predicate and the part-of-speech information of institute's predicate.

In one embodiment, the characteristic information add module is further used for：For institute's predicate addition sentence element information.

In one embodiment, the characteristic information add module is further used for：By carrying out syntactic analysis to institute's predicate Think institute's predicate addition sentence element information.

Sentence element information includes subject, predicate, object, attribute, the adverbial modifier, complement etc..

When being trained for each word, in order to more efficiently select the scope of the attribute of word and the positional information of word, In one embodiment, the training module is further used for：Using condition random field algorithm, extracted according to default feature templates Corresponding with the feature templates institute's predicate and its characteristic information, to be trained to the training corpus, obtain the text Punctuate model, wherein, the relation of presently described word of the default feature templates for representing and training meets the word of preset requirement And its characteristic information.

In one embodiment, the relation met with the current word of training that the feature templates are represented includes following information group Any one or more closed：The combination of the semantic information of current word and the pause information of current word；The part-of-speech information of current word and The combination of the pause information of current word；Pause information, the semantic information of current word of the semantic information of a upper word, a upper word With the combination of the pause information of current word；The semantic information of current word, the pause information of current word, the part-of-speech information of next word With the combination of the pause information of next word；The pause information of the part-of-speech information of a upper word, a upper word, a upper word Part-of-speech information, the pause information of a upper word, the combination of the pause information of the part-of-speech information of current word and current word；Upper one The pause information of the part-of-speech information of word, a upper word, the part-of-speech information of current word, the pause information of current word, next word The combination of the pause information of part-of-speech information and next word.

In one embodiment, pause symbol is included in the training corpus, the pause for identifying the training corpus is believed Breath, the characteristic information add module is further used for：The pause symbol based on the training corpus is institute's predicate addition Pause information.

In one embodiment, the characteristic information add module is further used for：By first before the pause symbol The pause information flag of the corresponding institute's predicate of the training corpus is the first mark；It is by the pause information flag of other predicates Second mark.

In one embodiment, the word-dividing mode, is further used for：The training corpus is divided with dictionary for word segmentation Word.

In one embodiment, described device also includes：New word discovery module, for carrying out neologisms hair to the training corpus It is existing, obtained neologisms are added in the dictionary for word segmentation.

According to experiment experience, in one embodiment, the parameter adjustment module is further used for：In 1 to 5 number range The characteristic information frequency threshold value parameter of the interior adjustment condition random field algorithm, adjusts the condition in 1 to 3 number range The fitting parameter of random field algorithm.

In one embodiment, the test module is further used for：Test data is entered with the text punctuate model Row punctuate, obtains multiple primary punctuate results；The n-gram language moulds obtained with the language data training made pauses in reading unpunctuated ancient writings by standard Type calculates the probability that always forms a complete sentence of each primary punctuate result, the corresponding primary punctuate of the probability highest that will always form a complete sentence respectively As a result as the punctuate result.

In one embodiment, probability the forming a complete sentence for primary each subordinate sentence of punctuate result of always forming a complete sentence of the primary punctuate result The product of probability.

The specific implementation and technique effect for the device that a kind of above-mentioned text punctuate model is set up, are referred to above-mentioned one The embodiment for the method that text punctuate model is set up is planted, be will not be repeated here.

The present invention also provides a kind of device of punctuate, and described device includes：Text acquisition module, punctuate text is treated for obtaining This；Punctuate module, for being treated described in punctuate text input text punctuate model, obtains punctuate result, wherein the text Punctuate model is that the method training set up using a kind of foregoing text punctuate model is obtained.

In one embodiment, the text acquisition module, including：Voice acquiring unit, punctuate voice number is treated for obtaining According to；Voice recognition unit, for it is described treat punctuate speech data carry out speech recognition, using recognition result as treat punctuate text This.

The present invention also provides a kind of computer equipment, including memory, processor and stores on a memory and can locate The computer program run on reason device, a kind of foregoing punctuate method of the computing device.

The present invention also provides be stored with a kind of computer-readable storage medium, the storage medium instruction, the instruction operation A kind of punctuate method foregoing Shi Zhihang.

Those skilled in the art will further appreciate that, the various illustratives described with reference to the embodiments described herein Logic plate, module, circuit and algorithm steps can be realized as electronic hardware, computer software or combination of the two.To be clear Explain to Chu this interchangeability of hardware and software, various illustrative components, frame, module, circuit and step be above with Its functional form makees vague generalization description.Such feature be implemented as hardware or software depend on concrete application and Put on the design constraint of total system.Technical staff can be realized described for every kind of application-specific with different modes Feature, but such realize that decision-making should not be interpreted to cause departing from the scope of the present invention.

With reference to presently disclosed embodiment describe various illustrative logic modules and circuit can with general processor, Digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA) or other FPGAs Device, discrete door or transistor logic, discrete nextport hardware component NextPort or its be designed to carry out any group of function described herein Close to realize or perform.General processor can be microprocessor, but in alternative, the processor can be any routine Processor, controller, microcontroller or state machine.Processor is also implemented as the combination of computing device, such as DSP One or more microprocessors cooperated with the combination of microprocessor, multi-microprocessor, with DSP core or any other this Class is configured.

It can be embodied directly in hardware, in by processor with reference to the step of the method or algorithm that embodiment disclosed herein is described Embodied in the software module of execution or in combination of the two.Software module can reside in RAM memory, flash memory, ROM and deposit Reservoir, eprom memory, eeprom memory, register, hard disk, removable disk, CD-ROM or known in the art appoint In the storage medium of what other forms.Exemplary storage medium is coupled to processor to enable the processor from/to the storage Medium is read and write-in information.In alternative, storage medium can be integrated into processor.Processor and storage medium can Reside in ASIC.ASIC can reside in user terminal.In alternative, processor and storage medium can be used as discrete sets Part is resident in the user terminal.

In one or more exemplary embodiments, described function can be in hardware, software, firmware or its any combinations It is middle to realize.If being embodied as computer program product in software, each function can be used as the instruction of one or more bars or generation Code storage is transmitted on a computer-readable medium or by it.Computer-readable medium includes computer-readable storage medium and communication Both media, it includes any medium for facilitating computer program to shift from one place to another.Storage medium can be can quilt Any usable medium that computer is accessed.It is non-limiting as example, such computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, disk storage or other magnetic storage apparatus can be used to carry or store instruction Or the desirable program code and any other medium that can be accessed by a computer of data structure form.Any connection is also by by rights Referred to as computer-readable medium.If for example, software is to use coaxial cable, fiber optic cables, twisted-pair feeder, digital subscriber line Or the wireless technology of such as infrared, radio and microwave etc is passed from web site, server or other remote sources (DSL) Send, then the coaxial cable, fiber optic cables, twisted-pair feeder, DSL or such as infrared, radio and microwave etc is wireless Technology is just included among the definition of medium.Disk (disk) and dish (disc) as used herein include compact disc (CD), laser disc, laser disc, digital versatile disc (DVD), floppy disk and blu-ray disc, which disk (disk) are often reproduced in the way of magnetic Data, and dish (disc) laser reproduce data optically.Combinations of the above should also be included in computer-readable medium In the range of.

It is for so that any person skilled in the art all can make or use this public affairs to provide of this disclosure be previously described Open.Various modifications of this disclosure all will be apparent for a person skilled in the art, and as defined herein general Suitable principle can be applied to spirit or scope of other variants without departing from the disclosure.Thus, the disclosure is not intended to be limited Due to example described herein and design, but it should be awarded and principle disclosed herein and novel features phase one The widest scope of cause.

Claims

1. a kind of method that text punctuate model is set up, it is characterised in that methods described includes：

Participle is carried out to training corpus, the corresponding word of the training corpus is obtained；

For institute's predicate addition characteristic information, the characteristic information includes pause information；

Using condition random field algorithm, the characteristic information based on institute's predicate is trained to the corresponding word of the training corpus, with Obtain the text punctuate model.

2. the method as described in claim 1, it is characterised in that methods described also includes：

Test data is made pauses in reading unpunctuated ancient writings with the text punctuate model, punctuate result is obtained；

Judge whether the accuracy rate of the punctuate result is more than or equal to accuracy rate threshold value；

If it is not, the characteristic information frequency threshold value parameter and/or fitting parameter of the condition random field algorithm are then adjusted, until passing through Obtained text punctuate model is trained to enter the training corpus after characteristic information frequency threshold value parameter and/or fitting parameter adjustment The accuracy rate of the punctuate result of row punctuate is more than or equal to accuracy rate threshold value, then obtained text will be trained to make pauses in reading unpunctuated ancient writings after the adjustment Model is used as final text punctuate model.

3. method as claimed in claim 2, it is characterised in that the test data is tone testing data, the utilization institute The step of text punctuate model is made pauses in reading unpunctuated ancient writings to test data is stated, is further comprised：

Speech recognition is carried out to the tone testing data, speech data text is obtained；

The speech data text is made pauses in reading unpunctuated ancient writings with the text punctuate model, punctuate result is obtained.

4. method as claimed in claim 3, it is characterised in that the accuracy rate for judging the punctuate result whether be more than or The step of equal to accuracy rate threshold value, further comprises：

Pause symbol is added for the speech data text；

It is the speech data text addition pause information based on the pause symbol；

Based on the pause information of the speech data text, the accuracy rate of the punctuate result is calculated；

Judge whether the accuracy rate of the punctuate result is more than or equal to accuracy rate threshold value.

5. the method as described in claim 1, it is characterised in that the training corpus is voice training language material, and methods described is also Including：

The step of participle, further comprise：

Participle is carried out to the speech data text, the corresponding word of the speech data text is obtained；

The step of addition characteristic information, in addition to：

Pause symbol is added for the speech data text；

It is the speech data text addition pause information based on the pause symbol.

6. the method as described in claim 1, it is characterised in that the characteristic information also includes：The positional information of institute's predicate and The part-of-speech information of institute's predicate.

7. the method as described in claim 1, it is characterised in that the step of the predicate addition characteristic information for institute, further Including：

For institute's predicate addition sentence element information.

8. method as claimed in claim 7, it is characterised in that

Think that institute's predicate adds sentence element information by carrying out syntactic analysis to institute's predicate.

9. the method as described in claim 1, it is characterised in that further comprise the step of the training：

Using condition random field algorithm, institute's predicate corresponding with the feature templates and its spy are extracted according to default feature templates Reference ceases, and to be trained to the training corpus, obtains the text punctuate model, wherein, the characteristic information of extraction is at least Comprising the pause information, default feature templates are for representing that the relation of the current word with being trained in institute's predicate meets default want The word and its characteristic information asked.

10. method as claimed in claim 9, it is characterised in that

The relation met with the current word of training that the feature templates are represented includes any one or more of following information combination： The semantic information of current word, the pause information of current word；The part-of-speech information of current word and the pause information of current word；A upper word Semantic information, the pause information of a upper word, the pause information of the semantic information of current word and current word；The semanteme of current word Information, the pause information of current word, the pause information of the part-of-speech information of next word and next word；The part of speech of a upper word The pause information of information, a upper word, the part-of-speech information of a upper word, pause information, the part of speech of current word of a upper word The pause information of information and current word；The pause information of the part-of-speech information of a upper word, a upper word, the part of speech letter of current word The pause information of breath, the pause information of current word, the part-of-speech information of next word and next word.

11. the method as described in claim 1, it is characterised in that pause symbol is included in the training corpus, for identifying The step of stating the pause information of training corpus, the predicate addition pause information for institute includes：

The pause symbol based on the training corpus is institute's predicate addition pause information.

12. method as claimed in claim 11, it is characterised in that the step of adding pause information for institute's predicate is further wrapped Include：

It is the first mark by the pause information flag of the corresponding institute's predicate of first before the pause symbol training corpus；

It is the second mark by the pause information flag of other predicates.

13. the method as described in claim 1, it is characterised in that described the step of carry out participle to training corpus, further bag Include：

Participle is carried out to the training corpus with dictionary for word segmentation.

14. method as claimed in claim 13, it is characterised in that methods described also includes：

New word discovery is carried out to the training corpus, obtained neologisms are added in the dictionary for word segmentation.

15. method as claimed in claim 2, it is characterised in that the characteristic information of the adjustment condition random field algorithm The step of frequency threshold value parameter and/or fitting parameter, further comprise：

The characteristic information frequency threshold value parameter of the condition random field algorithm is adjusted in 1 to 5 number range, in 1 to 3 number The fitting parameter of the condition random field algorithm is adjusted in the range of value.

16. method as claimed in claim 2, it is characterised in that described to enter with the text punctuate model to test data The step of row punctuate, further comprise：

Test data is made pauses in reading unpunctuated ancient writings with the text punctuate model, multiple primary punctuate results are obtained；

The n-gram language models obtained with the language data training made pauses in reading unpunctuated ancient writings by standard calculate each described primary respectively The probability that always forms a complete sentence of punctuate result, the corresponding primary punctuate result of the probability highest that will always form a complete sentence is used as the punctuate result.

17. method as claimed in claim 16, it is characterised in that the probability that always forms a complete sentence of the primary punctuate result is the primary The product of the probability that forms a complete sentence of each subordinate sentence of punctuate result.

18. a kind of method of punctuate, it is characterised in that methods described includes：

Obtain and treat punctuate text；

Treated described in punctuate text input text punctuate model, obtain punctuate result, wherein the text punctuate model is to adopt The method training set up with a kind of text punctuate model as described in any one of claim 1 to 17 is obtained.

19. method as claimed in claim 18, it is characterised in that described to obtain the step for the treatment of punctuate text and further comprise：

Obtain and treat punctuate speech data；

20. a kind of computer equipment, including memory, processor and storage are on a memory and the meter that can run on a processor A kind of calculation machine program, it is characterised in that text punctuate mould of the computing device as any one of claim 1 to 17 The method that type is set up.

21. be stored with instruction in a kind of computer-readable storage medium, the storage medium, it is characterised in that during the instruction operation Perform the method that a kind of text punctuate model as any one of claim 1 to 17 is set up.

22. a kind of computer equipment, including memory, processor and storage are on a memory and the meter that can run on a processor A kind of calculation machine program, it is characterised in that the method for punctuate of the computing device as described in claim 18 or 19.

23. be stored with instruction in a kind of computer-readable storage medium, the storage medium, it is characterised in that during the instruction operation Perform a kind of punctuate method as described in claim 18 or 19.

24. the device that a kind of text punctuate model is set up, it is characterised in that described device includes：

Word-dividing mode, for carrying out participle to training corpus, obtains the corresponding word of the training corpus；

Characteristic information add module, for adding characteristic information for institute's predicate, the characteristic information includes pause information；

Training module, for utilizing condition random field algorithm, the characteristic information based on institute's predicate is corresponding to the training corpus Word is trained, to obtain the text punctuate model.

25. device as claimed in claim 24, it is characterised in that described device also includes：

Test module, for making pauses in reading unpunctuated ancient writings with the text punctuate model to test data, obtains punctuate result；

Accuracy rate judge module, for judging whether the accuracy rate of the punctuate result is more than or equal to accuracy rate threshold value；

Parameter adjustment module, for when the accuracy rate is less than accuracy rate threshold value, adjusting the spy of the condition random field algorithm Reference ceases frequency threshold value parameter and/or fitting parameter, is adjusted until by characteristic information frequency threshold value parameter and/or fitting parameter The accuracy rate for the punctuate result that obtained text punctuate model made pauses in reading unpunctuated ancient writings to the training corpus is trained to be more than or equal to standard afterwards True rate threshold value, then will train obtained text punctuate model to be used as final text punctuate model after the adjustment.

26. device as claimed in claim 25, it is characterised in that the test data is tone testing data, the test Module also includes：

Sound identification module, for carrying out speech recognition to the tone testing data, obtains speech data text；

Punctuate module：The speech data text is made pauses in reading unpunctuated ancient writings with the text punctuate model, punctuate result is obtained.

27. device as claimed in claim 26, it is characterised in that the accuracy rate judge module also includes：

Pause symbol add module, for adding pause symbol for the speech data text；

Pause information add module, for based on the pause symbol, being the speech data text addition pause information；

Computing module, for the pause information based on the speech data text, calculates the accuracy rate of the punctuate result；

Judge module, for judging whether the accuracy rate of the punctuate result is more than or equal to accuracy rate threshold value.

28. device as claimed in claim 24, it is characterised in that the training corpus is voice training language material, described device Also include：

The word-dividing mode is further used for, and carries out participle to the speech data text, obtains the speech data text pair The word answered；

The characteristic information add module is further used for, and is the speech data text addition pause symbol；Stopped based on described Pause symbol, is the speech data text addition pause information.

29. device as claimed in claim 24, it is characterised in that the characteristic information also includes：The positional information of institute's predicate With the part-of-speech information of institute predicate.

30. device as claimed in claim 24, it is characterised in that the characteristic information add module is further used for：

For institute's predicate addition sentence element information.

31. device as claimed in claim 30, it is characterised in that the characteristic information add module is further used for：

32. device as claimed in claim 24, it is characterised in that the training module is further used for：

Using condition random field algorithm, institute's predicate corresponding with the feature templates and its spy are extracted according to default feature templates Reference ceases, and to be trained to the training corpus, obtains the text punctuate model, wherein, default feature templates are used for Represent that the relation with the presently described word of training meets the word and its characteristic information of preset requirement.

33. device as claimed in claim 32, it is characterised in that

The relation met with the current word of training that the feature templates are represented includes any one or more of following information combination： The semantic information of current word and the pause information of current word；The part-of-speech information of current word and the pause information of current word；Upper one The semantic information of word, the pause information of a upper word, the combination of the pause information of the semantic information of current word and current word；Currently The semantic information of word, the pause information of current word, the pause information of the part-of-speech information of next word and next word；Upper one The pause information of the part-of-speech information of word, a upper word, it is the part-of-speech information of a upper word, the pause information of a upper word, current The part-of-speech information of word and the pause information of current word；The pause information of the part-of-speech information of a upper word, a upper word, current word Part-of-speech information, the pause information of current word, the pause information of the part-of-speech information of next word and next word.

34. device as claimed in claim 24, it is characterised in that pause symbol is included in the training corpus, for identifying The pause information of the training corpus, the characteristic information add module is further used for：Based on described in the training corpus Pause symbol is institute's predicate addition pause information.

35. device as claimed in claim 34, it is characterised in that the characteristic information add module is further used for：

It is the second mark by the pause information flag of other predicates.

36. device as claimed in claim 24, it is characterised in that the word-dividing mode, is further used for：

37. device as claimed in claim 36, it is characterised in that described device also includes：

Obtained neologisms, for carrying out new word discovery to the training corpus, are added the dictionary for word segmentation by new word discovery module In.

38. device as claimed in claim 25, it is characterised in that the parameter adjustment module is further used for：

39. device as claimed in claim 25, it is characterised in that the test module is further used for：

40. device as claimed in claim 39, it is characterised in that the probability that always forms a complete sentence of the primary punctuate result is the primary The product of the probability that forms a complete sentence of each subordinate sentence of punctuate result.

41. a kind of device of punctuate, it is characterised in that described device includes：

Text acquisition module, punctuate text is treated for obtaining；

Punctuate module, for being treated described in punctuate text input text punctuate model, obtains punctuate result, wherein the text Punctuate model is that the method training set up using a kind of text punctuate model as described in any one of claim 1 to 17 is obtained 's.

42. device as claimed in claim 41, it is characterised in that the text acquisition module, including：

Voice acquiring unit, punctuate speech data is treated for obtaining；

Voice recognition unit, for it is described treat punctuate speech data carry out speech recognition, using recognition result as treat punctuate text This.