CN108628813A - Processing method and apparatus, and device for processing - Google Patents

Processing method and apparatus, and device for processing

Info

Publication number
CN108628813A
CN108628813A (application CN201710162165.2A)
Authority
CN
China
Prior art keywords
punctuation
optimal
result
punctuation mark
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710162165.2A
Other languages
Chinese (zh)
Other versions
CN108628813B (en)
Inventor
郑宏 (Zheng Hong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201710162165.2A priority Critical patent/CN108628813B/en
Publication of CN108628813A publication Critical patent/CN108628813A/en
Application granted granted Critical
Publication of CN108628813B publication Critical patent/CN108628813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the present invention provides a processing method and apparatus, and a device for processing. The method includes: obtaining a text to be processed; segmenting the text to be processed into words, to obtain a global word sequence corresponding to the text; performing punctuation-addition processing on the global word sequence, to obtain an optimal punctuation-addition result corresponding to the text; wherein the punctuation-addition processing adds target punctuation marks between adjacent words in the global word sequence, the language-model probability corresponding to the optimal punctuation-addition result is optimal, and the optimal punctuation-addition result includes at least one semantic segment, a semantic segment including consecutive words of the global word sequence and/or consecutive words with punctuation marks added; and outputting the optimal punctuation-addition result. Embodiments of the present invention can improve the accuracy of punctuation addition.

Description

Processing method and apparatus, and device for processing
Technical field
The present invention relates to the technical field of information processing, and in particular to a processing method, a processing apparatus, and a device for processing.
Background art
In information-processing fields such as communications and the Internet, certain application scenarios require punctuation to be added to text that lacks it, for example, adding punctuation to the text corresponding to a speech recognition result.
An existing scheme adds punctuation to the text corresponding to a speech recognition result according to the silent intervals in the speech signal. Specifically, a threshold on the length of silence can be set first: if the length of a silent interval while the speaker is talking exceeds the threshold, punctuation is added at the corresponding position; conversely, if the length of the silent interval does not exceed the threshold, no punctuation is added.
However, the inventor found, in the course of realizing the embodiments of the present invention, that different speakers often have different speaking rates, so adding punctuation to the text corresponding to a speech recognition result according to silent intervals, as in the existing scheme, affects the accuracy of the added punctuation. For example, if a speaker talks too fast, there may be no pause between sentences, or the pause may be so short that it falls below the threshold, and no punctuation at all is added to the text. Conversely, if a speaker talks too slowly, approaching one pause per word, the text ends up with far too many punctuation marks. Both situations cause punctuation errors; that is, the accuracy of the added punctuation is low.
Summary of the invention
In view of the above problems, embodiments of the present invention are proposed to provide a processing method, a processing apparatus, and a device for processing that overcome the above problems or at least partly solve them; embodiments of the present invention can improve the accuracy of punctuation addition.
To solve the above problems, the invention discloses a processing method, including:
obtaining a text to be processed;
segmenting the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
performing punctuation-addition processing on the global word sequence, to obtain an optimal punctuation-addition result corresponding to the text to be processed; wherein the punctuation-addition processing adds target punctuation marks between adjacent words in the global word sequence, the language-model probability corresponding to the optimal punctuation-addition result is optimal, and the optimal punctuation-addition result includes at least one semantic segment, the semantic segment including consecutive words of the global word sequence and/or consecutive words with punctuation marks added; and
outputting the optimal punctuation-addition result.
In another aspect, the invention discloses a processing apparatus, including:
a to-be-processed-text acquisition module, for obtaining a text to be processed;
a word segmentation module, for segmenting the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
a punctuation-addition processing module, for performing punctuation-addition processing on the global word sequence, to obtain an optimal punctuation-addition result corresponding to the text to be processed; wherein the punctuation-addition processing adds target punctuation marks between adjacent words in the global word sequence, the language-model probability corresponding to the optimal punctuation-addition result is optimal, and the optimal punctuation-addition result includes at least one semantic segment, the semantic segment including consecutive words of the global word sequence and/or consecutive words with punctuation marks added; and
a result output module, for outputting the optimal punctuation-addition result.
Optionally, the punctuation-addition processing module includes:
a dynamic-programming processing submodule, for performing punctuation-addition processing on the global word sequence using a dynamic-programming algorithm, to obtain the optimal punctuation-addition result corresponding to the text to be processed.
Optionally, the dynamic-programming processing submodule includes:
a set acquisition unit, for obtaining a word-sequence set corresponding to the global word sequence;
a first recursion unit, for determining, in order of the subsets of the word-sequence set from small to large, the target punctuation marks of the optimal subset punctuation-addition result corresponding to each subset by recursion; the language-model probability corresponding to the optimal subset punctuation-addition result is optimal; and
a first optimal-result acquisition unit, for obtaining the optimal punctuation-addition result corresponding to the text to be processed according to the optimal subset punctuation-addition results corresponding to the subsets of the word-sequence set.
Optionally, a subset of the word-sequence set includes the first i consecutive words of the text to be processed, 0 < i ≤ M, where M is the number of words the text to be processed contains; the first recursion unit then includes:
an addition subunit, for adding punctuation marks between adjacent words in the first i consecutive words according to the target punctuation marks of the optimal subset punctuation-addition result of the first k consecutive words, to obtain at least one subset punctuation-addition path corresponding to the first i consecutive words; wherein 0 < k < i, and k is a positive integer;
a first language-model-probability determination subunit, for determining, using a neural network language model, the language-model probability of the first semantic segment corresponding to each subset punctuation-addition path;
a first selection subunit, for selecting, according to the language-model probabilities of the first semantic segments, the optimal subset punctuation-addition path with the optimal language-model probability from the at least one subset punctuation-addition path; and
a target-punctuation-mark acquisition subunit, for obtaining, according to the punctuation marks included in the optimal subset punctuation-addition path, the target punctuation marks of the optimal subset punctuation-addition result corresponding to the first i consecutive words.
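The prefix recursion described above resembles Viterbi-style dynamic programming. The following Python sketch illustrates the idea: for each prefix of i words, try every recent split point k and candidate punctuation mark, and keep the highest-scoring path. The `lm_logprob` scorer is a made-up stand-in for the neural network language model (an assumption for illustration, not the patent's implementation).

```python
# Dynamic-programming sketch of the prefix recursion (toy scorer).
# best[i] holds (score, punctuated_text) for the first i words.

CANDIDATES = ["", ",", "."]  # "" means no mark between adjacent words

def lm_logprob(text):
    """Stand-in for a neural LM score; here: prefer segments of 2-4 words."""
    segments = [s for s in text.replace(".", ",").split(",") if s.strip()]
    return -sum(abs(len(seg.split()) - 3) for seg in segments)

def add_punctuation(words):
    best = {0: (0.0, "")}
    for i in range(1, len(words) + 1):
        candidates = []
        for k in range(max(0, i - 5), i):       # split point: first k words fixed
            _, prefix = best[k]
            chunk = " ".join(words[k:i])
            for mark in CANDIDATES:
                text = (prefix + " " + chunk).strip() + mark
                candidates.append((lm_logprob(text), text))
        best[i] = max(candidates)               # keep the optimal path so far
    return best[len(words)][1]

print(add_punctuation("hello I am Xiao Ming glad to meet you".split()))
```

A real implementation would score paths with the language model described above; the recursion structure (reusing the optimal result for the first k words when extending to i words) is the part this sketch demonstrates.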
Optionally, the dynamic-programming processing submodule includes:
a global-path acquisition unit, for adding punctuation marks between adjacent words in the global word sequence, to obtain global punctuation-addition paths corresponding to the global word sequence;
a moving acquisition unit, for obtaining, in front-to-back order and in a sliding manner, local punctuation-addition paths and their corresponding second semantic segments from the global punctuation-addition paths; wherein different second semantic segments contain the same number of character units, adjacent second semantic segments have overlapping character units, and a character unit includes a word and/or a punctuation mark;
a second recursion unit, for determining, in front-to-back order and by recursion, the target punctuation marks corresponding to the optimal second semantic segments; the language-model probability corresponding to an optimal second semantic segment is optimal; and
a second optimal-result acquisition unit, for obtaining the optimal punctuation-addition result corresponding to the text to be processed according to the target punctuation marks corresponding to each optimal second semantic segment.
Optionally, the second recursion unit includes:
a second language-model-probability determination subunit, for determining the language-model probability corresponding to the current second semantic segment using an N-gram language model and/or a neural network language model;
a second selection subunit, for selecting the optimal current second semantic segment from multiple current second semantic segments according to the language-model probabilities corresponding to the current second semantic segments;
a target-punctuation-mark determination subunit, for taking the punctuation marks included in the optimal current second semantic segment as the target punctuation marks corresponding to the optimal current second semantic segment; and
a second-semantic-segment determination subunit, for obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
Optionally, the second optimal-result acquisition unit includes:
an addition subunit, for adding punctuation marks to the global word sequence in back-to-front or front-to-back order according to the target punctuation marks corresponding to each optimal second semantic segment, to obtain the optimal punctuation-addition result corresponding to the text to be processed.
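The sliding-window recursion can be pictured with the following simplified sketch: each gap between adjacent words is scored inside a local window of surrounding words, once per candidate mark, and the best mark is kept before the window slides forward. The window size, candidate set, and `toy_lm_score` heuristic are all assumptions standing in for the patent's N-gram/neural language model, and the per-gap greedy choice is a simplification of the recursion over overlapping second semantic segments.

```python
# Sliding-window sketch: score each gap in a local context window with a
# toy "language model" and keep the best candidate mark for that gap.

WINDOW = 3                      # words of context on each side of a gap
MARKS = ["", ","]               # candidate marks between adjacent words

def toy_lm_score(text):
    """Placeholder scorer (assumption): likes a comma right after 'Ming'."""
    return 1.0 if "Ming," in text else 0.5 if "," not in text else 0.0

def punctuate(words):
    pieces = [words[0]]
    for i in range(1, len(words)):
        left = " ".join(words[max(0, i - WINDOW):i])
        right = " ".join(words[i:i + WINDOW])
        best = max(MARKS, key=lambda m: toy_lm_score(left + m + " " + right))
        pieces.append(best + " " + words[i] if best else " " + words[i])
    return "".join(pieces)

print(punctuate("hello I am Xiao Ming glad to meet you".split()))
# → hello I am Xiao Ming, glad to meet you
```

The overlap between adjacent windows is what lets a fixed-length model (such as an N-gram model) make locally consistent decisions across the whole sequence.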
Optionally, the punctuation-addition processing module includes:
a result enumeration submodule, for obtaining the multiple punctuation-addition results corresponding to the global word sequence;
a language-model-probability determination submodule, for determining the language-model probabilities corresponding to the punctuation-addition results; and
a result selection submodule, for selecting, from the multiple punctuation-addition results corresponding to the global word sequence, the punctuation-addition result with the optimal language-model probability as the optimal punctuation-addition result corresponding to the text to be processed.
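The exhaustive alternative can be sketched directly: enumerate every way of placing a candidate mark (or nothing) in each gap between adjacent words, score each complete result, and keep the best. The scorer is again a toy assumption; with G gaps and C candidates there are C**G results, so this only scales to short texts, which is why the dynamic-programming variants above exist.

```python
# Exhaustive enumeration: score every punctuation scheme and keep the best.

from itertools import product

MARKS = ["", ","]                     # candidate marks per gap

def toy_lm_score(text):
    """Placeholder for a language-model probability (assumption)."""
    segments = [s.strip() for s in text.split(",") if s.strip()]
    # Toy preference: segments of about 3 words score best.
    return -sum(abs(len(s.split()) - 3) for s in segments)

def best_result(words):
    gaps = len(words) - 1
    results = []
    for marks in product(MARKS, repeat=gaps):   # every punctuation scheme
        text = words[0]
        for mark, word in zip(marks, words[1:]):
            text += mark + " " + word
        results.append((toy_lm_score(text), text))
    return max(results)[1]

print(best_result("hello I am Xiao Ming glad to meet you".split()))
# → hello I am, Xiao Ming glad, to meet you
```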
In another aspect, the invention discloses a device for processing, which includes a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the following operations:
obtaining a text to be processed;
segmenting the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
performing punctuation-addition processing on the global word sequence, to obtain an optimal punctuation-addition result corresponding to the text to be processed; wherein the punctuation-addition processing adds target punctuation marks between adjacent words in the global word sequence, the language-model probability corresponding to the optimal punctuation-addition result is optimal, and the optimal punctuation-addition result includes at least one semantic segment, the semantic segment including consecutive words of the global word sequence and/or consecutive words with punctuation marks added; and
outputting the optimal punctuation-addition result.
Embodiments of the present invention have the following advantages:
In the embodiments of the present invention, the punctuation-addition processing adds target punctuation marks between adjacent words in the global word sequence corresponding to the text to be processed; the language-model probability corresponding to the optimal punctuation-addition result obtained by the punctuation-addition processing is optimal, and the optimal punctuation-addition result may include at least one semantic segment, where a semantic segment may include consecutive words of the global word sequence and/or consecutive words with punctuation marks added. Since the optimal punctuation-addition result of the embodiments can achieve a global optimum of the language-model probability, where "global" denotes the entirety of the punctuation-addition result corresponding to the text to be processed, the optimal punctuation-addition result of the embodiments can improve the accuracy of punctuation addition.
Description of the drawings
Fig. 1 is a flow chart of the steps of an embodiment of a processing method of the present invention;
Fig. 2 is a schematic diagram of path planning over the global word sequence corresponding to a text to be processed in an embodiment of the present invention;
Fig. 3 is a structural block diagram of an embodiment of a processing apparatus of the present invention;
Fig. 4 is a block diagram of a device for information processing, shown according to an exemplary embodiment, when used as a terminal; and
Fig. 5 is a block diagram of a device for information processing, shown according to an exemplary embodiment, when used as a server.
Detailed description of embodiments
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments.
An embodiment of the present invention provides a processing scheme. The scheme can segment a text to be processed to obtain its corresponding global word sequence, perform punctuation-addition processing on the global word sequence to obtain the optimal punctuation-addition result corresponding to the text, and output that result. The punctuation-addition processing of the embodiment adds target punctuation marks between adjacent words in the global word sequence, where a target punctuation mark can denote the best candidate punctuation mark added between adjacent words. The language-model probability corresponding to the optimal punctuation-addition result obtained by the punctuation-addition processing is optimal, and the optimal punctuation-addition result may include at least one semantic segment, where a semantic segment may include consecutive words of the global word sequence and/or consecutive words with punctuation marks added; the language-model probability can be a combination of the language-model probabilities of all semantic segments that a punctuation-addition result includes. Since the optimal punctuation-addition result of the embodiment can achieve a global optimum of the language-model probability, where "global" denotes the entirety of the punctuation-addition result corresponding to the text to be processed, the optimal punctuation-addition result of the embodiment can improve the accuracy of punctuation addition.
Method embodiments
Referring to Fig. 1, a flow chart of the steps of an embodiment of a processing method of the present invention is shown; the method may specifically include the following steps:
Step 101: obtain a text to be processed;
Step 102: segment the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
Step 103: perform punctuation-addition processing on the global word sequence, to obtain an optimal punctuation-addition result corresponding to the text to be processed; wherein the punctuation-addition processing adds target punctuation marks between adjacent words in the global word sequence, the language-model probability corresponding to the optimal punctuation-addition result is optimal, and the optimal punctuation-addition result may include at least one semantic segment, which may include consecutive words of the global word sequence and/or consecutive words with punctuation marks added;
Step 104: output the optimal punctuation-addition result.
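As a rough end-to-end illustration of steps 101 through 104 (not the patent's implementation), the method amounts to a segment-then-punctuate pipeline. Both components below are deliberately trivial stand-ins: whitespace splitting instead of a real segmenter, and a fixed sentence-final period instead of the language-model-driven punctuation addition.

```python
# Minimal sketch of the step 101-104 pipeline (all components are stand-ins).

def segment(text):
    """Step 102 stand-in: whitespace 'segmentation' instead of a real
    word segmenter (assumption)."""
    return text.split()

def add_punctuation(words):
    """Step 103 stand-in: append a final period only, standing in for the
    LM-driven optimal punctuation-addition result (assumption)."""
    return " ".join(words) + "."

def process(text):
    words = segment(text)             # step 102: global word sequence
    result = add_punctuation(words)   # step 103: optimal punctuation addition
    return result                     # step 104: output

print(process("hello I am Xiao Ming glad to meet you"))
# → hello I am Xiao Ming glad to meet you.
```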
The embodiments of the present invention can be applied to any application scenario requiring punctuation addition, such as speech recognition and machine translation; it will be understood that the embodiments of the present invention do not limit the specific application scenario. For example, in the application scenario of speech recognition, punctuation can be added to the text corresponding to a speech recognition result.
The processing method provided by the embodiments of the present invention can be applied in the application environment of computing devices such as terminals or servers. Optionally, the terminal may include, but is not limited to: a smartphone, a tablet computer, a laptop computer, an in-vehicle computer, a desktop computer, a smart TV, a wearable device, and the like. The server may be a cloud server or an ordinary server, for providing a client with a processing service for the text to be processed.
The processing method provided by the embodiments of the present invention is applicable to the processing of languages such as Chinese, Japanese, and Korean, to improve the accuracy of punctuation addition. It will be appreciated that any language requiring punctuation addition falls within the scope of application of the processing method of the embodiments of the present invention.
In the embodiments of the present invention, the text to be processed denotes the text on which processing is to be performed. The text to be processed may originate from text or speech input by a user through a computing device, or may come from another computing device. It should be noted that the text to be processed may include one language or more than one language; for example, it may include Chinese, or a mixture of Chinese and another language such as English. The embodiments of the present invention do not limit the specific text to be processed.
In practical applications, the computing device of the embodiments of the present invention may execute the processing flow of the embodiments through a client APP (application), and the client application may run on the computing device; for example, the client application may be any APP running on a terminal, in which case the client application may obtain the text to be processed from other applications on the computing device. Alternatively, the computing device of the embodiments of the present invention may execute the processing flow through a functional component of the client application, in which case the functional component may obtain the text to be processed from other functional components. Alternatively, the computing device of the embodiments of the present invention may, as a server, execute the processing method of the embodiments of the present invention.
In an optional embodiment of the present invention, step 101 may obtain the text to be processed according to the speech signal of a speaker; in this case, step 101 may convert the speech signal of the speaker into text information and obtain the text to be processed from that text information. Alternatively, step 101 may directly receive, from a speech recognition device, the text information corresponding to a user's speech signal and obtain the text to be processed from that text information. In practical applications, the speaker may include: a user who utters a speech signal in a simultaneous-interpretation scenario, and/or a user who produces a speech signal through a terminal; the speech signal of the speaker may then be received through a microphone or another speech acquisition device.
Optionally, the speech signal of the speaker can be converted into text information using speech recognition technology. If the speaker's speech signal is denoted S, a corresponding speech feature sequence O is obtained after a series of processing on S, denoted O = {O1, O2, ..., Oi, ..., OT}, where Oi is the i-th speech feature and T is the total number of speech features. The sentence corresponding to the speech signal S can be regarded as a word string composed of many words, denoted W = {w1, w2, ..., wn}. The process of speech recognition is to find the most probable word string W given the known speech feature sequence O.
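In the standard statistical formulation (a textbook identity, not spelled out in the text above), "finding the most probable word string W given O" can be written with Bayes' rule as:

```latex
W^{*} = \arg\max_{W} P(W \mid O)
      = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
      = \arg\max_{W} P(O \mid W)\, P(W)
```

where P(O | W) is supplied by the acoustic model, P(W) by a language model, and P(O) can be dropped because it does not depend on W.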
Specifically, speech recognition is a process of model matching. In this process, a speech model can first be established according to the characteristics of human speech; by analyzing the input speech signal, the required features are extracted, so as to build the templates needed for speech recognition. Recognizing the speech input by the user is then a process of comparing the features of the input speech with the templates, finally determining the optimal template matching the input speech, and thereby obtaining the speech recognition result. The specific speech recognition algorithm may be a training-and-recognition algorithm based on statistical hidden Markov models, a training-and-recognition algorithm based on neural networks, a recognition algorithm based on dynamic time warping, or another algorithm; the embodiments of the present invention do not limit the specific speech recognition process.
In another optional embodiment of the present invention, step 101 may obtain the text to be processed according to text input by the user. For example, text input by a user in scenarios such as instant messaging or office documents may contain no punctuation marks, or only a few, and can therefore serve as a source of the text to be processed.
In practical applications, step 101 may, according to practical requirements, obtain the text to be processed from the text corresponding to a speech signal or from text input by the user. Optionally, the text to be processed can be obtained from the text corresponding to a speech signal S according to the pause intervals of S; for example, when a pause interval of the speech signal S exceeds a time threshold, a corresponding first boundary point can be determined from that time point, the text corresponding to the part of S before the first boundary point can be taken as the text to be processed, and the text corresponding to the part of S after the first boundary point continues to be processed to obtain further texts to be processed. Alternatively and optionally, the text to be processed can be obtained according to the number of characters contained in the text corresponding to the speech signal or in the text input by the user; for example, when that number of characters exceeds a character-count threshold, a corresponding second boundary point can be determined according to the threshold, the text before the second boundary point can be taken as the text to be processed, and the text after the second boundary point continues to be processed to obtain further texts to be processed. It will be appreciated that the embodiments of the present invention do not limit the specific process of obtaining the text to be processed from the text corresponding to a speech signal or from text input by the user.
Since the corpus used to train a language model is usually a segmented corpus, in order to obtain the language-model probabilities of the semantic segments included in the optimal punctuation-addition result, the embodiments of the present invention may segment the above text to be processed through step 102, to obtain the global word sequence corresponding to the text to be processed.
So-called word segmentation is the process of cutting a text into individual words, i.e., recombining continuous text into a global word sequence according to certain rules or models. Taking Chinese word segmentation as an example, the goal of segmentation is to cut the text into individual Chinese words. Cutting a sentence into individual words is the first step in enabling a machine to recognize human language, so segmentation techniques are widely used in natural language processing applications such as text processing, machine translation, speech recognition, text summarization, and text retrieval.
In the embodiments of the present invention, step 102 segments the text to be processed; the segmentation methods that may be used specifically include: segmentation methods based on string matching, segmentation methods based on understanding, segmentation methods based on statistics, and the like. It will be appreciated that the embodiments of the present invention do not limit the specific process of segmenting the text to be processed. In one application example of the present invention, the text to be processed is "hello I am Xiao Ming very glad to meet you" (without punctuation), and its corresponding global word sequence may be: "hello / I am / Xiao Ming / very glad / to meet you".
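As a toy illustration of the string-matching family of segmenters mentioned above, the following sketch implements forward maximum matching against a small made-up dictionary (real systems use much larger dictionaries or statistical models; the dictionary entries here are assumptions chosen to match the patent's running example).

```python
# Forward maximum matching: at each position, take the longest dictionary
# word that matches; fall back to a single character. (Toy dictionary.)

DICT = {"你好", "我是", "小明", "很高兴", "认识", "你"}
MAX_LEN = max(len(w) for w in DICT)

def forward_max_match(text):
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in DICT:   # single char is the fallback
                words.append(cand)
                i += length
                break
    return words

print(forward_max_match("你好我是小明很高兴认识你"))
# → ['你好', '我是', '小明', '很高兴', '认识', '你']
```

Forward maximum matching is greedy and purely dictionary-driven, which is why the statistics-based methods mentioned above generally segment ambiguous strings more accurately.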
It should be noted that the segmentation of the text to be processed in step 102 and the speech recognition process can be mutually independent processes; the segmentation in step 102 need not be affected by the speech recognition process. For example, step 102 can perform word segmentation on the sentence W corresponding to the speech signal S.
In an optional embodiment of the present invention, the method of the embodiments may further include: writing at least one text to be processed obtained by step 101 into a buffer; step 102 can then first read a text to be processed from the buffer and segment the text that was read. Optionally, a data structure such as a queue, an array, or a linked list can be established in the memory of the computing device as the buffer; the embodiments of the present invention do not limit the specific buffer. Storing the texts to be processed in a buffer can improve processing efficiency; it will be appreciated that storing the texts to be processed on disk is also feasible, and the embodiments of the present invention do not limit the specific storage of the texts to be processed.
In the embodiments of the present invention, corresponding candidate punctuation marks can be added between adjacent words in the global word sequence corresponding to the text to be processed; that is, punctuation-addition processing can be performed on the text to be processed according to the various candidate punctuation marks that can be added between adjacent words in its global word sequence. In this way, the global word sequence of one text to be processed corresponds to multiple punctuation-addition schemes and their corresponding punctuation-addition results, and what the embodiments of the present invention finally obtain is the optimal punctuation-addition result with the optimal language-model probability. The language-model probability can be a combination of the language-model probabilities of all the semantic segments that an (arbitrary) punctuation-addition result includes.
In the field of natural language processing, a language model is a probabilistic model established for one or more languages; its purpose is to establish a distribution describing the probability of a given global word sequence appearing in the language. In the embodiments of the present invention, the probability, described by the language model, of a given global word sequence appearing in the language is called the language-model probability; moreover, the given global word sequence described by the language model can carry punctuation marks. Optionally, corpus sentences can be obtained from a corpus and segmented, and the language model can be trained on the resulting global word sequences that include punctuation marks. For example, "I like dogs, dogs play ball." may correspond to the global word sequence "I / like / dogs / , / dogs / play / ball / .". It will be appreciated that the embodiments of the present invention do not limit the specific global word sequences used for training the language model.
In the embodiment of the present invention, the language model may include an N-gram language model and/or a neural network language model, where the neural network language model may further include an RNNLM (Recurrent Neural Network Language Model), a CNNLM (Convolutional Neural Network Language Model), a DNNLM (Deep Neural Network Language Model), and the like.
The N-gram language model is based on the assumption that the occurrence of the N-th word is related only to the preceding N-1 words and is independent of all other words, so the probability of a whole sentence is the product of the occurrence probabilities of its words.
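The product factorization above can be sketched as follows; this is a minimal illustration, not the patent's implementation, and the bigram probability table is invented purely for demonstration (a real model would be estimated from a corpus).

```python
# Toy N-gram factorization (here N = 2, a bigram model): the sentence
# probability is the product of per-word conditional probabilities.
from functools import reduce

# Hypothetical probabilities, invented for demonstration only.
bigram_prob = {
    ("<s>", "hello"): 0.4,
    ("hello", ","): 0.5,
    (",", "I"): 0.3,
    ("I", "am"): 0.6,
}

def sentence_probability(words, probs, fallback=1e-6):
    """P(w1..wn) ~= product of P(w_i | w_{i-1}); unseen pairs get a small fallback."""
    padded = ["<s>"] + words
    pairs = zip(padded, padded[1:])
    return reduce(lambda p, pair: p * probs.get(pair, fallback), pairs, 1.0)

p = sentence_probability(["hello", ",", "I", "am"], bigram_prob)
```

Note that character units here include punctuation marks as well as words, consistent with training the model on word sequences that carry punctuation.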
Since the N-gram language model predicts the N-th word using only the limited N-1 preceding words, it can describe the language model probability of semantic segments whose length is N; for example, N may be a relatively fixed value such as 3 or 5, i.e., a positive integer smaller than a first length threshold. Relative to the N-gram language model, one advantage of a neural network language model such as the RNNLM is that it can truly use the entire preceding context to predict the next word; the RNNLM can therefore describe the language model probability of semantic segments of variable length. That is, the RNNLM is suitable for semantic segments in a wider length range; for example, the length range of the semantic segments corresponding to the RNNLM may be from 1 to a second length threshold, where the second length threshold is greater than the first length threshold.
In the embodiment of the present invention, a semantic segment may indicate a part of the global word sequence to which punctuation marks have been added. The semantic segment may include: continuous words of the global word sequence (that is, without punctuation marks), and/or continuous words with punctuation marks added. Optionally, a part may be intercepted from the global word sequence to obtain the continuous words. For example, for the global word sequence "hello/,/I am/Xiao Ming/very glad/to meet you", corresponding semantic segments may include "hello/,/I am", "I am/Xiao Ming/very glad", and so on, where "/" is a symbol introduced in the application documents merely for convenience of explanation; "/" indicates the boundary between words and/or between a word and a punctuation mark, and in practical applications "/" may have no meaning at all.
It should be noted that those skilled in the art may determine the candidate punctuation marks to be added according to practical application requirements. Optionally, the candidate punctuation marks may include: comma, question mark, full stop, exclamation mark, space, and the like. The space may play the role of segmenting words, or may play no role at all; for example, in English a space can be used to separate different words, while in Chinese a space can be a punctuation mark that plays no role. It can be understood that the embodiment of the present invention does not limit the specific candidate punctuation marks.
The embodiment of the present invention may provide the following technical solutions for performing punctuation addition processing on the global word sequence to obtain the optimal punctuation addition result corresponding to the pending text:
Technical solution 1:

Technical solution 1 may include: obtaining a variety of punctuation addition results corresponding to the global word sequence; determining the language model probability corresponding to each punctuation addition result; and selecting, from the variety of punctuation addition results corresponding to the global word sequence, the punctuation addition result whose language model probability is optimal as the optimal punctuation addition result corresponding to the pending text.
In practical applications, a path planning algorithm may be used to obtain the variety of punctuation addition results corresponding to the global word sequence. The principle of the path planning algorithm is to find, in an environment with obstacles and according to a certain evaluation criterion, a collision-free path from an initial state to a target state. Specific to the embodiment of the present invention, an obstacle may represent a candidate punctuation mark added between adjacent words of the global word sequence corresponding to the pending text, and the initial state and the target state respectively represent the first word of the global word sequence and the punctuation mark after its last word.
Referring to Fig. 2, a schematic diagram of path planning for the global word sequence corresponding to a pending text according to an embodiment of the present invention is shown. The global word sequence corresponding to the pending text is "hello/I am/Xiao Ming/very glad/to meet you", so candidate punctuation marks may be added between any adjacent words of this sequence. In Fig. 2, the words "hello", "I am", "Xiao Ming", "very glad", and "to meet you" are each represented by a rectangle, and punctuation marks such as the comma, space, exclamation mark, question mark, and full stop are each represented by a circle; multiple paths can therefore exist between the first word "hello" of the global word sequence and the punctuation mark after its last word "to meet you".
It can be appreciated that the path planning algorithm is merely an optional embodiment of the present invention; in fact, those skilled in the art may use other algorithms to obtain the variety of punctuation addition results corresponding to the pending text according to practical application requirements. It can be understood that the embodiment of the present invention does not limit the specific algorithm for obtaining the variety of punctuation addition results.
In practical applications, a language model may be used to determine the language model probability corresponding to a punctuation addition result; the language model may accordingly include an N-gram language model and/or a neural network language model.
In an optional embodiment of the present invention, the process of determining the language model probability corresponding to a punctuation addition result may include: determining, for each third semantic segment included in a punctuation addition result, the corresponding language model probability; and fusing the language model probabilities corresponding to all third semantic segments included in that punctuation addition result, to obtain the corresponding language model probability. The punctuation addition result with the highest language model probability may then be obtained from all punctuation addition results as the optimal punctuation addition result corresponding to the pending text.
Optionally, a number of third semantic segments may be obtained from the punctuation addition result in front-to-back order by a sliding-window manner; different third semantic segments may include the same number of character units, adjacent third semantic segments may include repeated character units, and a character unit may include a word and/or a punctuation mark. In this case, the language model probability corresponding to each third semantic segment may be determined by the N-gram language model and/or the neural network language model. Assuming N = 5 and that the first character unit is numbered 1, third semantic segments of length 5 may be obtained from the punctuation addition result at the number positions 1-5, 2-6, 3-7, 4-8, and so on, and the language model probability corresponding to each third semantic segment may be determined using the N-gram language model; for example, each third semantic segment may be input into the N-gram model, which then outputs the corresponding language model probability.
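The sliding-window extraction just described can be sketched as follows, assuming a window length of N = 5 and a step of 1 between adjacent windows (the example sequence is the one used throughout this document, with punctuation marks counted as character units):

```python
# Extract the "third semantic segments": fixed-length windows of character
# units (words and punctuation marks) slid over a punctuation addition result.
def sliding_segments(units, n=5, step=1):
    """Return all length-n windows of character units; adjacent windows
    overlap by n - step units."""
    return [tuple(units[i:i + n]) for i in range(0, len(units) - n + 1, step)]

result = ["hello", ",", "I am", "Xiao Ming", "very glad", "to meet you", "."]
segments = sliding_segments(result)
```

Each returned tuple would then be scored by the N-gram model, and the per-window probabilities fused into one score for the whole result.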
Optionally, the above fusion of the language model probabilities corresponding to all third semantic segments included in each punctuation addition result may include: performing summation, product, or weighted average processing on the language model probabilities corresponding to all third semantic segments included in each punctuation addition result. It can be understood that the embodiment of the present invention does not limit the specific process of fusing the language model probabilities corresponding to all third semantic segments included in each punctuation addition result.
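The three fusion options named above can be written out directly; this is an illustrative sketch only, and the probability values are invented for demonstration:

```python
# Fusion operators for combining per-segment language model probabilities:
# summation, product, and weighted average, as enumerated in the text.
import math

def fuse_sum(probs):
    return sum(probs)

def fuse_product(probs):
    # The product is taken in log space for numerical stability.
    return math.exp(sum(math.log(p) for p in probs))

def fuse_weighted_average(probs, weights):
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

probs = [0.5, 0.25, 0.125]  # hypothetical per-segment probabilities
```

In practice the product (or equivalently a sum of log-probabilities) is the most common choice, since it matches the chain-rule factorization of a language model.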
In another optional embodiment of the present invention, the process of determining the language model probability corresponding to a punctuation addition result may include: determining, using a neural network language model, the language model probability corresponding to all semantic segments of each punctuation addition result; the punctuation addition result with the highest language model probability may then be obtained from all punctuation addition results as the optimal punctuation addition result corresponding to the pending text. Since the RNNLM is suitable for semantic segments in a wide length range, all semantic segments of each punctuation addition result may be taken as a whole, and the language model probability corresponding to all semantic segments of the result may be determined by the RNNLM; for example, all character units included in the punctuation addition result may be input into the RNNLM, which then outputs the corresponding language model probability.
Technical solution 2:

Technical solution 2 may include: performing punctuation addition processing on the global word sequence using a dynamic programming algorithm, to obtain the optimal punctuation addition result corresponding to the pending text.
The principle of the dynamic programming algorithm is to split a problem and define states and the relations between states, so that the problem can be solved in a recursive (or divide-and-conquer) manner. Specific to the embodiment of the present invention, the problem may be: finding the optimal punctuation addition result corresponding to the pending text; a state may be a decomposition of the punctuation addition processing of the global word sequence, used to obtain a partial optimal punctuation addition result corresponding to the pending text and its corresponding target punctuation marks, where a target punctuation mark may indicate the best candidate punctuation mark to be added between adjacent words. Relative to technical solution 1, which exhaustively enumerates the variety of punctuation addition results corresponding to the global word sequence and selects from them the one whose language model probability is optimal, the dynamic programming algorithm used by technical solution 2 can reduce the amount of computation, and the reduction grows ever larger as the length of the global word sequence corresponding to the pending text increases.
The embodiment of the present invention may provide the following dynamic programming schemes for performing punctuation addition processing on the global word sequence using a dynamic programming algorithm, to obtain the optimal punctuation addition result corresponding to the pending text:
Dynamic programming scheme 1:

In dynamic programming scheme 1, the above use of a dynamic programming algorithm to perform punctuation addition processing on the global word sequence, to obtain the optimal punctuation addition result corresponding to the pending text, may specifically include:

obtaining the word sequence set corresponding to the global word sequence;

determining, in order of subset size from small to large and by recursion, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset, where the language model probability corresponding to the optimal subset punctuation addition result is optimal; and

obtaining the optimal punctuation addition result corresponding to the pending text according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set.
The word sequence set may indicate the set of word sequences composed of continuous words included in the global word sequence. Optionally, a subset of the word sequence set may be composed of the first i continuous words of the global word sequence. For example, the word sequence set corresponding to the global word sequence [C1C2…CM] may include {C1, C1C2, C1C2C3, …, C1C2…CM}; in order of subset length (that is, the number of words a subset includes) from small to large, the subsets included in the word sequence set may be expressed as {C1}, {C1C2}, {C1C2C3}, …, {C1C2…CM}, where Ci indicates the i-th word included in the pending text, i is a positive integer greater than 0, M indicates the number of words of the pending text (that is, the length of the global word sequence), and M is a positive integer. It can be appreciated that a length difference of 1 between adjacent subsets of the word sequence set corresponding to the global word sequence [C1C2…CM] is merely an optional embodiment; in fact, the length difference between adjacent subsets of this word sequence set may also be greater than 1.
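The prefix construction of the word sequence set can be sketched as below, with the step between adjacent subsets exposed as a parameter to match the remark that the length difference need not be 1:

```python
# Build the "word sequence set": the prefixes [C1], [C1 C2], ..., [C1..CM]
# of the global word sequence, each prefix being one subset of the recursion.
def word_sequence_set(words, step=1):
    """Return the prefix subsets of the global word sequence, with adjacent
    subsets differing in length by `step`."""
    return [tuple(words[:i]) for i in range(step, len(words) + 1, step)]

subsets = word_sequence_set(["C1", "C2", "C3", "C4"])
```

For instance, `word_sequence_set(["C1", "C2", "C3", "C4"], step=2)` would yield only the prefixes of length 2 and 4, corresponding to a length difference greater than 1 between adjacent subsets.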
For each subset of the word sequence set, the corresponding subset punctuation addition results each have a language model probability; the embodiment of the present invention may therefore determine, for each subset, the target punctuation marks of its optimal subset punctuation addition result. The target punctuation marks of the optimal subset punctuation addition result may indicate which punctuation marks separate adjacent words when the subset's punctuation addition result is optimal. Assuming that the optimal subset punctuation addition result corresponding to the subset {C1C2C3} is {(C1), (C2C3)}, the adjacent words "C1" and "C2" in the subset {C1C2C3} are separated by ",", and the adjacent words "C2" and "C3" are separated by a space; the corresponding target punctuation mark may be expressed as the number 1 of "C1" together with a comma. It can be understood that the embodiment of the present invention does not limit the specific representation of target punctuation marks.
According to the subset order from small to large of the word sequence set, the embodiment of the present invention may determine by recursion the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset. Assuming that the subsets, in order from small to large, are expressed as G1, G2, G3, …, Gu, where u is a positive integer, the target punctuation marks of the optimal subset punctuation addition results corresponding to G1, G2, G3, …, Gu may be obtained in turn. Moreover, for Go (1 ≤ o ≤ u), the optimal subset punctuation addition results of the subsets before Go (such as Go-1, Go-2, etc.) are needed to determine the target punctuation marks of the optimal subset punctuation addition result corresponding to Go; specifically, Go may reuse the optimal subset punctuation addition results of the subsets before Go. For example, the punctuation addition processing between the first 3 continuous words of the subset {C1C2C3C4} may reuse the optimal subset punctuation addition result of the subset {C1C2C3}.
In an optional embodiment of the present invention, a subset of the word sequence set may include the first i continuous words of the pending text, where 0 < i ≤ M and M is the number of words included in the pending text. Determining, in order of subset size from small to large and by recursion, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset may then specifically include:

adding punctuation marks between adjacent words in the first i continuous words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k continuous words, to obtain at least one subset punctuation addition path corresponding to the first i continuous words, where 0 < k < i and k is a positive integer;

determining, using a neural network language model, the language model probability of the first semantic segment corresponding to each subset punctuation addition path;

selecting, according to the language model probabilities of the first semantic segments, the optimal subset punctuation addition path whose language model probability is optimal from the at least one subset punctuation addition path; and

obtaining, according to the punctuation marks included in the optimal subset punctuation addition path, the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i continuous words.
A subset punctuation addition path may indicate a path whose initial state is the first word of the subset and whose target state is the punctuation mark after the last word of the subset. Optionally, punctuation marks may be added between adjacent words of the first i continuous words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k continuous words, with candidate punctuation marks added between the k-th word and the i-th word, so that at least one subset punctuation addition path corresponding to the first i continuous words can be obtained. Each subset punctuation addition path may correspond to a first semantic segment, which may indicate the punctuation addition result corresponding to the first i continuous words.
Since the RNNLM is suitable for semantic segments in a wide length range (for example, the length of the semantic segments corresponding to the RNNLM may range from 1 to the second length threshold), for 0 < i ≤ M the embodiment of the present invention may use a neural network language model to determine the language model probability of the first semantic segment corresponding to each subset punctuation addition path.
Since a variety of punctuation marks may be added between a pair of adjacent words of the first i continuous words, the number of subset punctuation addition paths corresponding to the first i continuous words is usually greater than 1. The embodiment of the present invention may therefore select, according to the language model probabilities of the first semantic segments, the optimal subset punctuation addition path whose language model probability is optimal from the at least one subset punctuation addition path, and obtain, according to the punctuation marks included in the optimal subset punctuation addition path, the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i continuous words. Optionally, punctuation marks may further be added between adjacent words in the first j continuous words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i continuous words, to obtain at least one subset punctuation addition path corresponding to the first j continuous words, where j > i and j is a positive integer.
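The prefix recursion of dynamic programming scheme 1 can be sketched compactly under invented assumptions: adjacent subsets differ by one word (k = i - 1), the candidate marks are reduced to a comma and a space, and `toy_lm_score` stands in for the RNNLM probability of a first semantic segment.

```python
# Dynamic programming scheme 1 (sketch): extend the optimal result of the
# previous prefix with each candidate mark in the new gap, keep the best.
words = ["hello", "I am", "Xiao Ming"]
candidates = ["", ","]  # "" plays the role of a space

def toy_lm_score(segment):
    # Hypothetical scorer standing in for a language model: rewards a comma
    # immediately after "hello".
    return sum(1.0 for a, b in zip(segment, segment[1:])
               if a == "hello" and b == ",")

best = [("hello",)]  # optimal subset punctuation addition result for i = 1
for i in range(1, len(words)):
    # Reuse the previous prefix's optimal result; try each candidate mark
    # in the single new gap, scoring the whole first semantic segment.
    extensions = []
    for mark in candidates:
        seg = best[-1] + ((mark,) if mark else ()) + (words[i],)
        extensions.append(seg)
    best.append(max(extensions, key=toy_lm_score))

optimal = best[-1]  # result for the maximal subset = whole pending text
```

Each prefix is scored once per candidate mark instead of once per full punctuation assignment, which is the computational saving the scheme claims over exhaustive enumeration.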
Optionally, obtaining the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i continuous words according to the punctuation marks included in the optimal subset punctuation addition path may include: recording the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset; or recording the mapping relations between the information of each subset and the target punctuation marks of its corresponding optimal subset punctuation addition result, to obtain corresponding recorded content. The information of a subset may include: the number information of the last word corresponding to the subset, and/or the number information corresponding to the subset, and the like. For example, for the first i continuous words, the corresponding number information may be i, corresponding to the information of the last word, namely the i-th word. It can be appreciated that the embodiment of the present invention does not limit the specific information of a subset. When recording the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset, all target punctuation marks of each subset's optimal subset punctuation addition result may be recorded, or only the partial target punctuation marks of each subset's optimal subset punctuation addition result that differ from those of the adjacent previous subset may be recorded.
In an optional embodiment of the present invention, obtaining the optimal punctuation addition result corresponding to the pending text according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set may specifically include:

taking the optimal subset punctuation addition result corresponding to the maximal subset of the word sequence set as the optimal punctuation addition result corresponding to the pending text; and/or

punctuating the word sequence according to all target punctuation marks of the optimal subset punctuation addition result corresponding to the maximal subset of the word sequence set, to obtain the optimal punctuation addition result corresponding to the pending text; and/or

punctuating the word sequence according to the partial target punctuation marks of the optimal subset punctuation addition result corresponding to each subset of the word sequence set, to obtain the optimal punctuation addition result corresponding to the pending text.
In summary, dynamic programming scheme 1 determines, in order of subset size from small to large and by recursion, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset of the word sequence set, and obtains the optimal punctuation addition result corresponding to the pending text according to the optimal subset punctuation addition results corresponding to the subsets, where the language model probability corresponding to an optimal subset punctuation addition result is optimal. Because a later subset covers the subsets before it, the later subset can reuse the target punctuation marks of the optimal subset punctuation addition results of the earlier subsets, so the amount of computation needed to obtain the optimal punctuation addition result can be reduced by recursion. Moreover, in order from small to large, the subsets gradually cover the semantic segments included in the word sequence; the subsets can therefore gradually make the language model probabilities corresponding to the semantic segments included in the word sequence optimal.
Dynamic programming scheme 2:

In dynamic programming scheme 2, the above use of a dynamic programming algorithm to perform punctuation addition processing on the global word sequence, to obtain the optimal punctuation addition result corresponding to the pending text, may specifically include:

adding punctuation marks between adjacent words in the global word sequence, to obtain the global punctuation addition paths corresponding to the global word sequence;

obtaining, in front-to-back order and by a sliding-window manner, local punctuation addition paths and their corresponding second semantic segments from the global punctuation addition paths, where different second semantic segments include the same number of character units, adjacent second semantic segments include repeated character units, and a character unit includes a word and/or a punctuation mark;

determining, in front-to-back order and by recursion, the target punctuation marks corresponding to the optimal second semantic segments, where the language model probability corresponding to an optimal second semantic segment is optimal; and

obtaining the optimal punctuation addition result corresponding to the pending text according to the target punctuation marks corresponding to the optimal second semantic segments.
Dynamic programming scheme 2 obtains, in front-to-back order and by a sliding-window manner, second semantic segments of identical length (that is, containing the same number of character units) with repeated character units from the global punctuation addition paths, and determines, in front-to-back order and by recursion, the target punctuation marks corresponding to the optimal second semantic segments. The process of obtaining the global punctuation addition paths may refer to Fig. 2; the embodiment of the present invention does not limit the specific process of obtaining the global punctuation addition paths. A local punctuation addition path may indicate a part of a global punctuation addition path, and each local punctuation addition path may correspond to a second semantic segment.
In practical applications, the language model probability corresponding to a second semantic segment may be determined by the N-gram language model. Assuming N = 5, the length of a second semantic segment may be 5; assuming the first character unit of the word sequence is numbered 1, second semantic segments of length 5 may be obtained at the number positions 1-5, 2-6, 3-7, 4-8, and so on, and the language model probability corresponding to each second semantic segment may be determined using the N-gram language model; for example, each second semantic segment may be input into the N-gram model, which then outputs the corresponding language model probability. Of course, the language model probability corresponding to a second semantic segment may also be determined by a neural network language model (such as a recurrent neural network language model); the embodiment of the present invention does not limit the specific process of determining the language model probability corresponding to a second semantic segment. It can be appreciated that a move distance of 1 between adjacent second semantic segments is merely an example; in fact, those skilled in the art may determine the move distance between adjacent second semantic segments according to practical application requirements, and the move distance may also be 2, 3, and so on.
In an optional embodiment of the present invention, determining, in front-to-back order and by recursion, the target punctuation marks corresponding to the optimal second semantic segments may specifically include:

determining, using the N-gram language model and/or a neural network language model, the language model probability corresponding to each current second semantic segment;

selecting the optimal current second semantic segment from the variety of current second semantic segments according to the language model probabilities corresponding to the current second semantic segments;

taking the punctuation marks included in the optimal current second semantic segment as the target punctuation marks corresponding to the optimal current second semantic segment; and

obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
A current second semantic segment may indicate the second semantic segment corresponding to a local punctuation addition path in the recursive process. Assuming the number of the current second semantic segment is k, where k is a positive integer, the N-gram language model and/or a neural network language model may be used to determine the language model probability corresponding to each k-th second semantic segment; the optimal k-th second semantic segment whose language model probability is optimal may be selected from the variety of k-th second semantic segments; the punctuation marks included in the optimal k-th second semantic segment may be taken as the corresponding target punctuation marks; and the (k+1)-th second semantic segment may be obtained according to the target punctuation marks corresponding to the optimal k-th second semantic segment, where the (k+1)-th second semantic segment can reuse the target punctuation marks corresponding to the optimal k-th second semantic segment. Taking Fig. 2 as an example, assuming the length of a second semantic segment is 5 and the optimal 1st second semantic segment is "hello/,/I am/space/Xiao Ming", the 2nd second semantic segment "punctuation mark/I am/punctuation mark/Xiao Ming/punctuation mark" can reuse the target punctuation marks corresponding to the optimal 1st second semantic segment; the 2nd second semantic segment may thus add a punctuation mark on the basis of ",/I am/space/Xiao Ming/punctuation mark", so that the optimal punctuation mark after "Xiao Ming" can be selected from the variety of punctuation marks.
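The scheme-2 recursion can be sketched as below, under invented assumptions: a window length of 3 rather than 5, a move distance of 1, candidate marks reduced to a comma and a space, and `toy_ngram_score` standing in for the N-gram language model probability of a second semantic segment.

```python
# Dynamic programming scheme 2 (sketch): fix the punctuation marks gap by gap;
# each step reuses the marks already chosen and scores only a sliding window
# of the most recent character units (the current "second semantic segment").
words = ["hello", "I am", "Xiao Ming", "very glad"]
candidates = ["", ","]  # "" plays the role of a space

def toy_ngram_score(segment):
    # Hypothetical scorer: rewards a comma after "hello" or after "Xiao Ming".
    pairs = set(zip(segment, segment[1:]))
    return sum(1.0 for a, b in pairs if b == "," and a in ("hello", "Xiao Ming"))

chosen = []          # one chosen mark per gap between adjacent words
units = [words[0]]   # character units fixed so far
for nxt in words[1:]:
    variants = []
    for mark in candidates:
        tail = units + ([mark] if mark else []) + [nxt]
        variants.append((mark, tail))
    # Score only the last 3 character units: the current second semantic
    # segment overlaps the previous one, reusing its target marks.
    mark, units = max(variants, key=lambda mv: toy_ngram_score(mv[1][-3:]))
    chosen.append(mark)
```

Because each window shares all but one position with its predecessor, each new gap is scored in constant time rather than by rescoring the whole sequence.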
In practical applications, obtaining the optimal punctuation addition result corresponding to the pending text according to the target punctuation marks corresponding to the optimal second semantic segments may specifically include: adding punctuation marks to the global word sequence in back-to-front or front-to-back order according to the target punctuation marks corresponding to the optimal second semantic segments, to obtain the optimal punctuation addition result corresponding to the pending text. That is, the target punctuation marks corresponding to the global punctuation addition path (between adjacent words, at each determined punctuation mark position) may be obtained in a certain order, and the optimal punctuation addition result corresponding to the pending text may be obtained according to these target punctuation marks.
In summary, dynamic programming scheme 2 obtains, in front-to-back order and by a sliding-window manner, local punctuation addition paths and their corresponding second semantic segments from the global punctuation addition paths, and determines, in front-to-back order and by recursion, the target punctuation marks corresponding to the optimal second semantic segments. Because adjacent second semantic segments include repeated character units, the next second semantic segment can reuse the target punctuation marks corresponding to the optimal current second semantic segment, so the amount of computation needed to obtain the optimal punctuation addition result can be reduced by recursion. Moreover, since there is a move distance between adjacent second semantic segments, the embodiment of the present invention can, through the optimality of each second semantic segment's language model probability, make the language model probabilities corresponding to all second semantic segments optimal.
In step 104, the optimal punctuation addition result obtained in step 103 may be output. It can be appreciated that those skilled in the art may output the optimal punctuation addition result obtained in step 103 according to practical application requirements. For example, the optimal punctuation addition result obtained in step 103 may be displayed on a display device of the current computing device; for another example, the current computing device may forward the optimal punctuation addition result obtained in step 103 to other computing devices. For example, when the current computing device is a server, the other computing devices may be clients or other servers.
To sum up, in the processing method of the embodiment of the present invention, the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence corresponding to the text to be processed, and the language model probability corresponding to the optimal punctuation addition result of this processing is optimal. The optimal punctuation addition result may include at least one semantic segment, and a semantic segment may include continuous words of the global word sequence and/or continuous words to which punctuation marks have been added. Since the language model probability can be a synthesis of the language model probabilities of all the semantic segments included in the optimal punctuation addition result, the optimal punctuation addition result of the embodiment of the present invention can achieve a global optimum of the language model probability, where "global" refers to the whole of the punctuation addition result corresponding to the text to be processed; therefore, the optimal punctuation addition result of the embodiment of the present invention can improve the accuracy of punctuation addition.
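The "synthesis" of segment probabilities can be illustrated with a short sketch (the names `candidate_log_prob`, `best_candidate`, and `seg_log_prob` are assumptions for illustration, not terms from the patent): the overall language model probability of a candidate result is the product of its segments' probabilities, i.e. the sum of their log-probabilities, and the optimal result is the candidate maximizing that sum.

```python
def candidate_log_prob(segments, seg_log_prob):
    # Overall score of one punctuation addition result: the sum of the
    # log-probabilities of its semantic segments (i.e. the product of
    # the raw segment probabilities — the "synthesis" described above).
    return sum(seg_log_prob(seg) for seg in segments)

def best_candidate(candidates, seg_log_prob):
    # The optimal punctuation addition result is the candidate whose
    # synthesized language model probability is highest.
    return max(candidates, key=lambda c: candidate_log_prob(c, seg_log_prob))
```

Any trained language model can supply `seg_log_prob`; a dictionary lookup is enough to exercise the selection logic.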
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Device embodiment
Referring to Fig. 3, a structural block diagram of an embodiment of a processing device of the present invention is shown, which may specifically include: a to-be-processed text acquisition module 301, a word segmentation module 302, a punctuation addition processing module 303, and a result output module 304.
The to-be-processed text acquisition module 301 is configured to obtain a text to be processed;
the word segmentation module 302 is configured to segment the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
the punctuation addition processing module 303 is configured to perform punctuation addition processing on the global word sequence, to obtain an optimal punctuation addition result corresponding to the text to be processed; wherein the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result may include at least one semantic segment, a semantic segment possibly including continuous words of the global word sequence and/or continuous words to which punctuation marks are added; and
the result output module 304 is configured to output the optimal punctuation addition result.
Optionally, the punctuation addition processing module 303 may include:
a dynamic programming processing submodule, configured to perform punctuation addition processing on the global word sequence by using a dynamic programming algorithm, to obtain the optimal punctuation addition result corresponding to the text to be processed.
Optionally, the dynamic programming processing submodule may include:
a set acquisition unit, configured to obtain a word sequence set corresponding to the global word sequence;
a first recursion unit, configured to determine, in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset by recursion, wherein the language model probability corresponding to the optimal subset punctuation addition result is optimal; and
a first optimal result acquisition unit, configured to obtain the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set.
Optionally, a subset of the word sequence set may include the first i continuous words of the text to be processed, where 0 < i ≤ M and M is the number of words included in the text to be processed; the first recursion unit may then include:
an addition subelement, configured to add punctuation marks between adjacent words in the first i continuous words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k continuous words, to obtain at least one subset punctuation addition path corresponding to the first i continuous words, wherein 0 < k < i and k is a positive integer;
a first language model probability determination subelement, configured to determine, by using a neural network language model, the language model probability of the first semantic segment corresponding to a subset punctuation addition path;
a first selection subelement, configured to select, according to the language model probabilities of the first semantic segments, the optimal subset punctuation addition path whose language model probability is optimal from the at least one subset punctuation addition path; and
a target punctuation mark acquisition subelement, configured to obtain the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i continuous words according to the punctuation marks included in the optimal subset punctuation addition path.
Optionally, the dynamic programming processing submodule may include:
a global path acquisition unit, configured to add punctuation marks between adjacent words in the global word sequence, to obtain a global punctuation addition path corresponding to the global word sequence;
a movement acquisition unit, configured to obtain local punctuation addition paths and their corresponding second semantic segments from the global punctuation addition path by moving in front-to-back order, wherein different second semantic segments include the same number of character units, adjacent second semantic segments have repeated character units, and a character unit may include a word and/or a punctuation mark;
a second recursion unit, configured to determine, in front-to-back order, the target punctuation marks corresponding to the optimal second semantic segments by recursion, wherein the language model probability corresponding to an optimal second semantic segment is optimal; and
a second optimal result acquisition unit, configured to obtain the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments.
Optionally, the second recursion unit may include:
a second language model probability determination subelement, configured to determine the language model probability corresponding to a current second semantic segment by using an N-gram language model and/or a neural network language model;
a second selection subelement, configured to select the optimal current second semantic segment from multiple current second semantic segments according to the language model probabilities corresponding to the current second semantic segments;
a target punctuation mark determination subelement, configured to use the punctuation marks included in the optimal current second semantic segment as the target punctuation marks corresponding to the optimal current second semantic segment; and
a second semantic segment determination subelement, configured to obtain the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
Optionally, the second optimal result acquisition unit may include:
an addition subelement, configured to add punctuation marks to the global word sequence according to the target punctuation marks corresponding to the optimal second semantic segments, in back-to-front or front-to-back order, to obtain the optimal punctuation addition result corresponding to the text to be processed.
Optionally, the punctuation addition processing module 303 may include:
a result exhaustion submodule, configured to obtain multiple punctuation addition results corresponding to the global word sequence;
a language model probability determination submodule, configured to determine the language model probabilities corresponding to the punctuation addition results; and
a result selection submodule, configured to select, from the multiple punctuation addition results corresponding to the global word sequence, the punctuation addition result whose language model probability is optimal as the optimal punctuation addition result corresponding to the text to be processed.
As for the device embodiments, since they are basically similar to the method embodiments, the description is relatively simple; for relevant parts, reference may be made to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
With regard to the devices in the above embodiments, the specific manners in which the respective modules perform operations have been described in detail in the embodiments of the related methods, and will not be elaborated here.
Fig. 4 is a block diagram of a device for information processing when implemented as a terminal, according to an exemplary embodiment. For example, the terminal 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Fig. 4, the terminal 900 may include one or more of the following components: a processing component 902, a memory 904, a power supply component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls the overall operation of the terminal 900, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions, so as to perform all or part of the steps of the methods described above. In addition, the processing component 902 may include one or more modules to facilitate interaction between the processing component 902 and other components. For example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support the operations in the terminal 900. Examples of such data include instructions for any application program or method operated on the terminal 900, contact data, phone book data, messages, pictures, videos, and the like. The memory 904 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc.
The power supply component 906 provides power for the various components of the terminal 900. The power supply component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 900.
The multimedia component 908 includes a screen that provides an output interface between the terminal 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. When the terminal 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC), which is configured to receive external audio signals when the terminal 900 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 904 or sent via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing state assessments of various aspects of the terminal 900. For example, the sensor component 914 can detect the open/closed state of the terminal 900 and the relative positioning of components, for example, the display and the keypad of the terminal 900; the sensor component 914 can also detect a position change of the terminal 900 or a component of the terminal 900, the presence or absence of user contact with the terminal 900, the orientation or acceleration/deceleration of the terminal 900, and temperature changes of the terminal 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 914 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the terminal 900 and other devices. The terminal 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 900 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is further provided, for example the memory 904 including instructions, which can be executed by the processor 920 of the terminal 900 to complete the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 5 is a block diagram of a device for information processing when implemented as a server, according to an exemplary embodiment. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient storage or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930, and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is further provided, for example the memory 1932 including instructions, which can be executed by the processor 1922 of the server 1900 to complete the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium is provided, wherein when the instructions in the storage medium are executed by the processor of a device (a terminal or a server), the device is enabled to perform a processing method, the method including: obtaining a text to be processed; segmenting the text to be processed to obtain a global word sequence corresponding to the text to be processed; performing punctuation addition processing on the global word sequence to obtain an optimal punctuation addition result corresponding to the text to be processed, wherein the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result includes at least one semantic segment, the semantic segment including continuous words of the global word sequence and/or continuous words to which punctuation marks are added; and outputting the optimal punctuation addition result.
Optionally, the performing punctuation addition processing on the global word sequence to obtain the optimal punctuation addition result corresponding to the text to be processed includes: performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm, to obtain the optimal punctuation addition result corresponding to the text to be processed.
Optionally, the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain the optimal punctuation addition result corresponding to the text to be processed includes: obtaining a word sequence set corresponding to the global word sequence; determining, in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset by recursion, wherein the language model probability corresponding to the optimal subset punctuation addition result is optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set.
Optionally, a subset of the word sequence set includes the first i continuous words of the text to be processed, where 0 < i ≤ M and M is the number of words included in the text to be processed; the determining, in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset by recursion then includes: adding punctuation marks between adjacent words in the first i continuous words according to the target punctuation marks of the optimal subset punctuation addition result corresponding to the first k continuous words, to obtain at least one subset punctuation addition path corresponding to the first i continuous words, wherein 0 < k < i and k is a positive integer; determining, by using a neural network language model, the language model probability of the first semantic segment corresponding to a subset punctuation addition path; selecting, according to the language model probabilities of the first semantic segments, the optimal subset punctuation addition path whose language model probability is optimal from the at least one subset punctuation addition path; and obtaining the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i continuous words according to the punctuation marks included in the optimal subset punctuation addition path.
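The prefix-subset recursion just described can be sketched as follows. This is an illustrative reconstruction, not code from the patent: `prefix_dp` and `log_prob` are assumed names, `log_prob` stands in for the neural network language model, and a candidate mark (or no mark, `""`) is placed after each segment of words:

```python
def prefix_dp(words, marks, log_prob):
    """Prefix-subset dynamic programming sketch: best[i] holds the optimal
    punctuation addition result for the first i words; a larger prefix
    reuses the stored result of a smaller prefix k, so each subset is
    solved only once."""
    best = {0: (0.0, [])}
    for i in range(1, len(words) + 1):
        candidates = []
        for k in range(i):                 # reuse the optimal result of prefix k
            prev_score, prev_tokens = best[k]
            segment = words[k:i]
            for m in marks + [""]:         # candidate mark after the segment
                seg = tuple(segment) + ((m,) if m else ())
                candidates.append((prev_score + log_prob(seg),
                                   prev_tokens + segment + ([m] if m else [])))
        best[i] = max(candidates, key=lambda c: c[0])
    return best[len(words)][1]
```

With a toy scoring function the recursion recovers the single best-scoring path, since every suboptimal prefix is pruned as soon as a larger prefix is solved.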
Optionally, the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain the optimal punctuation addition result corresponding to the text to be processed includes: adding punctuation marks between adjacent words in the global word sequence, to obtain a global punctuation addition path corresponding to the global word sequence; obtaining local punctuation addition paths and their corresponding second semantic segments from the global punctuation addition path by moving in front-to-back order, wherein different second semantic segments include the same number of character units, adjacent second semantic segments have repeated character units, and a character unit includes a word and/or a punctuation mark; determining, in front-to-back order, the target punctuation marks corresponding to the optimal second semantic segments by recursion, wherein the language model probability corresponding to an optimal second semantic segment is optimal; and obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments.
Optionally, the determining, in front-to-back order, the target punctuation marks corresponding to the optimal second semantic segments by recursion includes: determining the language model probability corresponding to a current second semantic segment by using an N-gram language model and/or a neural network language model; selecting the optimal current second semantic segment from multiple current second semantic segments according to the language model probabilities corresponding to the current second semantic segments; using the punctuation marks included in the optimal current second semantic segment as the target punctuation marks corresponding to the optimal current second semantic segment; and obtaining the next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
Optionally, the obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments includes: adding punctuation marks to the global word sequence according to the target punctuation marks corresponding to the optimal second semantic segments, in back-to-front or front-to-back order, to obtain the optimal punctuation addition result corresponding to the text to be processed.
Optionally, the performing punctuation addition processing on the global word sequence to obtain the optimal punctuation addition result corresponding to the text to be processed includes: obtaining multiple punctuation addition results corresponding to the global word sequence; determining the language model probabilities corresponding to the punctuation addition results; and selecting, from the multiple punctuation addition results corresponding to the global word sequence, the punctuation addition result whose language model probability is optimal as the optimal punctuation addition result corresponding to the text to be processed.
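The exhaustive alternative just described admits a direct sketch (illustrative only; `exhaustive_best` is an assumed name, and `score` stands in for the language model that determines the probability of a whole punctuation addition result):

```python
from itertools import product

def exhaustive_best(words, marks, score):
    """Brute-force counterpart of the dynamic programming schemes:
    enumerate every punctuation addition result (one choice per gap
    between adjacent words, "" meaning no mark), score each whole result
    with a language model, and keep the optimal one."""
    best, best_score = None, float("-inf")
    for choice in product(marks + [""], repeat=len(words) - 1):
        tokens = [words[0]]
        for mark, word in zip(choice, words[1:]):
            if mark:
                tokens.append(mark)
            tokens.append(word)
        s = score(tuple(tokens))
        if s > best_score:
            best, best_score = tokens, s
    return best
```

The number of candidates grows exponentially in the number of gaps, which is precisely the cost that the dynamic programming schemes of the preceding paragraphs avoid.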
After considering the specification and practicing the invention disclosed here, those skilled in the art will readily conceive of other embodiments of the present invention. The present invention is intended to cover any variations, uses, or adaptive changes of the present invention, which follow the general principles of the present invention and include common knowledge or customary technical means in the art not disclosed in this disclosure. The specification and the embodiments are to be regarded as illustrative only, and the true scope and spirit of the present invention are pointed out by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the present invention is limited only by the appended claims.
The above descriptions are merely preferred embodiments of the present invention, and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
A processing method, a processing device, and a device for processing provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the descriptions of the above embodiments are merely intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In conclusion, the content of this specification should not be construed as a limitation on the present invention.

Claims (10)

1. A processing method, characterized by comprising:
obtaining a text to be processed;
segmenting the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
performing punctuation addition processing on the global word sequence, to obtain an optimal punctuation addition result corresponding to the text to be processed; wherein the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, a language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result comprises at least one semantic segment, the semantic segment comprising: continuous words of the global word sequence, and/or, continuous words to which punctuation marks are added; and
outputting the optimal punctuation addition result.
2. The method according to claim 1, characterized in that the performing punctuation addition processing on the global word sequence to obtain the optimal punctuation addition result corresponding to the text to be processed comprises:
performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm, to obtain the optimal punctuation addition result corresponding to the text to be processed.
3. The method according to claim 2, characterized in that the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain the optimal punctuation addition result corresponding to the text to be processed comprises:
obtaining a word sequence set corresponding to the global word sequence;
determining, in order of subsets of the word sequence set from small to large, target punctuation marks of an optimal subset punctuation addition result corresponding to each subset by recursion; wherein a language model probability corresponding to the optimal subset punctuation addition result is optimal; and
obtaining the optimal punctuation addition result corresponding to the text to be processed according to the optimal subset punctuation addition results corresponding to the subsets of the word sequence set.
4. The method according to claim 3, characterized in that a subset of the word sequence set comprises: the first i continuous words of the text to be processed, where 0 < i ≤ M, and M is the number of words comprised in the text to be processed; the determining, in order of the subsets of the word sequence set from small to large, the target punctuation marks of the optimal subset punctuation addition result corresponding to each subset by recursion then comprises:
adding punctuation marks between adjacent words in the first i continuous words according to target punctuation marks of an optimal subset punctuation addition result corresponding to the first k continuous words, to obtain at least one subset punctuation addition path corresponding to the first i continuous words; wherein 0 < k < i, and k is a positive integer;
determining, by using a neural network language model, a language model probability of a first semantic segment corresponding to the subset punctuation addition path;
selecting, according to the language model probability of the first semantic segment, an optimal subset punctuation addition path whose language model probability is optimal from the at least one subset punctuation addition path; and
obtaining the target punctuation marks of the optimal subset punctuation addition result corresponding to the first i continuous words according to punctuation marks comprised in the optimal subset punctuation addition path.
5. The method according to claim 2, characterized in that the performing punctuation addition processing on the global word sequence by using a dynamic programming algorithm to obtain the optimal punctuation addition result corresponding to the text to be processed comprises:
adding punctuation marks between adjacent words in the global word sequence, to obtain a global punctuation addition path corresponding to the global word sequence;
obtaining local punctuation addition paths and their corresponding second semantic segments from the global punctuation addition path by moving in front-to-back order; wherein different second semantic segments comprise the same number of character units, adjacent second semantic segments have repeated character units, and a character unit comprises: a word and/or a punctuation mark;
determining, in front-to-back order, target punctuation marks corresponding to optimal second semantic segments by recursion; wherein a language model probability corresponding to an optimal second semantic segment is optimal; and
obtaining the optimal punctuation addition result corresponding to the text to be processed according to the target punctuation marks corresponding to the optimal second semantic segments.
6. The method according to claim 5, wherein the determining, in a front-to-back order and in a recursive manner, the target punctuation marks corresponding to the optimal second semantic segments comprises:
determining, by using an N-gram language model and/or a neural network language model, a language model probability corresponding to a current second semantic segment;
selecting, according to the language model probabilities corresponding to the current second semantic segments, an optimal current second semantic segment from multiple current second semantic segments;
using the punctuation marks included in the optimal current second semantic segment as the target punctuation marks corresponding to the optimal current second semantic segment;
obtaining a next second semantic segment according to the target punctuation marks corresponding to the optimal current second semantic segment.
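Claims 5 and 6 together describe a fixed-size window of character units (words plus already-chosen punctuation marks) that slides front to back, with the best-scoring window selected at each step and carried into the next. The following is an illustrative Python sketch only; `toy_segment_score` is a hypothetical stand-in for the N-gram or neural language model, and the greedy recursion is a simplification of the dynamic programming described in the patent:

```python
PUNCT_OPTIONS = ["", ",", "."]
WINDOW = 5  # character units per "second semantic segment"


def toy_segment_score(units):
    """Stand-in for an N-gram / neural LM score over a window of units.

    Toy preference: punctuation reads better after longer words; a mark
    after a short unit (including another mark) is penalised.
    """
    score = 0.0
    for a, b in zip(units, units[1:]):
        if b in (",", "."):
            score += 1.0 if len(a) > 4 else -1.0
    return score


def sliding_window_punctuate(words):
    """Greedy front-to-back recursion: after emitting each word, pick the
    punctuation choice whose trailing WINDOW-unit segment scores best."""
    units = []
    for word in words:
        units.append(word)
        best_mark, best_score = "", float("-inf")
        for mark in PUNCT_OPTIONS:
            candidate = units + ([mark] if mark else [])
            segment = candidate[-WINDOW:]  # current second semantic segment
            score = toy_segment_score(segment)
            if score > best_score:
                best_mark, best_score = mark, score
        if best_mark:
            units.append(best_mark)
    return units
```

Because consecutive windows overlap in all but their newest units (the overlapping character units of the claim), each decision is scored in the context of the punctuation already committed, at constant cost per step.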
7. The method according to claim 5, wherein the obtaining, according to the target punctuation marks corresponding to the optimal second semantic segments, the optimal punctuation addition result corresponding to the text to be processed comprises:
adding punctuation marks to the global word sequence, in a back-to-front order or a front-to-back order, according to the target punctuation marks corresponding to the optimal second semantic segments, to obtain the optimal punctuation addition result corresponding to the text to be processed.
8. The method according to claim 1, wherein the performing punctuation addition processing on the global word sequence, to obtain the optimal punctuation addition result corresponding to the text to be processed, comprises:
obtaining multiple punctuation addition results corresponding to the global word sequence;
determining language model probabilities corresponding to the punctuation addition results;
selecting, from the multiple punctuation addition results corresponding to the global word sequence, the punctuation addition result with the optimal language model probability as the optimal punctuation addition result corresponding to the text to be processed.
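Claim 8 describes the exhaustive variant: enumerate every punctuation addition result for the global word sequence, score each with a language model, and keep the argmax. A minimal illustrative Python sketch, in which `toy_lm_probability` is a hypothetical stand-in for a real language model (not disclosed by the patent):

```python
from itertools import product

PUNCTUATION = ["", ",", "."]  # "" means no punctuation mark after a word


def toy_lm_probability(result):
    """Stand-in for a language model probability of a punctuated result;
    here, simply higher when exactly one sentence-final period appears."""
    periods = result.count(".")
    commas = result.count(",")
    return 1.0 / (1.0 + abs(periods - 1) + commas)


def best_punctuation_result(words):
    """Claim-8 style exhaustive search: enumerate every punctuation
    addition result for the global word sequence and keep the one with
    the optimal language model probability."""
    best, best_p = None, -1.0
    # One punctuation slot after each word (the last slot ends the text).
    for marks in product(PUNCTUATION, repeat=len(words)):
        result = " ".join(word + mark for word, mark in zip(words, marks))
        p = toy_lm_probability(result)
        if p > best_p:
            best, best_p = result, p
    return best
```

The enumeration grows exponentially in the number of words (3^n candidates here), which is why claims 2 through 7 introduce the recursive and dynamic programming variants that prune the search.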
9. A processing apparatus, comprising:
a text acquisition module, configured to obtain a text to be processed;
a word segmentation module, configured to segment the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
a punctuation addition processing module, configured to perform punctuation addition processing on the global word sequence, to obtain an optimal punctuation addition result corresponding to the text to be processed; wherein the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result includes at least one semantic segment, the semantic segment including: consecutive words of the global word sequence, and/or consecutive words to which punctuation marks have been added; and
a result output module, configured to output the optimal punctuation addition result.
10. An apparatus for processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
obtaining a text to be processed;
segmenting the text to be processed, to obtain a global word sequence corresponding to the text to be processed;
performing punctuation addition processing on the global word sequence, to obtain an optimal punctuation addition result corresponding to the text to be processed; wherein the punctuation addition processing adds target punctuation marks between adjacent words in the global word sequence, the language model probability corresponding to the optimal punctuation addition result is optimal, and the optimal punctuation addition result includes at least one semantic segment, the semantic segment including: consecutive words of the global word sequence, and/or consecutive words to which punctuation marks have been added;
outputting the optimal punctuation addition result.
CN201710162165.2A 2017-03-17 2017-03-17 Processing method and device for processing Active CN108628813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710162165.2A CN108628813B (en) 2017-03-17 2017-03-17 Processing method and device for processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710162165.2A CN108628813B (en) 2017-03-17 2017-03-17 Processing method and device for processing

Publications (2)

Publication Number Publication Date
CN108628813A true CN108628813A (en) 2018-10-09
CN108628813B CN108628813B (en) 2022-09-23

Family

ID=63686639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710162165.2A Active CN108628813B (en) 2017-03-17 2017-03-17 Processing method and device for processing

Country Status (1)

Country Link
CN (1) CN108628813B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070162272A1 (en) * 2004-01-16 2007-07-12 Nec Corporation Text-processing method, program, program recording medium, and device thereof
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on voice identification
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109410949A (en) * 2018-10-11 2019-03-01 厦门大学 Content of text based on weighted finite state converter adds punctuate method
CN109410949B (en) * 2018-10-11 2021-11-16 厦门大学 Text content punctuation adding method based on weighted finite state converter
CN111046649A (en) * 2019-11-22 2020-04-21 北京捷通华声科技股份有限公司 Text segmentation method and device
CN110908583A (en) * 2019-11-29 2020-03-24 维沃移动通信有限公司 Symbol display method and electronic equipment
CN111241810A (en) * 2020-01-16 2020-06-05 百度在线网络技术(北京)有限公司 Punctuation prediction method and device
CN112685996A (en) * 2020-12-23 2021-04-20 北京有竹居网络技术有限公司 Text punctuation prediction method and device, readable medium and electronic equipment
CN112685996B (en) * 2020-12-23 2024-03-22 北京有竹居网络技术有限公司 Text punctuation prediction method and device, readable medium and electronic equipment
CN113053390A (en) * 2021-03-22 2021-06-29 北京儒博科技有限公司 Text processing method and device based on voice recognition, electronic equipment and medium

Also Published As

Publication number Publication date
CN108628813B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN107291690A (en) Punctuate adding method and device, the device added for punctuate
CN108628813A (en) Treating method and apparatus, the device for processing
CN107221330A (en) Punctuate adding method and device, the device added for punctuate
CN110503942A (en) A kind of voice driven animation method and device based on artificial intelligence
CN107632980A (en) Voice translation method and device, the device for voiced translation
CN108363706A (en) The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue
CN110853617B (en) Model training method, language identification method, device and equipment
CN107992812A (en) A kind of lip reading recognition methods and device
EP3852044A1 (en) Method and device for commenting on multimedia resource
CN111368541B (en) Named entity identification method and device
CN107291704B (en) Processing method and device for processing
CN110781305A (en) Text classification method and device based on classification model and model training method
CN107992485A (en) A kind of simultaneous interpretation method and device
CN108628819A (en) Treating method and apparatus, the device for processing
CN107274903A (en) Text handling method and device, the device for text-processing
CN110097890A (en) A kind of method of speech processing, device and the device for speech processes
CN113362812A (en) Voice recognition method and device and electronic equipment
CN107564526A (en) Processing method, device and machine readable media
CN108073572A (en) Information processing method and its device, simultaneous interpretation system
CN109002184A (en) A kind of association method and device of input method candidate word
CN109977426A (en) A kind of training method of translation model, device and machine readable media
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN108241690A (en) A kind of data processing method and device, a kind of device for data processing
CN111583919A (en) Information processing method, device and storage medium
WO2019101099A1 (en) Video program identification method and device, terminal, system, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant