CN107247706B - Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment - Google Patents


Info

Publication number
CN107247706B
CN107247706B (application CN201710458179.9A)
Authority
CN
China
Prior art keywords
sentence
information
word
breaking
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710458179.9A
Other languages
Chinese (zh)
Other versions
CN107247706A (en)
Inventor
谢瑜
张昊
朱频频
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Shanghai Xiaoi Robot Technology Co Ltd
China Electronics Standardization Institute
Original Assignee
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Shanghai Xiaoi Robot Technology Co Ltd
China Electronics Standardization Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD, Shanghai Xiaoi Robot Technology Co Ltd, China Electronics Standardization Institute filed Critical BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Priority: CN201710458179.9A
Publication of CN107247706A
Application granted
Publication of CN107247706B
Legal status: Active
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text sentence-breaking model establishing method, a sentence-breaking method, a sentence-breaking device and computer equipment. The model establishing method comprises the following steps: performing word segmentation on a training corpus to obtain the words corresponding to the training corpus; adding feature information to the words, the feature information comprising pause information; and training the words corresponding to the training corpus based on their feature information with a conditional random field algorithm to obtain the text sentence-breaking model. Corresponding to the method, the invention also provides a sentence-breaking method, a sentence-breaking device and computer equipment.

Description

Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment
Technical Field
The invention relates to the technical field of intelligent interaction, and in particular to a method and a device for establishing a text sentence-breaking model.
Background
At present, communication via voice is increasingly common. After voice content is stored in text form, the stored text usually has no punctuation marks, and often no pause information at all, which creates a barrier to reading and understanding the stored text.
Disclosure of Invention
The invention provides a method for establishing a text sentence-breaking model that can break sentences more accurately on data without pause information.
According to the above object, the present invention provides a method for establishing a text sentence-breaking model, the method comprising: performing word segmentation on a training corpus to obtain the words corresponding to the training corpus; adding feature information to the words, the feature information comprising pause information; and training the words corresponding to the training corpus based on their feature information with a conditional random field algorithm to obtain the text sentence-breaking model.
In an embodiment, the method further comprises: breaking sentences of test data with the text sentence-breaking model to obtain a sentence-breaking result; judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold; and if not, adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm and retraining on the training corpus, repeatedly, until the accuracy of the sentence-breaking result of the retrained model is greater than or equal to the accuracy threshold, the model so obtained being taken as the final text sentence-breaking model.
Corresponding to the method, the invention also provides a device for establishing the text sentence-breaking model, the device comprising: a word segmentation module for performing word segmentation on a training corpus to obtain the words corresponding to the training corpus; a feature information adding module for adding feature information to the words, the feature information comprising pause information; and a training module for training the words corresponding to the training corpus based on their feature information with a conditional random field algorithm to obtain the text sentence-breaking model.
In one embodiment, the apparatus further comprises: a test module for breaking sentences of test data with the text sentence-breaking model to obtain a sentence-breaking result; and an accuracy judging module for judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold. If the judgment of the accuracy judging module is negative, the apparatus further comprises a parameter adjusting module for adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm until the accuracy of the sentence-breaking result of the model retrained on the training corpus after the adjustment is greater than or equal to the accuracy threshold, the model so obtained being taken as the final text sentence-breaking model.
Applying the conditional random field algorithm to the training of the sentence-breaking model fully exploits the algorithm's advantages, so that the sentence-breaking accuracy of the resulting model is higher.
Drawings
FIG. 1 is a flow chart of one aspect of the text sentence-breaking model establishing method of the present invention;
FIG. 2 is a flow chart of another aspect of the text sentence-breaking model establishing method of the present invention;
FIG. 3 is a flow chart of a method of breaking sentences of voice data;
FIG. 4 is a schematic diagram of a text sentence-breaking model establishing device according to an aspect of the present invention.
Detailed Description
In order to add pause information to text without pause marks, the invention provides a method for establishing a text sentence-breaking model.
In one embodiment, referring to fig. 1, fig. 1 is a flow chart of an aspect of a method for text sentence break model building according to the present invention, the method comprising:
step 101: performing word segmentation on the training corpus to obtain words corresponding to the training corpus;
step 102: adding characteristic information to the word, wherein the characteristic information comprises pause information;
step 103: and training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain a text sentence break model.
The training corpus is text data that at least has pause information. The corpus may carry punctuation marks, which serve as its pause marks; in that case pause information can be labeled based on the punctuation marks together with manual checking. The corpus may also carry no punctuation marks, in which case pause information is labeled by manual checking alone. That is, the sentence-breaking model must be trained with data that already carries sentence-breaking information: the regularities of the pause information in the training corpus are found, and the sentence-breaking model is built from them.
Because the sentences and words used in different fields have certain regularity, sentence-breaking models corresponding to the fields can be trained respectively aiming at the different fields, for example, different sentence-breaking models can be respectively established for the fields of telecom customer service, military, finance, science and technology and the like.
Because words are the basic units of which text is composed, and no pause information appears inside a word, step 101 performs word segmentation on the training corpus so that the text to be sentence-broken is converted into the corresponding words.
Words have many attributes, i.e. features, such as part of speech, semantics, sentence component (e.g. subject, predicate, object, etc.), and so on. These attributes are usually related to whether a word sits at a pause in a sentence, and by considering them together with the relative positions of the words, it can be derived which words should sit at the pauses of a sentence.
For example, in the sentence "Shanghai GM recently formally signed a procurement agreement worth $306 million with General Motors (US), for purchasing the latter's complete vehicles and parts," the words "agreement" and "parts" at the pauses are both nouns; that is, under certain conditions, pause information is highly likely to appear after a noun.
Step 102 is executed to add feature information to the words, the feature information comprising pause information. The sentence-breaking model is trained on the existing pause information of the words of the training corpus: the pause regularities of the words in the corpus that already carries pause information are found, and the sentence-breaking model is built from them.
Preferably, the sentence-breaking model is trained based on the pause information of the training corpus.
Preferably, the feature information further includes position information of the word and part-of-speech information of the word.
Preferably, the characteristic information further includes sentence component information.
The step of adding feature information to the words may further comprise adding sentence component information to the words. In particular, sentence component information may be added to a word by syntactic parsing.
In one embodiment, the pause information of a word in the training corpus that immediately precedes a pause symbol is marked with a first mark, and the pause information of every other word is marked with a second mark.
For example, take the sentence "Shanghai GM recently formally signed a procurement agreement worth $306 million with General Motors (US), for purchasing the latter's complete vehicles and parts." The results of word segmentation and of adding pause information are shown in Table 1.

TABLE 1 Word segmentation and added pause information

Word               Pause info    Word                 Pause info
Shanghai GM        S             's                   S
recently           S             procurement          S
formally           S             agreement            E
with               S             for                  S
US                 S             purchasing           S
General Motors     S             the latter           S
signed             S             's                   S
an amount of       S             complete vehicles    S
reaching           S             and                  S
$306 million       S             parts                E

where S indicates that the word is not at a pause of the text, and E indicates that the word is at a pause of the sentence.
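The labeling rule above (E for a word immediately before a pause, S for every other word) can be sketched in a few lines. This is a hypothetical illustration, not code from the patent: the input is assumed to be already-segmented words interleaved with a pause symbol, and the function name is invented.

```python
# Hypothetical sketch of how Table 1 could be produced: given a segmented
# sentence whose pauses are marked by a pause symbol ("|" here), label the
# word immediately before each pause "E" and every other word "S".

PAUSE = "|"

def add_pause_labels(tokens):
    """tokens: segmented words interleaved with PAUSE symbols."""
    labeled = []
    for i, tok in enumerate(tokens):
        if tok == PAUSE:
            continue
        # A word is at a pause if the next token is the pause symbol.
        label = "E" if i + 1 < len(tokens) and tokens[i + 1] == PAUSE else "S"
        labeled.append((tok, label))
    return labeled

tokens = ["Shanghai GM", "signed", "a", "procurement", "agreement", "|",
          "for", "purchasing", "parts", "|"]
print(add_pause_labels(tokens))
```

Running this yields one (word, mark) pair per word, with "agreement" and "parts" labeled E, mirroring Table 1.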
Preferably, other feature information of the words is considered simultaneously when training the sentence-breaking model; it can be expected that the more feature information of the words is applied, the higher the accuracy of the trained model.
In one embodiment, the feature information added to the words further includes position information of the words and part-of-speech information of the words. The position information of a word is its position relative to the other words, as mentioned above; in the foregoing example, "recently" lies between "Shanghai GM" and "formally". The range covered by the position information can be chosen as needed; for example, one may also consider that the word before "US" is "with", the word after it is "General Motors", the second word before it is "formally", and the second word after it is "signed". The wider the range covered by the position information, the higher the sentence-breaking accuracy of the trained sentence-breaking model.
Each word in table 1 already contains the mutual position information of each word, and the range of the contained position information may be selected as needed during training.
After part-of-speech information is added to the words, the segmentation results carrying both pause information and part-of-speech information are as shown in Table 2.
TABLE 2 word segmentation and addition of pause information and part-of-speech information
[Table 2 appears as images in the original publication; it repeats the words of Table 1, annotated additionally with part-of-speech tags.]
The part of speech denoted by each letter is as follows: noun (n), time word (t), place word (s), direction word (f), numeral (m), measure word (q), distinguishing word (b), pronoun (r), verb (v), adjective (a), state word (z), adverb (d), preposition (p), conjunction (c), auxiliary word (u), modal word (y), interjection (e), onomatopoeia (o), idiom (i), fixed expression (l), abbreviation (j), preceding component (h), following component (k), morpheme (g), non-morpheme character (x), and punctuation mark (w).
Preferably, sentence component information is added to the words at the same time. Sentence component information includes subject, predicate, object, attributive, adverbial, complement, and the like.
In one embodiment, sentence component information is added to a word by parsing the word.
In one embodiment, semantic information is added to a word; the semantic information is determined by the word's own meaning, which may be obtained from its textual form. Different textual forms may share the same meaning, and in one embodiment words with different textual forms but the same meaning are mapped to the same word.
Step 103 can then be executed: based on the attributes of the words and their position information, the words corresponding to the training corpus are trained with the conditional random field algorithm to obtain the text sentence-breaking model. Model training proceeds word by word.
In order to select the attributes of each word and the range of its position information more efficiently when training on each word, in one embodiment the conditional random field algorithm extracts, according to a preset feature template, the words and feature information specified by that template, and trains on the training corpus to obtain the text sentence-breaking model. The extracted feature information at least includes pause information, and the preset feature template designates the words, and their feature information, whose relationship to the current word being trained satisfies a preset requirement.
That is, the feature template specifies the words used for training and the feature information corresponding to those words. In one embodiment, the relationships to the current word that a feature template can represent include any one or more of the following combinations of information: the semantic information and pause information of the current word; the part-of-speech information and pause information of the current word; the semantic and pause information of the previous word combined with the semantic and pause information of the current word; the semantic and pause information of the current word combined with the part-of-speech and pause information of the next word; the part-of-speech and pause information of the previous word combined with the part-of-speech and pause information of the current word; and the part-of-speech and pause information of the previous, current and next words combined.
The semantic information is expressed by the text of the word itself, or, after word segmentation, it can be obtained by matching the current word according to a preset semantic-information matching method, such as matching against preset word vectors, a preset synonym library, or a preset near-synonym library. Accordingly, the semantic information can be represented by the characters themselves, by a word vector, or by the preset synonym or near-synonym library in which it is found.
By adopting the characteristic template, the characteristic information of the word used for training not only comprises pause information, but also comprises position information of the word, part-of-speech information of the word and the like.
That is, one feature template may specify training that considers only the semantic information of the current word, another may specify training that considers the part-of-speech information of the current word, and yet another may specify training that considers the combination of the part-of-speech information of the previous, current and next words. Of course, the feature templates are not limited to the above list, and various training scopes fall within the scope of the present invention.
A textual representation of feature templates is given below (each feature template extracts, by default, the pause information of the words involved):

U03:%x[0,0]                      # semantics of the current word
U04:%x[0,1]                      # part of speech of the current word
U05:%x[-1,0]/%x[0,0]             # semantics of the previous word / semantics of the current word
U06:%x[0,0]/%x[1,1]              # semantics of the current word / part of speech of the next word
U20:%x[-2,1]/%x[-1,1]/%x[0,1]    # parts of speech of the second-previous, previous and current words
U24:%x[-1,1]/%x[0,1]/%x[1,1]     # parts of speech of the previous, current and next words

The code %x[0,0] in template U03 means the model is trained with the semantic information of the current word, the code %x[0,1] in template U04 means the model is trained with the part-of-speech information of the current word, and the meanings of the other codes follow by analogy.
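As an illustration of how such templates operate, the sketch below expands a CRF++-style %x[row,col] reference over a segmented sentence, with column 0 holding a word's semantics and column 1 its part of speech, matching the templates above. The helper name and the boundary marker are assumptions for this sketch, not part of the patent.

```python
# Minimal sketch of CRF++-style feature-template expansion. %x[r,c] picks
# column c of the word r positions away from the current word; positions
# outside the sentence fall back to a boundary marker.

import re

def expand_template(template, sentence, i):
    """Expand one template, e.g. 'U24:%x[-1,1]/%x[0,1]/%x[1,1]', at position i."""
    def pick(match):
        row, col = int(match.group(1)), int(match.group(2))
        j = i + row
        if 0 <= j < len(sentence):
            return sentence[j][col]
        return "_B_"  # boundary padding outside the sentence

    return re.sub(r"%x\[(-?\d+),(\d+)\]", pick, template)

# (semantics, part of speech) pairs for a fragment of the running example
sentence = [("signed", "v"), ("agreement", "n"), ("for", "p")]
print(expand_template("U24:%x[-1,1]/%x[0,1]/%x[1,1]", sentence, 1))  # → U24:v/n/p
```

Each expanded string like "U24:v/n/p" becomes one feature of the current word; the CRF learns a weight linking such features to the S/E pause labels.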
When training the model, one feature template may be selected, or several feature templates may be used at the same time. The more feature templates applied, the more words and corresponding feature information are considered, the better the training effect, and the higher the sentence-breaking accuracy of the trained model.
To further improve the accuracy of the trained sentence break model, in one embodiment, please refer to fig. 2, fig. 2 shows a flowchart of another aspect of a method for building a text sentence break model according to the present invention, the method includes:
step 201: performing word segmentation on the training corpus to obtain words corresponding to the training corpus;
step 202: adding characteristic information to the word, wherein the characteristic information comprises pause information;
step 203: and training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain a text sentence break model.
Step 204: using a text sentence-breaking model to break sentences of the test data to obtain sentence-breaking results;
step 205: judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold; if not, go to step 206, and if so, go to step 208.
Step 206: adjusting a characteristic information time threshold parameter and/or a fitting parameter of the conditional random field algorithm;
step 207: judging whether the accuracy of the sentence-breaking result of the text sentence-breaking model retrained after the adjustment is greater than or equal to the accuracy threshold; if not, returning to step 206; if yes, proceeding to step 208;
step 208: end.
After the sentence-breaking model trained on the training corpus is obtained, it is tested with test data; that is, its performance is tested with data whose pause information is known.
For example, suppose the sentence "Shanghai GM recently formally signed a procurement agreement worth $306 million with General Motors (US), for purchasing the latter's complete vehicles and parts." is known together with its pause information. The same text with its pauses removed is used as test data and input into the trained sentence-breaking model; the model outputs a sentence-breaking result for the test data, and comparing that result with the known one allows the accuracy of the sentence-breaking model to be judged.
When the accuracy is greater than or equal to the accuracy threshold, the accuracy of the sentence-breaking model is considered high enough, and the model can be used to break sentences of other texts.
Of course, the testing process can be repeated, the overall accuracy across the tests measured statistically, and a judgment made as to whether the overall accuracy meets the requirement.
If the accuracy obtained in the test does not reach the accuracy threshold, the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm is adjusted and the model retrained on the training corpus, repeatedly, until the accuracy of the retrained model's sentence-breaking result is greater than or equal to the accuracy threshold; the model so obtained is taken as the final text sentence-breaking model.
The feature information frequency threshold parameter is a threshold on the number of times an interrelationship formed by the feature information of words must occur in the training corpus.
For example, when the training corpus is trained with the feature template "the combination of the parts of speech of the previous, current and next words," an interrelationship of feature information such as "noun adverb noun" denotes the feature-information relationship of every word whose own part of speech is adverb and whose previous and next words are nouns; the frequency threshold parameter then requires such a relationship to occur at least a given number of times in the corpus before it is used.
The fitting parameter is a hyper-parameter of the conditional random field algorithm that balances over-fitting against under-fitting: the larger its value, the more closely the algorithm fits the training data.
Based on experimental experience, in one embodiment the feature information frequency threshold parameter of the conditional random field algorithm is adjusted in the range of 1 to 5, and its fitting parameter is adjusted in the range of 1 to 3.
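The adjustment loop of steps 205 to 207 can be sketched as a small grid search over the two parameter ranges just mentioned. Everything here is illustrative: train_model and evaluate stand in for the real CRF training and test runs, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
# Hypothetical sketch of the parameter-adjustment loop: grid-search the
# feature-frequency threshold (1..5) and the fitting parameter (1..3)
# until a trained model reaches the accuracy threshold.

from itertools import product

def tune(train_model, evaluate, accuracy_threshold):
    for freq_threshold, fit_param in product(range(1, 6), range(1, 4)):
        model = train_model(freq_threshold, fit_param)
        if evaluate(model) >= accuracy_threshold:
            return model, freq_threshold, fit_param
    return None  # no setting reached the threshold

# Toy stand-ins: pretend accuracy improves with the fitting parameter.
fake_train = lambda f, c: {"freq": f, "fit": c}
fake_eval = lambda m: 0.80 + 0.05 * m["fit"]

model, f, c = tune(fake_train, fake_eval, 0.90)
print(f, c)
```

In practice the inner calls would retrain the CRF on the training corpus and re-run the test of steps 204 and 205, returning as soon as the accuracy threshold is met.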
In one embodiment, the sentence-breaking model is tested with voice input. In that case, voice recognition is first performed on the voice test data to convert it into text data, i.e. a voice data text, and then the sentence-breaking model is tested with the voice data text.
Preferably, pause symbols are added to the voice data text. Since this is test data, the pause symbols can be added manually, and any symbol that marks a pause can be used.
With the pause symbols in place, pause information can be added to the voice data text: wherever a pause symbol appears is a pause.
A text with pause information obtained in this way can then be used to test the sentence-breaking accuracy of the sentence-breaking model.
If the sentence-breaking model is tested with voice data, the preceding processing comprises: performing voice recognition on the voice test data to obtain a voice data text; performing word segmentation on the voice data text to obtain the corresponding words; adding pause symbols to the voice data text; and adding pause information to the voice data text based on the pause symbols.
In one embodiment, the word segmentation operation may be performed with a word segmentation dictionary. A large number of words are recorded in the dictionary; the text to be segmented is compared with the words in the dictionary, and whenever characters in the text match a word present in the dictionary, those characters are taken as a word.
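One common concrete realization of this dictionary comparison is forward maximum matching. The sketch below is a minimal, hypothetical version: the dictionary entries are illustrative, and characters that match no dictionary word fall back to single-character words.

```python
# Minimal sketch of dictionary-based word segmentation by forward maximum
# matching: at each position, take the longest dictionary word that matches.

def forward_max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in dictionary:
                words.append(chunk)
                i += size
                break
    return words

dictionary = {"采购", "协议", "整车", "零部件"}
print(forward_max_match("采购整车和零部件", dictionary))
```

On the fragment above this yields 采购 / 整车 / 和 / 零部件, i.e. "procurement / complete vehicles / and / parts"; adding newly discovered words to the dictionary, as described below, directly improves such matching.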
In one embodiment, new words are discovered in the training corpus and added to the word segmentation dictionary. When the corpus is found to contain words not in the dictionary, they can be identified manually or by a new-word discovery method and then added to the dictionary.
In one embodiment, breaking sentences of test data with the text sentence-breaking model comprises: breaking sentences of the test data with the model to obtain several primary sentence-breaking results; computing the total sentence probability of each primary result with an n-gram language model trained on standard sentence-broken language data; and taking the primary result with the highest total sentence probability as the sentence-breaking result.
Because the conditional random field algorithm is probability-based, the sentence-breaking model trained with it can output several sentence-breaking results ranked by their sentence-breaking probabilities.
The total sentence probability of each primary sentence-breaking result is then computed with an n-gram language model trained on standard sentence-broken language data, and the primary result with the highest total probability is taken as the sentence-breaking result. Determining the final result through this multi-level screening improves sentence-breaking accuracy.
The n-gram language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, and that the probability of a complete sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by counting, directly in the corpus, the number of times n words occur together.
In one embodiment, the total sentence probability of the primary sentence-break result is the product of the sentence-forming probabilities of the clauses of the primary sentence-break result.
Turning to fig. 3, fig. 3 shows a flow chart of a method of sentence-breaking voice data.
Step 301: carrying out voice recognition on voice data to generate a voice data text;
step 302: inputting the voice data text into a sentence-breaking model to obtain a plurality of primary sentence-breaking results;
step 303: and inputting the plurality of primary sentence-breaking results into the n-gram language model, judging the total sentence probability of each primary sentence-breaking result, and taking the primary sentence-breaking result with the highest total sentence probability as a final sentence-breaking result.
Because the sentence-breaking model is trained with the conditional random field algorithm, which is probability-based, inputting the voice data text into the model in step 302 yields several primary sentence-breaking results ordered from high to low probability of being correct.
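Steps 302 and 303 can be sketched with a toy bigram (n = 2) model: count bigrams on sentence-broken corpus data, score each candidate result's clauses as products of bigram probabilities, take the candidate's total probability as the product over its clauses, and keep the highest. The corpus, tokens, and function names below are all illustrative.

```python
# Hypothetical sketch of n-gram reranking of candidate sentence-break results.

from collections import Counter
from functools import reduce

corpus = [["we", "signed", "the", "agreement"],
          ["we", "purchased", "the", "parts"]]

# Bigram and unigram counts with sentence-boundary markers.
bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

def clause_prob(clause):
    """P(clause) as the product of bigram probabilities (0 if unseen)."""
    toks = ["<s>"] + clause + ["</s>"]
    p = 1.0
    for a, b in zip(toks[:-1], toks[1:]):
        p *= bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
    return p

def total_prob(candidate):
    """Total probability of one sentence-break result: product over clauses."""
    return reduce(lambda x, y: x * y, (clause_prob(c) for c in candidate), 1.0)

candidates = [
    [["we", "signed", "the", "agreement"]],    # one clause
    [["we", "signed"], ["the", "agreement"]],  # broken in the middle
]
best = max(candidates, key=total_prob)
print(best)
```

Here the candidate broken mid-phrase scores zero (its clauses contain bigrams never seen at a sentence boundary), so the unbroken reading wins, which is exactly the screening step 303 performs on the model's primary results.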
The invention also provides a sentence-breaking method, which first obtains the text to be sentence-broken.
Specifically, the method comprises the following steps: acquiring voice data of a sentence to be punctuated;
and carrying out voice recognition on the voice data of the sentence to be punctuated, and taking a recognition result as the text of the sentence to be punctuated.
The text to be broken is then input into any of the trained sentence-breaking models described above, thereby completing the sentence breaking of the voice data.
The invention also provides computer equipment comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes any one of the text sentence break model establishing methods.
The invention also provides a computer storage medium, wherein the storage medium is stored with instructions, and the instructions execute any one of the text sentence break model establishing methods when running.
In view of the foregoing method, the present invention further provides a device for building a text sentence-breaking model, and fig. 4 shows a structural diagram of a device for building a text sentence-breaking model according to an aspect of the present invention.
The device comprises: a word segmentation module 401, configured to perform word segmentation on the training corpus to obtain words corresponding to the training corpus; a feature information adding module 402, configured to add feature information to the word, where the feature information includes pause information; the training module 403 is configured to train, by using a conditional random field algorithm, words corresponding to the training corpus based on the feature information of the words to obtain a text sentence-breaking model.
Because words are the basic units of a text and a sentence break never falls inside a word, the word segmentation module 401 performs word segmentation on the training corpus, converting the text that needs sentence breaking into its corresponding words.
The feature information adding module 402 adds feature information to the words, the feature information including pause information. The sentence-breaking model is trained with the pause information already present in the words of the training corpus; that is, the pause patterns of the words in the pause-annotated training corpus are learned, and the sentence-breaking model is built from those patterns.
After the sentence-breaking model trained on the training corpus is obtained, it is tested with test data; that is, the performance of the sentence-breaking model is evaluated on data whose pause information is known.
More preferably, the apparatus further comprises: the test module 404 is configured to perform sentence breaking on the test data by using a text sentence breaking model to obtain a sentence breaking result; an accuracy determining module 405, configured to determine whether an accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold; and a parameter adjusting module 406, configured to, if the determination of the accuracy determining module 405 is negative, adjust the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm until the accuracy of the sentence breaking result of the sentence breaking of the training corpus by the text sentence breaking model obtained through training after the feature information frequency threshold parameter and/or the fitting parameter is adjusted is greater than or equal to an accuracy threshold, and then use the text sentence breaking model obtained through training after the adjustment as a final text sentence breaking model.
In an embodiment, the test data is voice test data, and the test module further includes: a voice recognition module, configured to perform voice recognition on the voice test data to obtain a voice data text; and a sentence-breaking module, configured to perform sentence breaking on the voice data text by using the text sentence-breaking model to obtain a sentence-breaking result.
The recognized voice test data may lack pause information. In an embodiment, the accuracy determining module further includes: a pause symbol adding module, configured to add pause symbols to the voice data text; a pause information adding module, configured to add pause information to the voice data text based on the pause symbols; a calculation module, configured to calculate the accuracy of the sentence-breaking result based on the pause information of the voice data text; and a judging module, configured to judge whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold.
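One plausible reading of the calculation module, assuming accuracy is measured per word over break/no-break labels (the patent does not fix the exact formula, so this is an illustrative sketch), is:

```python
def break_accuracy(predicted_labels, reference_labels):
    """Per-word accuracy of predicted break labels ('1' = a break
    follows this word, '0' = no break) against reference labels
    derived from the pause symbols added to the voice data text."""
    assert len(predicted_labels) == len(reference_labels)
    correct = sum(p == r for p, r in zip(predicted_labels, reference_labels))
    return correct / len(reference_labels)

ref  = ["0", "1", "0", "0", "1"]   # gold pause information
pred = ["0", "1", "0", "1", "1"]   # model output: 4 of 5 labels match
print(break_accuracy(pred, ref))   # 0.8
```

The judging module would then simply compare this value against the accuracy threshold.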
In an embodiment in which the training corpus is a speech corpus, the apparatus further includes: a voice recognition module, configured to perform voice recognition on the speech corpus to obtain a voice data text. The word segmentation module is further configured to perform word segmentation on the voice data text to obtain words corresponding to the voice data text. The feature information adding module is further configured to add pause symbols to the voice data text, and to add pause information to the voice data text based on the pause symbols.
In one embodiment, the feature information further includes: position information of the word and part-of-speech information of the word.
In an embodiment, the characteristic information adding module is further configured to: sentence component information is added to the words.
In an embodiment, the characteristic information adding module is further configured to: sentence component information is added to the word by parsing the word.
The sentence component information includes subjects, predicates, objects, attributives, adverbials, complements, and the like.
To select the attributes of each word and the range of its position information more efficiently during training, in one embodiment the training module is further configured to: use the conditional random field algorithm to extract, according to a preset feature template, the words and their feature information corresponding to the template, so as to train on the training corpus and obtain the text sentence-breaking model, where the preset feature template represents the words, and their feature information, whose relation to the current word being trained meets preset requirements.
In one embodiment, the relationship that the feature template represents with respect to the current word being trained includes any one or more of the following combinations of information: the semantic information and pause information of the current word; the part-of-speech information and pause information of the current word; the semantic information and pause information of the previous word together with those of the current word; the semantic information and pause information of the current word together with the part-of-speech information and pause information of the next word; the part-of-speech information and pause information of the previous word together with those of the current word; and the part-of-speech information and pause information of the previous word, the current word, and the next word.
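Such a template can be realized as a feature-extraction function over a context window. The field names and the toy Chinese example below are illustrative assumptions, not the patent's actual template syntax:

```python
def template_features(words, pos, sem, pause, i):
    """Features for the current word i, covering the combinations the
    template may represent: the current word's semantic, part-of-speech,
    and pause information, plus the same information for the previous
    and next words when they exist."""
    f = {"sem": sem[i], "pos": pos[i], "pause": pause[i]}
    if i > 0:                      # previous-word combinations
        f.update({"-1:sem": sem[i-1], "-1:pos": pos[i-1], "-1:pause": pause[i-1]})
    if i < len(words) - 1:         # next-word combinations
        f.update({"+1:pos": pos[i+1], "+1:pause": pause[i+1]})
    return f

words = ["今天", "天气", "很好"]
pos   = ["NT", "NN", "VA"]           # illustrative part-of-speech tags
sem   = ["time", "weather", "good"]  # illustrative semantic labels
pause = ["0", "0", "1"]              # pause information marks
print(template_features(words, pos, sem, pause, 1))
```

A CRF toolkit would consume one such feature dictionary per word when training the sentence-breaking model.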
In an embodiment, the corpus includes pause symbols for identifying pause information of the corpus, and the feature information adding module is further configured to: adding pause information to the word based on the pause symbol of the corpus.
In an embodiment, the feature information adding module is further configured to: mark the pause information of the word in the training corpus that immediately precedes a pause symbol as a first mark; and mark the pause information of all other words as a second mark.
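A minimal sketch of this marking scheme, assuming for simplicity that the corpus is split into single-character "words" and that '，' is the pause symbol (both assumptions; the patent does not fix them):

```python
def label_pauses(corpus_text, pause_symbol="，"):
    """Strip pause symbols from the corpus and mark the word immediately
    before each pause symbol with the first mark '1'; every other word
    receives the second mark '0'."""
    words, labels = [], []
    for ch in corpus_text:
        if ch == pause_symbol:
            if labels:
                labels[-1] = "1"   # the word before the pause symbol
        else:
            words.append(ch)
            labels.append("0")
    return words, labels

words, labels = label_pauses("你好，再见")
print(list(zip(words, labels)))
# [('你', '0'), ('好', '1'), ('再', '0'), ('见', '0')]
```

The resulting word/label pairs form the supervised training data for the conditional random field.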
In one embodiment, the word segmentation module is further configured to: and performing word segmentation on the training corpus by using a word segmentation dictionary.
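Dictionary-based segmentation is commonly implemented with forward maximum matching; the following sketch is an assumption (the patent does not name a specific algorithm) and uses the classic ambiguous string 研究生命起源 ("study the origin of life") as an example:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in dictionary:
                words.append(cand)
                i += size
                break
    return words

dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", dictionary))
# ['研究生', '命', '起源'] — greedily taking 研究生 splits 生命 apart
```

The example also shows why new word discovery (described next) matters: enriching the dictionary changes which longest matches are available.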
In one embodiment, the apparatus further comprises: and the new word discovery module is used for discovering new words of the training corpus and adding the obtained new words into the word segmentation dictionary.
Based on experimental experience, in one embodiment, the parameter adjustment module is further configured to: and adjusting the characteristic information frequency threshold parameter of the conditional random field algorithm within the numerical range of 1 to 5, and adjusting the fitting parameter of the conditional random field algorithm within the numerical range of 1 to 3.
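The stated ranges suggest a small grid search. The sketch below assumes a `train_and_evaluate` callable standing in for retraining the CRF model and measuring sentence-breaking accuracy on the test data; the patent describes the adjustment only in prose, so names and the toy evaluator are illustrative:

```python
from itertools import product

def tune_crf_params(train_and_evaluate, threshold=0.95):
    """Search the ranges given above: feature information frequency
    threshold in 1..5, fitting parameter in 1..3. Return the first
    pair whose accuracy reaches the accuracy threshold."""
    for freq, fit in product(range(1, 6), range(1, 4)):
        acc = train_and_evaluate(freq, fit)
        if acc >= threshold:
            return freq, fit, acc
    return None  # no setting reached the threshold

# Toy evaluator for illustration: pretends accuracy peaks at freq=2, fit=3.
toy_eval = lambda freq, fit: 0.96 if (freq, fit) == (2, 3) else 0.90
print(tune_crf_params(toy_eval))  # (2, 3, 0.96)
```

In practice each call to `train_and_evaluate` would retrain the model, so the small 5 × 3 grid keeps the adjustment tractable.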
In one embodiment, the test module is further configured to: performing sentence breaking on the test data by using the text sentence breaking model to obtain a plurality of primary sentence breaking results; and respectively calculating the total sentence probability of each primary sentence-breaking result by using an n-gram language model obtained by training language data of standard sentence-breaking, and taking the corresponding primary sentence-breaking result with the highest total sentence probability as the sentence-breaking result.
Because the conditional random field algorithm is probability-based, the sentence-breaking model trained with it can output a plurality of sentence-breaking results ordered by their sentence-breaking probabilities.
At this point, the total sentence probability of each primary sentence-breaking result is calculated with an n-gram language model trained on correctly punctuated corpus data, and the primary sentence-breaking result with the highest total sentence probability is taken as the sentence-breaking result. The final sentence-breaking result is thus determined through multi-level screening, which improves sentence-breaking accuracy.
In one embodiment, the total sentence probability of the primary sentence-breaking result is the product of the sentence-forming probabilities of the clauses of the primary sentence-breaking result.
The specific implementation manner and technical effect of the device for establishing a text sentence-breaking model can refer to the embodiment of the method for establishing a text sentence-breaking model, and are not described herein again.
The invention also provides a sentence-breaking device, comprising: the text acquisition module is used for acquiring a text of the sentence to be broken; and the sentence break module is used for inputting the text to be broken into a text sentence break model to obtain a sentence break result, wherein the text sentence break model is obtained by training by adopting the method for establishing the text sentence break model.
In one embodiment, the text obtaining module includes: a voice obtaining unit, configured to obtain voice data of the sentence to be broken; and a voice recognition unit, configured to perform voice recognition on the voice data of the sentence to be broken and take the recognition result as the text of the sentence to be broken.
The present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the sentence-breaking method.
The invention also provides a computer storage medium, wherein the storage medium is stored with instructions, and the instructions execute the sentence-breaking method when running.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (34)

1. A method for establishing a text sentence-breaking model is characterized by comprising the following steps:
performing word segmentation on a training corpus to obtain words corresponding to the training corpus;
adding characteristic information to the words, wherein the characteristic information comprises pause information;
training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain the text sentence-breaking model; when model training is performed, training is carried out word by word;
the step of adding feature information to the word further comprises:
adding sentence component information to the word by syntactic analysis of the word;
adding semantic information to the word, wherein the semantic information is obtained through the character representation of the word;
the step of training further comprises:
extracting the words and the feature information thereof corresponding to a preset feature template by using a conditional random field algorithm to train the training corpus to obtain the text sentence break model, wherein the extracted feature information at least comprises the pause information, and the preset feature template is used for representing the words and the feature information thereof, the relation of which with the current word trained in the words meets the preset requirement;
the relationship that the feature template represents with the trained current word includes any one or more of the following combinations of information: semantic information of the current word and pause information of the current word; part-of-speech information of the current word and pause information of the current word; semantic information of a previous word, pause information of the previous word, semantic information of a current word and pause information of the current word; semantic information of the current word, pause information of the current word, part of speech information of the next word and pause information of the next word; part of speech information of the previous word, pause information of the previous word, part of speech information of the current word and pause information of the current word; the part of speech information of the previous word, the pause information of the previous word, the part of speech information of the current word, the pause information of the current word, the part of speech information of the next word and the pause information of the next word.
2. The method of claim 1, wherein the method further comprises:
using the text sentence-breaking model to break sentences of the test data to obtain sentence-breaking results;
judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold;
if not, adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm until the accuracy of the sentence-breaking result, on the training corpus, of the text sentence-breaking model obtained by training after adjusting the feature information frequency threshold parameter and/or the fitting parameter is greater than or equal to the accuracy threshold, and taking the text sentence-breaking model obtained by training after the adjustment as the final text sentence-breaking model.
3. The method of claim 2, wherein the test data is speech test data, and wherein the step of using the text sentence-breaking model to sentence the test data further comprises:
carrying out voice recognition on the voice test data to obtain a voice data text;
and carrying out sentence breaking on the voice data text by using the text sentence breaking model to obtain a sentence breaking result.
4. The method of claim 3, wherein the step of determining whether the accuracy of the sentence break result is greater than or equal to an accuracy threshold further comprises:
adding pause symbols to the voice data text;
adding pause information to the voice data text based on the pause symbol;
calculating the accuracy rate of the sentence-breaking result based on the pause information of the voice data text;
and judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold.
5. The method of claim 1, wherein the corpus is speech corpus, the method further comprising:
carrying out voice recognition on the voice test data to obtain a voice data text;
the step of word segmentation further comprises:
performing word segmentation on the voice data text to obtain a word corresponding to the voice data text;
the step of adding the characteristic information further includes:
adding pause symbols to the voice data text;
and adding pause information to the voice data text based on the pause symbol.
6. The method of claim 1, wherein the feature information further comprises: position information of the word and part-of-speech information of the word.
7. The method according to claim 1, wherein the corpus includes pause symbols for identifying pause information of the corpus, said step of adding pause information to the word comprises:
adding pause information to the word based on the pause symbol of the corpus.
8. The method of claim 7, wherein the step of adding pause information for the word further comprises:
marking pause information of the word in the training corpus that immediately precedes the pause symbol as a first mark;
and marking the pause information of other words as second marks.
9. The method of claim 1, wherein the step of tokenizing the corpus further comprises:
and performing word segmentation on the training corpus by using a word segmentation dictionary.
10. The method of claim 9, wherein the method further comprises:
and carrying out new word discovery on the training corpus, and adding the obtained new words into the word segmentation dictionary.
11. The method of claim 2 wherein said step of adjusting a feature information frequency threshold parameter and/or a fitting parameter of said conditional random field algorithm further comprises:
and adjusting the characteristic information frequency threshold parameter of the conditional random field algorithm within the numerical range of 1 to 5, and adjusting the fitting parameter of the conditional random field algorithm within the numerical range of 1 to 3.
12. The method of claim 2, wherein said step of using said text sentence-breaking model to make a sentence-breaking of test data further comprises:
performing sentence breaking on the test data by using the text sentence breaking model to obtain a plurality of primary sentence breaking results;
and respectively calculating the total sentence probability of each primary sentence-breaking result by using an n-gram language model obtained by training language data of standard sentence-breaking, and taking the corresponding primary sentence-breaking result with the highest total sentence probability as the sentence-breaking result.
13. The method of claim 12 wherein the total sentence probability of the primary sentence-breaking result is the product of the sentence-making probabilities of the clauses of the primary sentence-breaking result.
14. A method of sentence punctuation, the method comprising:
obtaining a text of a sentence to be broken;
inputting the text to be broken into a text sentence-breaking model to obtain a sentence-breaking result, wherein the text sentence-breaking model is obtained by training with the method for establishing a text sentence-breaking model according to any one of claims 1 to 13.
15. The method of claim 14, wherein the step of obtaining the text of the sentence to be punctuated further comprises:
acquiring voice data of a sentence to be punctuated;
and carrying out voice recognition on the voice data of the sentence to be punctuated, and taking a recognition result as the text of the sentence to be punctuated.
16. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs a method for establishing a text sentence-breaking model according to any one of claims 1 to 13.
17. A computer storage medium having stored thereon instructions, wherein the instructions when executed perform a method for establishing a text sentence-breaking model according to any one of claims 1 to 13.
18. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor performs a method of sentence-breaking according to claim 14 or 15.
19. A computer storage medium having stored thereon instructions that when executed perform a sentence-breaking method according to claim 14 or 15.
20. An apparatus for text sentence-breaking model building, the apparatus comprising:
the word segmentation module is used for segmenting words of a training corpus to obtain words corresponding to the training corpus;
the characteristic information adding module is used for adding characteristic information to the words, and the characteristic information comprises pause information;
the training module is used for training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain the text sentence-breaking model; when model training is performed, training is carried out word by word;
the characteristic information adding module is further used for:
adding sentence component information to the word by syntactic analysis of the word;
adding semantic information to the word, wherein the semantic information is obtained through the character representation of the word;
the training module is further to:
extracting the words and the feature information thereof corresponding to a preset feature template by using a conditional random field algorithm to train the training corpus to obtain the text sentence break model, wherein the preset feature template is used for representing the words and the feature information thereof which satisfy preset requirements in relation to the trained current words;
the relationship that the feature template represents with the trained current word includes any one or more of the following combinations of information: semantic information of the current word and pause information of the current word; part-of-speech information of the current word and pause information of the current word; the semantic information of the previous word, the pause information of the previous word, the semantic information of the current word and the pause information of the current word; semantic information of the current word, pause information of the current word, part of speech information of the next word and pause information of the next word; part of speech information of the previous word, pause information of the previous word, part of speech information of the current word and pause information of the current word; the part of speech information of the previous word, the pause information of the previous word, the part of speech information of the current word, the pause information of the current word, the part of speech information of the next word and the pause information of the next word.
21. The apparatus of claim 20, wherein the apparatus further comprises:
the test module is used for carrying out sentence breaking on test data by using the text sentence breaking model to obtain a sentence breaking result;
the accuracy judging module is used for judging whether the accuracy of the sentence breaking result is greater than or equal to an accuracy threshold value or not;
and the parameter adjusting module is used for adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm when the accuracy is smaller than the accuracy threshold, until the accuracy of the sentence-breaking result, on the training corpus, of the text sentence-breaking model obtained by training after the feature information frequency threshold parameter and/or the fitting parameter is adjusted is greater than or equal to the accuracy threshold, and then taking the text sentence-breaking model obtained by training after the adjustment as the final text sentence-breaking model.
22. The apparatus of claim 21, wherein the test data is voice test data, the test module further comprising:
the voice recognition module is used for carrying out voice recognition on the voice test data to obtain a voice data text;
a sentence-breaking module, configured to perform sentence breaking on the voice data text by using the text sentence-breaking model to obtain a sentence-breaking result.
23. The apparatus of claim 22, wherein the accuracy determination module further comprises:
the pause symbol adding module is used for adding pause symbols for the voice data texts;
the pause information adding module is used for adding pause information to the voice data text based on the pause symbol;
the calculation module is used for calculating the accuracy of the sentence-breaking result based on the pause information of the voice data text;
and the judging module is used for judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold.
24. The apparatus of claim 20, wherein the corpus is speech corpus, the apparatus further comprising:
the voice recognition module is used for carrying out voice recognition on the voice test data to obtain a voice data text;
the word segmentation module is further used for carrying out word segmentation on the voice data text to obtain words corresponding to the voice data text;
the characteristic information adding module is further used for adding pause symbols to the voice data text; and adding pause information to the voice data text based on the pause symbol.
25. The apparatus of claim 20, wherein the characteristic information further comprises: position information of the word and part-of-speech information of the word.
26. The apparatus according to claim 20, wherein the corpus comprises pause symbols for identifying pause information of the corpus, the feature information adding module is further configured to: adding pause information to the word based on the pause symbol of the corpus.
27. The apparatus of claim 26, wherein the feature information addition module is further for:
marking pause information of the word in the training corpus that immediately precedes the pause symbol as a first mark;
and marking the pause information of other words as second marks.
28. The apparatus of claim 20, wherein the word segmentation module is further to:
and performing word segmentation on the training corpus by using a word segmentation dictionary.
29. The apparatus of claim 28, wherein the apparatus further comprises:
and the new word discovery module is used for discovering new words of the training corpus and adding the obtained new words into the word segmentation dictionary.
30. The apparatus of claim 21, wherein the parameter adjustment module is further used for:
adjusting the characteristic information frequency threshold parameter of the conditional random field algorithm within the range of 1 to 5, and adjusting the fitting parameter of the conditional random field algorithm within the range of 1 to 3.
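The parameter adjustment of claim 30 amounts to a small grid search over the two ranges. The sketch below uses a stand-in `evaluate` function in place of actually training a conditional random field (in CRF++ terms the two parameters correspond roughly to the `-f` frequency cutoff and `-c` option, though the patent names no toolkit).

```python
import itertools

def evaluate(freq_threshold, fit_param):
    """Stand-in for: train a CRF with these two parameters and measure
    sentence-breaking accuracy on held-out data.  A real system would
    call a CRF toolkit here instead of this toy scoring surface."""
    return -((freq_threshold - 2) ** 2 + (fit_param - 3) ** 2)

# Claim 30: frequency threshold in [1, 5], fitting parameter in [1, 3].
grid = itertools.product(range(1, 6), range(1, 4))
best = max(grid, key=lambda params: evaluate(*params))
print(best)  # (2, 3), the optimum of the toy surface
```

The parameter pair with the best held-out accuracy would be kept for the final text sentence-breaking model.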
31. The apparatus of claim 21, wherein the testing module is further used for:
performing sentence breaking on the test data by using the text sentence-breaking model to obtain a plurality of primary sentence-breaking results;
and calculating the total sentence probability of each primary sentence-breaking result by using an n-gram language model trained on language data with standard sentence breaks, and taking the primary sentence-breaking result with the highest total sentence probability as the sentence-breaking result.
32. The apparatus of claim 31, wherein the total sentence probability of a primary sentence-breaking result is the product of the sentence probabilities of its clauses.
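Claims 31 and 32 can be illustrated with a toy bigram model: each candidate sentence-breaking result is scored as the product of its clause probabilities, and the highest-scoring candidate is kept. The corpus, the add-one smoothing, and all names below are placeholder assumptions, not the patent's n-gram model.

```python
from collections import Counter
from math import prod

# Tiny stand-in for language data with standard sentence breaks; a
# real system would train a proper n-gram model with smoothing.
corpus = [["我", "想", "查", "话费"], ["请", "帮", "我", "查", "话费"]]
unigrams, bigrams = Counter(), Counter()
for clause in corpus:
    toks = ["<s>"] + clause + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks, toks[1:]))

def clause_prob(clause):
    """Bigram probability of one clause, with add-one smoothing as a
    placeholder for whatever smoothing a real model would use."""
    toks = ["<s>"] + clause + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
    return p

def total_prob(candidate):
    # Claim 32: the total sentence probability is the product of the
    # probabilities of the candidate's clauses.
    return prod(clause_prob(c) for c in candidate)

cand_a = [["我", "想", "查", "话费"]]    # kept as one clause
cand_b = [["我", "想"], ["查", "话费"]]  # broken into two clauses
best = max([cand_a, cand_b], key=total_prob)  # cand_a on this toy model
```

The highest-scoring candidate is then returned as the sentence-breaking result, mirroring claim 31.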
33. An apparatus for sentence breaking, the apparatus comprising:
the text acquisition module is used for acquiring a text of the sentence to be broken;
and the sentence break module is used for inputting the text of the sentence to be broken into a text sentence break model to obtain a sentence break result, wherein the text sentence break model is obtained by training by adopting the method for establishing the text sentence break model according to any one of claims 1 to 13.
34. The apparatus of claim 33, wherein the text acquisition module comprises:
the voice acquisition unit is used for acquiring voice data of the sentence to be broken;
and the voice recognition unit is used for carrying out voice recognition on the voice data of the sentence to be broken and taking a recognition result as a text of the sentence to be broken.
CN201710458179.9A 2017-06-16 2017-06-16 Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment Active CN107247706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710458179.9A CN107247706B (en) 2017-06-16 2017-06-16 Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment


Publications (2)

Publication Number Publication Date
CN107247706A CN107247706A (en) 2017-10-13
CN107247706B true CN107247706B (en) 2021-06-25

Family

ID=60018228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710458179.9A Active CN107247706B (en) 2017-06-16 2017-06-16 Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN107247706B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844480B (en) * 2017-10-21 2021-04-30 科大讯飞股份有限公司 Method and system for converting written text into spoken text
CN109979435B (en) * 2017-12-28 2021-10-22 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110209807A (en) 2018-07-03 2019-09-06 腾讯科技(深圳)有限公司 A kind of method of event recognition, the method for model training, equipment and storage medium
CN108831481A (en) * 2018-08-01 2018-11-16 平安科技(深圳)有限公司 Symbol adding method, device, computer equipment and storage medium in speech recognition
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing
CN111160004B (en) * 2018-11-07 2023-06-27 北京猎户星空科技有限公司 Method and device for establishing sentence-breaking model
CN109461438B (en) * 2018-12-19 2022-06-14 合肥讯飞数码科技有限公司 Voice recognition method, device, equipment and storage medium
CN109684638B (en) * 2018-12-24 2023-08-11 北京金山安全软件有限公司 Clause method and device, electronic equipment and computer readable storage medium
CN109783648B (en) * 2018-12-28 2020-12-29 北京声智科技有限公司 Method for improving ASR language model by using ASR recognition result
CN109637537B (en) * 2018-12-28 2020-06-30 北京声智科技有限公司 Method for automatically acquiring annotated data to optimize user-defined awakening model
CN110209446B (en) * 2019-04-23 2021-10-01 华为技术有限公司 Method and device for configuring combined slot in man-machine conversation system
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110619868B (en) * 2019-08-29 2021-12-17 深圳市优必选科技股份有限公司 Voice assistant optimization method, voice assistant optimization device and intelligent equipment
CN110705254B (en) * 2019-09-27 2023-04-07 科大讯飞股份有限公司 Text sentence-breaking method and device, electronic equipment and storage medium
CN111259163A (en) * 2020-01-14 2020-06-09 北京明略软件系统有限公司 Knowledge graph generation method and device and computer readable storage medium
CN111753524A (en) * 2020-07-01 2020-10-09 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN112002328B (en) * 2020-08-10 2024-04-16 中央广播电视总台 Subtitle generation method and device, computer storage medium and electronic equipment
CN112307167A (en) * 2020-10-30 2021-02-02 广州华多网络科技有限公司 Text sentence cutting method and device, computer equipment and storage medium
CN114613357A (en) * 2020-12-04 2022-06-10 广东博智林机器人有限公司 Voice processing method, system, electronic device and storage medium
CN112786023B (en) * 2020-12-23 2024-07-02 竹间智能科技(上海)有限公司 Mark model construction method and voice broadcasting system
CN113970910B (en) * 2021-09-30 2024-03-19 中国电子技术标准化研究院 Digital twin equipment construction method and system
CN115579009B (en) * 2022-12-06 2023-04-07 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
CN105244022A (en) * 2015-09-28 2016-01-13 科大讯飞股份有限公司 Audio and video subtitle generation method and apparatus
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on voice identification
CN106331893A (en) * 2016-08-31 2017-01-11 科大讯飞股份有限公司 Real-time subtitle display method and system
US9645988B1 (en) * 2016-08-25 2017-05-09 Kira Inc. System and method for identifying passages in electronic documents

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9092424B2 (en) * 2009-09-30 2015-07-28 Microsoft Technology Licensing, Llc Webpage entity extraction through joint understanding of page structures and sentences
CN104750687B (en) * 2013-12-25 2018-03-20 株式会社东芝 Improve method and device, machine translation method and the device of bilingualism corpora
CN104598510A (en) * 2014-10-16 2015-05-06 苏州大学 Event trigger word recognition method and device
CN105718586B (en) * 2016-01-26 2018-12-28 中国人民解放军国防科学技术大学 The method and device of participle
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Kaixu et al.; "A CRF-based Method for Automatic Sentence Segmentation and Punctuation of Ancient Chinese"; Journal of Tsinghua University (Science and Technology); Oct. 30, 2009; Vol. 49, No. 10; pp. 1733-1736 *


Similar Documents

Publication Publication Date Title
CN107247706B (en) Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment
CN109145282B (en) Sentence-breaking model training method, sentence-breaking device and computer equipment
Tabassum et al. A survey on text pre-processing & feature extraction techniques in natural language processing
JP5901001B1 (en) Method and device for acoustic language model training
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
US9575955B2 (en) Method of detecting grammatical error, error detecting apparatus for the method, and computer-readable recording medium storing the method
US10741092B1 (en) Application of high-dimensional linguistic and semantic feature vectors in automated scoring of examination responses
CN104484322A (en) Methods and systems for automated text correction
WO2021208460A1 (en) Sentence completion method and device, and readable storage medium
CN111104803B (en) Semantic understanding processing method, device, equipment and readable storage medium
US20240028650A1 (en) Method, apparatus, and computer-readable medium for determining a data domain associated with data
Dien et al. POS-tagger for English-Vietnamese bilingual corpus
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN115017870A (en) Closed-loop dialect expanding writing method and device, computer equipment and storage medium
KR102206781B1 (en) Method of fake news evaluation based on knowledge-based inference, recording medium and apparatus for performing the method
Chennoufi et al. Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization
CN109977391B (en) Information extraction method and device for text data
CN112487813B (en) Named entity recognition method and system, electronic equipment and storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN112733517B (en) Method for checking requirement template conformity, electronic equipment and storage medium
Hahn et al. Optimizing CRFs for SLU tasks in various languages using modified training criteria
CN114707489B (en) Method and device for acquiring annotation data set, electronic equipment and storage medium
Hamza et al. Identification of sentence context based on thematic role rules for Malay short essay assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant