CN107247706B - Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment - Google Patents


Info

Publication number
CN107247706B
CN107247706B (application CN201710458179.9A)
Authority
CN
China
Prior art keywords
sentence
information
word
breaking
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710458179.9A
Other languages
Chinese (zh)
Other versions
CN107247706A (en)
Inventor
谢瑜
张昊
朱频频
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Shanghai Xiaoi Robot Technology Co Ltd
China Electronics Standardization Institute
Original Assignee
BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Shanghai Xiaoi Robot Technology Co Ltd
China Electronics Standardization Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD, Shanghai Xiaoi Robot Technology Co Ltd, China Electronics Standardization Institute filed Critical BEIJING SAIXI TECHNOLOGY DEVELOPMENT CO LTD
Priority: CN201710458179.9A
Publication of CN107247706A
Application granted
Publication of CN107247706B
Legal status: Active
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text sentence-breaking model establishing method, a sentence-breaking method, a sentence-breaking device and computer equipment. The model establishing method comprises the following steps: performing word segmentation on a training corpus to obtain the words corresponding to the training corpus; adding feature information to the words, the feature information comprising pause information; and training the words corresponding to the training corpus based on their feature information with a conditional random field algorithm to obtain the text sentence-breaking model. Corresponding to the method, the invention also provides a sentence-breaking method, a sentence-breaking device and computer equipment.

Description

Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment
Technical Field
The invention relates to the technical field of intelligent interaction, and in particular to a method and a device for establishing a text sentence-breaking model.
Background
At present, communication via voice is increasingly common. After voice content is stored in text form, the stored text usually has no punctuation marks, and often no pause information at all, which creates a barrier to reading and understanding the stored text.
Disclosure of Invention
The invention provides a method for establishing a text sentence-breaking model that can break sentences more accurately on data without pause information.
According to the above object, the present invention provides a method for establishing a text sentence-breaking model, the method comprising: performing word segmentation on a training corpus to obtain the words corresponding to the training corpus; adding feature information to the words, the feature information comprising pause information; and training the words corresponding to the training corpus based on their feature information with a conditional random field algorithm to obtain the text sentence-breaking model.
In an embodiment, the method further comprises: breaking sentences of test data with the text sentence-breaking model to obtain a sentence-breaking result; judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold; and if not, adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm and retraining on the training corpus, repeatedly, until the accuracy of the sentence-breaking result of the retrained model is greater than or equal to the accuracy threshold, the model so obtained being taken as the final text sentence-breaking model.
Corresponding to the method, the invention also provides a device for establishing the text sentence-breaking model, the device comprising: a word segmentation module for performing word segmentation on a training corpus to obtain the words corresponding to the training corpus; a feature information adding module for adding feature information to the words, the feature information comprising pause information; and a training module for training the words corresponding to the training corpus based on their feature information with a conditional random field algorithm to obtain the text sentence-breaking model.
In one embodiment, the apparatus further comprises: a test module for breaking sentences of test data with the text sentence-breaking model to obtain a sentence-breaking result; and an accuracy judging module for judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold. If the judgment of the accuracy judging module is negative, the apparatus further comprises a parameter adjusting module for adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm until the accuracy of the sentence-breaking result of the model retrained on the training corpus after the adjustment is greater than or equal to the accuracy threshold, the model so obtained being taken as the final text sentence-breaking model.
Applying the conditional random field algorithm to the training of the sentence-breaking model fully exploits the algorithm's advantages, so that the sentence-breaking accuracy of the resulting model is higher.
Drawings
FIG. 1 is a flow chart of one aspect of the text sentence-breaking model establishing method of the present invention;
FIG. 2 is a flow chart of another aspect of the text sentence-breaking model establishing method of the present invention;
FIG. 3 is a flow chart of a method of breaking sentences of voice data;
FIG. 4 is a schematic diagram of a text sentence-breaking model establishing device according to an aspect of the present invention.
Detailed Description
In order to add pause information to text without pause marks, the invention provides a method for establishing a text sentence-breaking model.
In one embodiment, referring to fig. 1, fig. 1 is a flow chart of an aspect of a method for text sentence break model building according to the present invention, the method comprising:
step 101: performing word segmentation on the training corpus to obtain words corresponding to the training corpus;
step 102: adding characteristic information to the word, wherein the characteristic information comprises pause information;
step 103: and training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain a text sentence break model.
The training corpus is text data that at least has pause information. The corpus may carry punctuation marks, which serve as its pause marks; in that case pause information can be labeled based on the punctuation marks together with manual checking. The corpus may also carry no punctuation marks, in which case pause information is labeled by manual checking alone. That is, the sentence-breaking model must be trained with data that already carries sentence-breaking information: the regularities of the pause information in the training corpus are found, and the sentence-breaking model is built from them.
Because the sentences and words used in different fields have certain regularity, sentence-breaking models corresponding to the fields can be trained respectively aiming at the different fields, for example, different sentence-breaking models can be respectively established for the fields of telecom customer service, military, finance, science and technology and the like.
Because words are the basic units of which text is composed, and no pause information appears inside a word, step 101 performs word segmentation on the training corpus so that the text to be sentence-broken is converted into the corresponding words.
Words have many attributes, i.e. features, such as part of speech, semantics, sentence component (e.g. subject, predicate, object, etc.), and so on. These attributes are usually related to whether a word sits at a pause in a sentence, and by considering them together with the relative positions of the words, it can be derived which words should sit at the pauses of a sentence.
For example, in the sentence "Shanghai GM recently formally signed a procurement agreement worth $306 million with General Motors (US), for purchasing the latter's complete vehicles and parts," the words "agreement" and "parts" at the pauses are both nouns; that is, under certain conditions, pause information is highly likely to appear after a noun.
Step 102 is executed to add feature information to the words, the feature information comprising pause information. The sentence-breaking model is trained on the existing pause information of the words of the training corpus: the pause regularities of the words in the corpus that already carries pause information are found, and the sentence-breaking model is built from them.
Preferably, the sentence-breaking model is trained based on the pause information of the training corpus.
Preferably, the feature information further includes position information of the word and part-of-speech information of the word.
Preferably, the characteristic information further includes sentence component information.
The step of adding feature information to the words may further comprise adding sentence component information to the words. In particular, sentence component information may be added to a word by syntactic parsing.
In one embodiment, the pause information of a word in the training corpus that immediately precedes a pause symbol is marked with a first mark, and the pause information of every other word is marked with a second mark.
For example, take the sentence "Shanghai GM recently formally signed a procurement agreement worth $306 million with General Motors (US), for purchasing the latter's complete vehicles and parts." The results of word segmentation and of adding pause information are shown in Table 1.

TABLE 1 Word segmentation and added pause information

Word               Pause info    Word                 Pause info
Shanghai GM        S             's                   S
recently           S             procurement          S
formally           S             agreement            E
with               S             for                  S
US                 S             purchasing           S
General Motors     S             the latter           S
signed             S             's                   S
an amount of       S             complete vehicles    S
reaching           S             and                  S
$306 million       S             parts                E

where S indicates that the word is not at a pause of the text, and E indicates that the word is at a pause of the sentence.
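The labeling rule above (E for a word immediately before a pause, S for every other word) can be sketched in a few lines. This is a hypothetical illustration, not code from the patent: the input is assumed to be already-segmented words interleaved with a pause symbol, and the function name is invented.

```python
# Hypothetical sketch of how Table 1 could be produced: given a segmented
# sentence whose pauses are marked by a pause symbol ("|" here), label the
# word immediately before each pause "E" and every other word "S".

PAUSE = "|"

def add_pause_labels(tokens):
    """tokens: segmented words interleaved with PAUSE symbols."""
    labeled = []
    for i, tok in enumerate(tokens):
        if tok == PAUSE:
            continue
        # A word is at a pause if the next token is the pause symbol.
        label = "E" if i + 1 < len(tokens) and tokens[i + 1] == PAUSE else "S"
        labeled.append((tok, label))
    return labeled

tokens = ["Shanghai GM", "signed", "a", "procurement", "agreement", "|",
          "for", "purchasing", "parts", "|"]
print(add_pause_labels(tokens))
```

Running this yields one (word, mark) pair per word, with "agreement" and "parts" labeled E, mirroring Table 1.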
Preferably, other feature information of the words is considered simultaneously when training the sentence-breaking model; it can be expected that the more feature information of the words is applied, the higher the accuracy of the trained model.
In one embodiment, the feature information added to the words further includes position information of the words and part-of-speech information of the words. The position information of a word is its position relative to the other words, as mentioned above; in the foregoing example, "recently" lies between "Shanghai GM" and "formally". The range covered by the position information can be chosen as needed; for example, one may also consider that the word before "US" is "with", the word after it is "General Motors", the second word before it is "formally", and the second word after it is "signed". The wider the range covered by the position information, the higher the sentence-breaking accuracy of the trained sentence-breaking model.
Each word in table 1 already contains the mutual position information of each word, and the range of the contained position information may be selected as needed during training.
After part-of-speech information is added to the words, the segmentation results carrying both pause information and part-of-speech information are as shown in Table 2.
TABLE 2 word segmentation and addition of pause information and part-of-speech information
[Table 2 appears as images in the original publication; it repeats the words of Table 1, annotated additionally with part-of-speech tags.]
The part of speech denoted by each letter is as follows: noun (n), time word (t), place word (s), direction word (f), numeral (m), measure word (q), distinguishing word (b), pronoun (r), verb (v), adjective (a), state word (z), adverb (d), preposition (p), conjunction (c), auxiliary word (u), modal word (y), interjection (e), onomatopoeia (o), idiom (i), fixed expression (l), abbreviation (j), preceding component (h), following component (k), morpheme (g), non-morpheme character (x), and punctuation mark (w).
Preferably, sentence component information is added to the words at the same time. Sentence component information includes subject, predicate, object, attributive, adverbial, complement, and the like.
In one embodiment, sentence component information is added to a word by parsing the word.
In one embodiment, semantic information is added to a word; the semantic information is determined by the word's own meaning, which may be obtained from its textual form. Different textual forms may share the same meaning, and in one embodiment words with different textual forms but the same meaning are mapped to the same word.
Step 103 can then be executed: based on the attributes of the words and their position information, the words corresponding to the training corpus are trained with the conditional random field algorithm to obtain the text sentence-breaking model. Model training proceeds word by word.
In order to select the attributes of each word and the range of its position information more efficiently when training on each word, in one embodiment the conditional random field algorithm extracts, according to a preset feature template, the words and feature information specified by that template, and trains on the training corpus to obtain the text sentence-breaking model. The extracted feature information at least includes pause information, and the preset feature template designates the words, and their feature information, whose relationship to the current word being trained satisfies a preset requirement.
That is, the feature template specifies the words used for training and the feature information corresponding to those words. In one embodiment, the relationships to the current word that a feature template can represent include any one or more of the following combinations of information: the semantic information and pause information of the current word; the part-of-speech information and pause information of the current word; the semantic and pause information of the previous word combined with the semantic and pause information of the current word; the semantic and pause information of the current word combined with the part-of-speech and pause information of the next word; the part-of-speech and pause information of the previous word combined with the part-of-speech and pause information of the current word; and the part-of-speech and pause information of the previous, current and next words combined.
The semantic information is expressed by the text of the word itself, or, after word segmentation, it can be obtained by matching the current word according to a preset semantic-information matching method, such as matching against preset word vectors, a preset synonym library, or a preset near-synonym library. Accordingly, the semantic information can be represented by the characters themselves, by a word vector, or by the preset synonym or near-synonym library in which it is found.
By adopting the characteristic template, the characteristic information of the word used for training not only comprises pause information, but also comprises position information of the word, part-of-speech information of the word and the like.
That is, one feature template may specify training that considers only the semantic information of the current word, another may specify training that considers the part-of-speech information of the current word, and yet another may specify training that considers the combination of the part-of-speech information of the previous, current and next words. Of course, the feature templates are not limited to the above list, and various training scopes fall within the scope of the present invention.
A textual representation of feature templates is given below (each feature template extracts, by default, the pause information of the words involved):

U03:%x[0,0]                      # semantics of the current word
U04:%x[0,1]                      # part of speech of the current word
U05:%x[-1,0]/%x[0,0]             # semantics of the previous word / semantics of the current word
U06:%x[0,0]/%x[1,1]              # semantics of the current word / part of speech of the next word
U20:%x[-2,1]/%x[-1,1]/%x[0,1]    # parts of speech of the second-previous, previous and current words
U24:%x[-1,1]/%x[0,1]/%x[1,1]     # parts of speech of the previous, current and next words

The code %x[0,0] in template U03 means the model is trained with the semantic information of the current word, the code %x[0,1] in template U04 means the model is trained with the part-of-speech information of the current word, and the meanings of the other codes follow by analogy.
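As an illustration of how such templates operate, the sketch below expands a CRF++-style %x[row,col] reference over a segmented sentence, with column 0 holding a word's semantics and column 1 its part of speech, matching the templates above. The helper name and the boundary marker are assumptions for this sketch, not part of the patent.

```python
# Minimal sketch of CRF++-style feature-template expansion. %x[r,c] picks
# column c of the word r positions away from the current word; positions
# outside the sentence fall back to a boundary marker.

import re

def expand_template(template, sentence, i):
    """Expand one template, e.g. 'U24:%x[-1,1]/%x[0,1]/%x[1,1]', at position i."""
    def pick(match):
        row, col = int(match.group(1)), int(match.group(2))
        j = i + row
        if 0 <= j < len(sentence):
            return sentence[j][col]
        return "_B_"  # boundary padding outside the sentence

    return re.sub(r"%x\[(-?\d+),(\d+)\]", pick, template)

# (semantics, part of speech) pairs for a fragment of the running example
sentence = [("signed", "v"), ("agreement", "n"), ("for", "p")]
print(expand_template("U24:%x[-1,1]/%x[0,1]/%x[1,1]", sentence, 1))  # → U24:v/n/p
```

Each expanded string like "U24:v/n/p" becomes one feature of the current word; the CRF learns a weight linking such features to the S/E pause labels.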
When training the model, one feature template may be selected, or several feature templates may be used at the same time. The more feature templates applied, the more words and corresponding feature information are considered, the better the training effect, and the higher the sentence-breaking accuracy of the trained model.
To further improve the accuracy of the trained sentence break model, in one embodiment, please refer to fig. 2, fig. 2 shows a flowchart of another aspect of a method for building a text sentence break model according to the present invention, the method includes:
step 201: performing word segmentation on the training corpus to obtain words corresponding to the training corpus;
step 202: adding characteristic information to the word, wherein the characteristic information comprises pause information;
step 203: and training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain a text sentence break model.
Step 204: using a text sentence-breaking model to break sentences of the test data to obtain sentence-breaking results;
step 205: judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold; if not, go to step 206, and if so, go to step 208.
Step 206: adjusting a characteristic information time threshold parameter and/or a fitting parameter of the conditional random field algorithm;
step 207: judging whether the accuracy of the sentence-breaking result of the text sentence-breaking model retrained after the adjustment is greater than or equal to the accuracy threshold; if not, returning to step 206; if yes, proceeding to step 208;
step 208: end.
After the sentence-breaking model trained on the training corpus is obtained, it is tested with test data; that is, its performance is tested with data whose pause information is known.
For example, suppose the sentence "Shanghai GM recently formally signed a procurement agreement worth $306 million with General Motors (US), for purchasing the latter's complete vehicles and parts." is known together with its pause information. The same text with its pauses removed is used as test data and input into the trained sentence-breaking model; the model outputs a sentence-breaking result for the test data, and comparing that result with the known one allows the accuracy of the sentence-breaking model to be judged.
When the accuracy is greater than or equal to the accuracy threshold, the accuracy of the sentence-breaking model is considered high enough, and the model can be used to break sentences of other texts.
Of course, the testing process can be repeated, the overall accuracy across the tests measured statistically, and a judgment made as to whether the overall accuracy meets the requirement.
If the accuracy obtained in the test does not reach the accuracy threshold, the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm is adjusted and the model retrained on the training corpus, repeatedly, until the accuracy of the retrained model's sentence-breaking result is greater than or equal to the accuracy threshold; the model so obtained is taken as the final text sentence-breaking model.
The feature information frequency threshold parameter is a threshold on the number of times an interrelationship formed by the feature information of words must occur in the training corpus.
For example, when the training corpus is trained with the feature template "the combination of the parts of speech of the previous, current and next words," an interrelationship of feature information such as "noun adverb noun" denotes the feature-information relationship of every word whose own part of speech is adverb and whose previous and next words are nouns; the frequency threshold parameter then requires such a relationship to occur at least a given number of times in the corpus before it is used.
The fitting parameter is a hyper-parameter of the conditional random field algorithm that balances over-fitting against under-fitting: the larger its value, the more closely the algorithm fits the training data.
Based on experimental experience, in one embodiment the feature information frequency threshold parameter of the conditional random field algorithm is adjusted in the range of 1 to 5, and its fitting parameter is adjusted in the range of 1 to 3.
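The adjustment loop of steps 205 to 207 can be sketched as a small grid search over the two parameter ranges just mentioned. Everything here is illustrative: train_model and evaluate stand in for the real CRF training and test runs, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
# Hypothetical sketch of the parameter-adjustment loop: grid-search the
# feature-frequency threshold (1..5) and the fitting parameter (1..3)
# until a trained model reaches the accuracy threshold.

from itertools import product

def tune(train_model, evaluate, accuracy_threshold):
    for freq_threshold, fit_param in product(range(1, 6), range(1, 4)):
        model = train_model(freq_threshold, fit_param)
        if evaluate(model) >= accuracy_threshold:
            return model, freq_threshold, fit_param
    return None  # no setting reached the threshold

# Toy stand-ins: pretend accuracy improves with the fitting parameter.
fake_train = lambda f, c: {"freq": f, "fit": c}
fake_eval = lambda m: 0.80 + 0.05 * m["fit"]

model, f, c = tune(fake_train, fake_eval, 0.90)
print(f, c)
```

In practice the inner calls would retrain the CRF on the training corpus and re-run the test of steps 204 and 205, returning as soon as the accuracy threshold is met.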
In one embodiment, the sentence-breaking model is tested with voice input. In that case, voice recognition is first performed on the voice test data to convert it into text data, i.e. a voice data text, and then the sentence-breaking model is tested with the voice data text.
Preferably, pause symbols are added to the voice data text. Since this is test data, the pause symbols can be added manually, and any symbol that marks a pause can be used.
With the pause symbols in place, pause information can be added to the voice data text: wherever a pause symbol appears is a pause.
A text with pause information obtained in this way can then be used to test the sentence-breaking accuracy of the sentence-breaking model.
If the sentence-breaking model is tested with voice data, the preceding processing comprises: performing voice recognition on the voice test data to obtain a voice data text; performing word segmentation on the voice data text to obtain the corresponding words; adding pause symbols to the voice data text; and adding pause information to the voice data text based on the pause symbols.
In one embodiment, the word segmentation operation may be performed with a word segmentation dictionary. A large number of words are recorded in the dictionary; the text to be segmented is compared with the words in the dictionary, and whenever characters in the text match a word present in the dictionary, those characters are taken as a word.
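One common concrete realization of this dictionary comparison is forward maximum matching. The sketch below is a minimal, hypothetical version: the dictionary entries are illustrative, and characters that match no dictionary word fall back to single-character words.

```python
# Minimal sketch of dictionary-based word segmentation by forward maximum
# matching: at each position, take the longest dictionary word that matches.

def forward_max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in dictionary:
                words.append(chunk)
                i += size
                break
    return words

dictionary = {"采购", "协议", "整车", "零部件"}
print(forward_max_match("采购整车和零部件", dictionary))
```

On the fragment above this yields 采购 / 整车 / 和 / 零部件, i.e. "procurement / complete vehicles / and / parts"; adding newly discovered words to the dictionary, as described below, directly improves such matching.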
In one embodiment, new words are discovered in the training corpus and added to the word segmentation dictionary. When the corpus is found to contain words not in the dictionary, they can be identified manually or by a new-word discovery method and then added to the dictionary.
In one embodiment, breaking sentences of test data with the text sentence-breaking model comprises: breaking sentences of the test data with the model to obtain several primary sentence-breaking results; computing the total sentence probability of each primary result with an n-gram language model trained on standard sentence-broken language data; and taking the primary result with the highest total sentence probability as the sentence-breaking result.
Because the conditional random field algorithm is probability-based, the sentence-breaking model trained with it can output several sentence-breaking results ranked by their sentence-breaking probabilities.
The total sentence probability of each primary sentence-breaking result is then computed with an n-gram language model trained on standard sentence-broken language data, and the primary result with the highest total probability is taken as the sentence-breaking result. Determining the final result through this multi-level screening improves sentence-breaking accuracy.
The n-gram language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, and that the probability of a complete sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by counting, directly in the corpus, the number of times n words occur together.
In one embodiment, the total sentence probability of the primary sentence-break result is the product of the sentence-forming probabilities of the clauses of the primary sentence-break result.
Turning to fig. 3, fig. 3 shows a flow chart of a method of sentence-breaking voice data.
Step 301: carrying out voice recognition on voice data to generate a voice data text;
step 302: inputting the voice data text into a sentence-breaking model to obtain a plurality of primary sentence-breaking results;
step 303: and inputting the plurality of primary sentence-breaking results into the n-gram language model, judging the total sentence probability of each primary sentence-breaking result, and taking the primary sentence-breaking result with the highest total sentence probability as a final sentence-breaking result.
Because the sentence-breaking model is trained with the conditional random field algorithm, which is probability-based, inputting the voice data text into the model in step 302 yields several primary sentence-breaking results ordered from high to low probability of being correct.
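Steps 302 and 303 can be sketched with a toy bigram (n = 2) model: count bigrams on sentence-broken corpus data, score each candidate result's clauses as products of bigram probabilities, take the candidate's total probability as the product over its clauses, and keep the highest. The corpus, tokens, and function names below are all illustrative.

```python
# Hypothetical sketch of n-gram reranking of candidate sentence-break results.

from collections import Counter
from functools import reduce

corpus = [["we", "signed", "the", "agreement"],
          ["we", "purchased", "the", "parts"]]

# Bigram and unigram counts with sentence-boundary markers.
bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

def clause_prob(clause):
    """P(clause) as the product of bigram probabilities (0 if unseen)."""
    toks = ["<s>"] + clause + ["</s>"]
    p = 1.0
    for a, b in zip(toks[:-1], toks[1:]):
        p *= bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
    return p

def total_prob(candidate):
    """Total probability of one sentence-break result: product over clauses."""
    return reduce(lambda x, y: x * y, (clause_prob(c) for c in candidate), 1.0)

candidates = [
    [["we", "signed", "the", "agreement"]],    # one clause
    [["we", "signed"], ["the", "agreement"]],  # broken in the middle
]
best = max(candidates, key=total_prob)
print(best)
```

Here the candidate broken mid-phrase scores zero (its clauses contain bigrams never seen at a sentence boundary), so the unbroken reading wins, which is exactly the screening step 303 performs on the model's primary results.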
The invention also provides a sentence-breaking method, which first obtains the text to be sentence-broken.
Specifically, the method comprises the following steps: acquiring voice data of a sentence to be punctuated;
and carrying out voice recognition on the voice data of the sentence to be punctuated, and taking a recognition result as the text of the sentence to be punctuated.
The text to be broken is then input into any of the trained sentence-breaking models described above, thereby completing the sentence breaking of the voice data.
The invention also provides computer equipment comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes any one of the text sentence break model establishing methods.
The invention also provides a computer storage medium, wherein the storage medium is stored with instructions, and the instructions execute any one of the text sentence break model establishing methods when running.
In view of the foregoing method, the present invention further provides a device for building a text sentence-breaking model, and fig. 4 shows a structural diagram of a device for building a text sentence-breaking model according to an aspect of the present invention.
The device comprises: a word segmentation module 401, configured to perform word segmentation on the training corpus to obtain words corresponding to the training corpus; a feature information adding module 402, configured to add feature information to the word, where the feature information includes pause information; the training module 403 is configured to train, by using a conditional random field algorithm, words corresponding to the training corpus based on the feature information of the words to obtain a text sentence-breaking model.
Because words are the basic units of a text and a sentence break never falls inside a word, the word segmentation module 401 performs word segmentation on the training corpus, converting the text that needs sentence breaking into its corresponding words.
The feature information adding module 402 adds feature information to the words, the feature information including pause information. The sentence-breaking model is trained with the pause information already present in the words of the training corpus; that is, the pause patterns of the words in the pause-annotated training corpus are learned, and the sentence-breaking model is built from those patterns.
After the sentence-breaking model trained on the training corpus is obtained, it is tested with test data; that is, the performance of the sentence-breaking model is evaluated on data whose pause information is known.
More preferably, the apparatus further comprises: the test module 404 is configured to perform sentence breaking on the test data by using a text sentence breaking model to obtain a sentence breaking result; an accuracy determining module 405, configured to determine whether an accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold; and a parameter adjusting module 406, configured to, if the determination of the accuracy determining module 405 is negative, adjust the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm until the accuracy of the sentence breaking result of the sentence breaking of the training corpus by the text sentence breaking model obtained through training after the feature information frequency threshold parameter and/or the fitting parameter is adjusted is greater than or equal to an accuracy threshold, and then use the text sentence breaking model obtained through training after the adjustment as a final text sentence breaking model.
In an embodiment, the test data is voice test data, and the test module further includes: a voice recognition module, configured to perform voice recognition on the voice test data to obtain a voice data text; and a sentence-breaking module, configured to perform sentence breaking on the voice data text by using the text sentence-breaking model to obtain a sentence-breaking result.
The recognized voice test data may lack pause information. In an embodiment, the accuracy determining module further includes: a pause symbol adding module, configured to add pause symbols to the voice data text; a pause information adding module, configured to add pause information to the voice data text based on the pause symbols; a calculation module, configured to calculate the accuracy of the sentence-breaking result based on the pause information of the voice data text; and a judging module, configured to judge whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold.
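One plausible reading of the calculation module, assuming accuracy is measured per word over break/no-break labels (the patent does not fix the exact formula, so this is an illustrative sketch), is:

```python
def break_accuracy(predicted_labels, reference_labels):
    """Per-word accuracy of predicted break labels ('1' = a break
    follows this word, '0' = no break) against reference labels
    derived from the pause symbols added to the voice data text."""
    assert len(predicted_labels) == len(reference_labels)
    correct = sum(p == r for p, r in zip(predicted_labels, reference_labels))
    return correct / len(reference_labels)

ref  = ["0", "1", "0", "0", "1"]   # gold pause information
pred = ["0", "1", "0", "1", "1"]   # model output: 4 of 5 labels match
print(break_accuracy(pred, ref))   # 0.8
```

The judging module would then simply compare this value against the accuracy threshold.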
In an embodiment in which the training corpus is a speech corpus, the apparatus further includes: a voice recognition module, configured to perform voice recognition on the speech corpus to obtain a voice data text. The word segmentation module is further configured to perform word segmentation on the voice data text to obtain words corresponding to the voice data text. The feature information adding module is further configured to add pause symbols to the voice data text, and to add pause information to the voice data text based on the pause symbols.
In one embodiment, the feature information further includes: position information of the word and part-of-speech information of the word.
In an embodiment, the characteristic information adding module is further configured to: sentence component information is added to the words.
In an embodiment, the characteristic information adding module is further configured to: sentence component information is added to the word by parsing the word.
The sentence component information includes subjects, predicates, objects, attributives, adverbials, complements, and the like.
To select the attributes of each word and the range of its position information more efficiently during training, in one embodiment the training module is further configured to: use the conditional random field algorithm to extract, according to a preset feature template, the words and their feature information corresponding to the template, so as to train on the training corpus and obtain the text sentence-breaking model, where the preset feature template represents the words, and their feature information, whose relation to the current word being trained meets preset requirements.
In one embodiment, the relationship that the feature template represents with respect to the current word being trained includes any one or more of the following combinations of information: the semantic information and pause information of the current word; the part-of-speech information and pause information of the current word; the semantic information and pause information of the previous word together with those of the current word; the semantic information and pause information of the current word together with the part-of-speech information and pause information of the next word; the part-of-speech information and pause information of the previous word together with those of the current word; and the part-of-speech information and pause information of the previous word, the current word, and the next word.
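Such a template can be realized as a feature-extraction function over a context window. The field names and the toy Chinese example below are illustrative assumptions, not the patent's actual template syntax:

```python
def template_features(words, pos, sem, pause, i):
    """Features for the current word i, covering the combinations the
    template may represent: the current word's semantic, part-of-speech,
    and pause information, plus the same information for the previous
    and next words when they exist."""
    f = {"sem": sem[i], "pos": pos[i], "pause": pause[i]}
    if i > 0:                      # previous-word combinations
        f.update({"-1:sem": sem[i-1], "-1:pos": pos[i-1], "-1:pause": pause[i-1]})
    if i < len(words) - 1:         # next-word combinations
        f.update({"+1:pos": pos[i+1], "+1:pause": pause[i+1]})
    return f

words = ["今天", "天气", "很好"]
pos   = ["NT", "NN", "VA"]           # illustrative part-of-speech tags
sem   = ["time", "weather", "good"]  # illustrative semantic labels
pause = ["0", "0", "1"]              # pause information marks
print(template_features(words, pos, sem, pause, 1))
```

A CRF toolkit would consume one such feature dictionary per word when training the sentence-breaking model.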
In an embodiment, the corpus includes pause symbols for identifying pause information of the corpus, and the feature information adding module is further configured to: adding pause information to the word based on the pause symbol of the corpus.
In an embodiment, the feature information adding module is further configured to: mark the pause information of the word in the training corpus that immediately precedes a pause symbol as a first mark; and mark the pause information of all other words as a second mark.
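A minimal sketch of this marking scheme, assuming for simplicity that the corpus is split into single-character "words" and that '，' is the pause symbol (both assumptions; the patent does not fix them):

```python
def label_pauses(corpus_text, pause_symbol="，"):
    """Strip pause symbols from the corpus and mark the word immediately
    before each pause symbol with the first mark '1'; every other word
    receives the second mark '0'."""
    words, labels = [], []
    for ch in corpus_text:
        if ch == pause_symbol:
            if labels:
                labels[-1] = "1"   # the word before the pause symbol
        else:
            words.append(ch)
            labels.append("0")
    return words, labels

words, labels = label_pauses("你好，再见")
print(list(zip(words, labels)))
# [('你', '0'), ('好', '1'), ('再', '0'), ('见', '0')]
```

The resulting word/label pairs form the supervised training data for the conditional random field.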
In one embodiment, the word segmentation module is further configured to: and performing word segmentation on the training corpus by using a word segmentation dictionary.
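Dictionary-based segmentation is commonly implemented with forward maximum matching; the following sketch is an assumption (the patent does not name a specific algorithm) and uses the classic ambiguous string 研究生命起源 ("study the origin of life") as an example:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in dictionary:
                words.append(cand)
                i += size
                break
    return words

dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", dictionary))
# ['研究生', '命', '起源'] — greedily taking 研究生 splits 生命 apart
```

The example also shows why new word discovery (described next) matters: enriching the dictionary changes which longest matches are available.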
In one embodiment, the apparatus further comprises: and the new word discovery module is used for discovering new words of the training corpus and adding the obtained new words into the word segmentation dictionary.
Based on experimental experience, in one embodiment, the parameter adjustment module is further configured to: and adjusting the characteristic information frequency threshold parameter of the conditional random field algorithm within the numerical range of 1 to 5, and adjusting the fitting parameter of the conditional random field algorithm within the numerical range of 1 to 3.
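The stated ranges suggest a small grid search. The sketch below assumes a `train_and_evaluate` callable standing in for retraining the CRF model and measuring sentence-breaking accuracy on the test data; the patent describes the adjustment only in prose, so names and the toy evaluator are illustrative:

```python
from itertools import product

def tune_crf_params(train_and_evaluate, threshold=0.95):
    """Search the ranges given above: feature information frequency
    threshold in 1..5, fitting parameter in 1..3. Return the first
    pair whose accuracy reaches the accuracy threshold."""
    for freq, fit in product(range(1, 6), range(1, 4)):
        acc = train_and_evaluate(freq, fit)
        if acc >= threshold:
            return freq, fit, acc
    return None  # no setting reached the threshold

# Toy evaluator for illustration: pretends accuracy peaks at freq=2, fit=3.
toy_eval = lambda freq, fit: 0.96 if (freq, fit) == (2, 3) else 0.90
print(tune_crf_params(toy_eval))  # (2, 3, 0.96)
```

In practice each call to `train_and_evaluate` would retrain the model, so the small 5 × 3 grid keeps the adjustment tractable.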
In one embodiment, the test module is further configured to: performing sentence breaking on the test data by using the text sentence breaking model to obtain a plurality of primary sentence breaking results; and respectively calculating the total sentence probability of each primary sentence-breaking result by using an n-gram language model obtained by training language data of standard sentence-breaking, and taking the corresponding primary sentence-breaking result with the highest total sentence probability as the sentence-breaking result.
Because the conditional random field algorithm is probability-based, the sentence-breaking model trained with it can output a plurality of sentence-breaking results ordered by their sentence-breaking probabilities.
At this point, the total sentence probability of each primary sentence-breaking result is calculated with an n-gram language model trained on correctly punctuated corpus data, and the primary sentence-breaking result with the highest total sentence probability is taken as the sentence-breaking result. The final sentence-breaking result is thus determined through multi-level screening, which improves sentence-breaking accuracy.
In one embodiment, the total sentence probability of the primary sentence-breaking result is the product of the sentence-forming probabilities of the clauses of the primary sentence-breaking result.
The specific implementation manner and technical effect of the device for establishing a text sentence-breaking model can refer to the embodiment of the method for establishing a text sentence-breaking model, and are not described herein again.
The invention also provides a sentence-breaking device, comprising: the text acquisition module is used for acquiring a text of the sentence to be broken; and the sentence break module is used for inputting the text to be broken into a text sentence break model to obtain a sentence break result, wherein the text sentence break model is obtained by training by adopting the method for establishing the text sentence break model.
In one embodiment, the text obtaining module includes: a voice obtaining unit, configured to obtain voice data of the sentence to be broken; and a voice recognition unit, configured to perform voice recognition on the voice data of the sentence to be broken and take the recognition result as the text of the sentence to be broken.
The present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the sentence-breaking method.
The invention also provides a computer storage medium, wherein the storage medium is stored with instructions, and the instructions execute the sentence-breaking method when running.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (34)

1. A method for establishing a text sentence-breaking model is characterized by comprising the following steps:
performing word segmentation on a training corpus to obtain words corresponding to the training corpus;
adding characteristic information to the words, wherein the characteristic information comprises pause information;
training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain the text sentence-breaking model; when model training is performed, training is carried out word by word;
the step of adding feature information to the word further comprises:
adding sentence component information to the word by syntactic analysis of the word;
adding semantic information to the word, wherein the semantic information is obtained through the character representation of the word;
the step of training further comprises:
extracting the words and the feature information thereof corresponding to a preset feature template by using a conditional random field algorithm to train the training corpus to obtain the text sentence break model, wherein the extracted feature information at least comprises the pause information, and the preset feature template is used for representing the words and the feature information thereof, the relation of which with the current word trained in the words meets the preset requirement;
the relationship that the feature template represents with the trained current word includes any one or more of the following combinations of information: semantic information of the current word and pause information of the current word; part-of-speech information of the current word and pause information of the current word; semantic information of a previous word, pause information of the previous word, semantic information of a current word and pause information of the current word; semantic information of the current word, pause information of the current word, part of speech information of the next word and pause information of the next word; part of speech information of the previous word, pause information of the previous word, part of speech information of the current word and pause information of the current word; the part of speech information of the previous word, the pause information of the previous word, the part of speech information of the current word, the pause information of the current word, the part of speech information of the next word and the pause information of the next word.
2. The method of claim 1, wherein the method further comprises:
using the text sentence-breaking model to break sentences of the test data to obtain sentence-breaking results;
judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold;
if not, adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm until the accuracy of the sentence-breaking result, on the training corpus, of the text sentence-breaking model obtained by training after adjusting the feature information frequency threshold parameter and/or the fitting parameter is greater than or equal to the accuracy threshold, and taking the text sentence-breaking model obtained by training after the adjustment as the final text sentence-breaking model.
3. The method of claim 2, wherein the test data is speech test data, and wherein the step of using the text sentence-breaking model to sentence the test data further comprises:
carrying out voice recognition on the voice test data to obtain a voice data text;
and carrying out sentence breaking on the voice data text by using the text sentence breaking model to obtain a sentence breaking result.
4. The method of claim 3, wherein the step of determining whether the accuracy of the sentence break result is greater than or equal to an accuracy threshold further comprises:
adding pause symbols to the voice data text;
adding pause information to the voice data text based on the pause symbol;
calculating the accuracy rate of the sentence-breaking result based on the pause information of the voice data text;
and judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold.
5. The method of claim 1, wherein the corpus is speech corpus, the method further comprising:
carrying out voice recognition on the voice test data to obtain a voice data text;
the step of word segmentation further comprises:
performing word segmentation on the voice data text to obtain a word corresponding to the voice data text;
the step of adding the characteristic information further includes:
adding pause symbols to the voice data text;
and adding pause information to the voice data text based on the pause symbol.
6. The method of claim 1, wherein the feature information further comprises: position information of the word and part-of-speech information of the word.
7. The method according to claim 1, wherein the corpus includes pause symbols for identifying pause information of the corpus, said step of adding pause information to the word comprises:
adding pause information to the word based on the pause symbol of the corpus.
8. The method of claim 7, wherein the step of adding pause information for the word further comprises:
marking pause information of the word in the training corpus that immediately precedes the pause symbol as a first mark;
and marking the pause information of other words as second marks.
9. The method of claim 1, wherein the step of tokenizing the corpus further comprises:
and performing word segmentation on the training corpus by using a word segmentation dictionary.
10. The method of claim 9, wherein the method further comprises:
and carrying out new word discovery on the training corpus, and adding the obtained new words into the word segmentation dictionary.
11. The method of claim 2 wherein said step of adjusting a feature information frequency threshold parameter and/or a fitting parameter of said conditional random field algorithm further comprises:
and adjusting the characteristic information frequency threshold parameter of the conditional random field algorithm within the numerical range of 1 to 5, and adjusting the fitting parameter of the conditional random field algorithm within the numerical range of 1 to 3.
12. The method of claim 2, wherein said step of using said text sentence-breaking model to make a sentence-breaking of test data further comprises:
performing sentence breaking on the test data by using the text sentence breaking model to obtain a plurality of primary sentence breaking results;
and respectively calculating the total sentence probability of each primary sentence-breaking result by using an n-gram language model obtained by training language data of standard sentence-breaking, and taking the corresponding primary sentence-breaking result with the highest total sentence probability as the sentence-breaking result.
13. The method of claim 12 wherein the total sentence probability of the primary sentence-breaking result is the product of the sentence-making probabilities of the clauses of the primary sentence-breaking result.
14. A method of sentence punctuation, the method comprising:
obtaining a text of a sentence to be broken;
inputting the text to be broken into a text sentence-breaking model to obtain a sentence-breaking result, wherein the text sentence-breaking model is obtained by training with the method for establishing a text sentence-breaking model according to any one of claims 1 to 13.
15. The method of claim 14, wherein the step of obtaining the text of the sentence to be punctuated further comprises:
acquiring voice data of a sentence to be punctuated;
and carrying out voice recognition on the voice data of the sentence to be punctuated, and taking a recognition result as the text of the sentence to be punctuated.
16. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs a method for establishing a text sentence-breaking model according to any one of claims 1 to 13.
17. A computer storage medium having stored thereon instructions, wherein the instructions when executed perform a method for establishing a text sentence-breaking model according to any one of claims 1 to 13.
18. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor performs a method of sentence-breaking according to claim 14 or 15.
19. A computer storage medium having stored thereon instructions that when executed perform a sentence-breaking method according to claim 14 or 15.
20. An apparatus for text sentence-breaking model building, the apparatus comprising:
the word segmentation module is used for segmenting words of a training corpus to obtain words corresponding to the training corpus;
the characteristic information adding module is used for adding characteristic information to the words, and the characteristic information comprises pause information;
the training module is used for training words corresponding to the training corpus based on the feature information of the words by using a conditional random field algorithm to obtain the text sentence-breaking model; when model training is performed, training is carried out word by word;
the characteristic information adding module is further used for:
adding sentence component information to the word by syntactic analysis of the word;
adding semantic information to the word, wherein the semantic information is obtained through the character representation of the word;
the training module is further to:
extracting the words and the feature information thereof corresponding to a preset feature template by using a conditional random field algorithm to train the training corpus to obtain the text sentence break model, wherein the preset feature template is used for representing the words and the feature information thereof which satisfy preset requirements in relation to the trained current words;
the relationship that the feature template represents with the trained current word includes any one or more of the following combinations of information: semantic information of the current word and pause information of the current word; part-of-speech information of the current word and pause information of the current word; the semantic information of the previous word, the pause information of the previous word, the semantic information of the current word and the pause information of the current word; semantic information of the current word, pause information of the current word, part of speech information of the next word and pause information of the next word; part of speech information of the previous word, pause information of the previous word, part of speech information of the current word and pause information of the current word; the part of speech information of the previous word, the pause information of the previous word, the part of speech information of the current word, the pause information of the current word, the part of speech information of the next word and the pause information of the next word.
21. The apparatus of claim 20, wherein the apparatus further comprises:
the test module is used for carrying out sentence breaking on test data by using the text sentence breaking model to obtain a sentence breaking result;
the accuracy judging module is used for judging whether the accuracy of the sentence breaking result is greater than or equal to an accuracy threshold value or not;
and the parameter adjusting module is used for adjusting the feature information frequency threshold parameter and/or the fitting parameter of the conditional random field algorithm when the accuracy is smaller than the accuracy threshold, until the accuracy of the sentence-breaking result, on the training corpus, of the text sentence-breaking model obtained by training after the feature information frequency threshold parameter and/or the fitting parameter is adjusted is greater than or equal to the accuracy threshold, and then taking the text sentence-breaking model obtained by training after the adjustment as the final text sentence-breaking model.
22. The apparatus of claim 21, wherein the test data is voice test data, the test module further comprising:
the voice recognition module is used for carrying out voice recognition on the voice test data to obtain a voice data text;
a sentence-breaking module, configured to perform sentence breaking on the voice data text by using the text sentence-breaking model to obtain a sentence-breaking result.
23. The apparatus of claim 22, wherein the accuracy determination module further comprises:
the pause symbol adding module is used for adding pause symbols for the voice data texts;
the pause information adding module is used for adding pause information to the voice data text based on the pause symbol;
the calculation module is used for calculating the accuracy of the sentence-breaking result based on the pause information of the voice data text;
and the judging module is used for judging whether the accuracy of the sentence-breaking result is greater than or equal to an accuracy threshold.
24. The apparatus of claim 20, wherein the corpus is speech corpus, the apparatus further comprising:
the voice recognition module is used for carrying out voice recognition on the voice test data to obtain a voice data text;
the word segmentation module is further used for carrying out word segmentation on the voice data text to obtain words corresponding to the voice data text;
the characteristic information adding module is further used for adding pause symbols to the voice data text; and adding pause information to the voice data text based on the pause symbol.
25. The apparatus of claim 20, wherein the characteristic information further comprises: position information of the word and part-of-speech information of the word.
26. The apparatus according to claim 20, wherein the corpus comprises pause symbols for identifying pause information of the corpus, the feature information adding module is further configured to: adding pause information to the word based on the pause symbol of the corpus.
27. The apparatus of claim 26, wherein the feature information addition module is further for:
marking pause information of the word in the training corpus that immediately precedes the pause symbol as a first mark;
and marking the pause information of other words as second marks.
28. The apparatus of claim 20, wherein the word segmentation module is further to:
and performing word segmentation on the training corpus by using a word segmentation dictionary.
29. The apparatus of claim 28, wherein the apparatus further comprises:
and the new word discovery module is used for discovering new words of the training corpus and adding the obtained new words into the word segmentation dictionary.
30. The apparatus of claim 21, wherein the parameter adjustment module is further used for:
adjusting the characteristic information frequency threshold parameter of the conditional random field algorithm within the range of 1 to 5, and adjusting the fitting parameter of the conditional random field algorithm within the range of 1 to 3.
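The parameter adjustment of claim 30 amounts to a small grid search over the two ranges. The sketch below uses a stand-in `evaluate` function in place of actually training a conditional random field (in CRF++ terms the two parameters correspond roughly to the `-f` frequency cutoff and `-c` option, though the patent names no toolkit).

```python
import itertools

def evaluate(freq_threshold, fit_param):
    """Stand-in for: train a CRF with these two parameters and measure
    sentence-breaking accuracy on held-out data.  A real system would
    call a CRF toolkit here instead of this toy scoring surface."""
    return -((freq_threshold - 2) ** 2 + (fit_param - 3) ** 2)

# Claim 30: frequency threshold in [1, 5], fitting parameter in [1, 3].
grid = itertools.product(range(1, 6), range(1, 4))
best = max(grid, key=lambda params: evaluate(*params))
print(best)  # (2, 3), the optimum of the toy surface
```

The parameter pair with the best held-out accuracy would be kept for the final text sentence-breaking model.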
31. The apparatus of claim 21, wherein the testing module is further used for:
performing sentence breaking on the test data by using the text sentence-breaking model to obtain a plurality of primary sentence-breaking results;
and calculating the total sentence probability of each primary sentence-breaking result by using an n-gram language model trained on language data with standard sentence breaks, and taking the primary sentence-breaking result with the highest total sentence probability as the sentence-breaking result.
32. The apparatus of claim 31, wherein the total sentence probability of a primary sentence-breaking result is the product of the sentence probabilities of its clauses.
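Claims 31 and 32 can be illustrated with a toy bigram model: each candidate sentence-breaking result is scored as the product of its clause probabilities, and the highest-scoring candidate is kept. The corpus, the add-one smoothing, and all names below are placeholder assumptions, not the patent's n-gram model.

```python
from collections import Counter
from math import prod

# Tiny stand-in for language data with standard sentence breaks; a
# real system would train a proper n-gram model with smoothing.
corpus = [["我", "想", "查", "话费"], ["请", "帮", "我", "查", "话费"]]
unigrams, bigrams = Counter(), Counter()
for clause in corpus:
    toks = ["<s>"] + clause + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks, toks[1:]))

def clause_prob(clause):
    """Bigram probability of one clause, with add-one smoothing as a
    placeholder for whatever smoothing a real model would use."""
    toks = ["<s>"] + clause + ["</s>"]
    p = 1.0
    for a, b in zip(toks, toks[1:]):
        p *= (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
    return p

def total_prob(candidate):
    # Claim 32: the total sentence probability is the product of the
    # probabilities of the candidate's clauses.
    return prod(clause_prob(c) for c in candidate)

cand_a = [["我", "想", "查", "话费"]]    # kept as one clause
cand_b = [["我", "想"], ["查", "话费"]]  # broken into two clauses
best = max([cand_a, cand_b], key=total_prob)  # cand_a on this toy model
```

The highest-scoring candidate is then returned as the sentence-breaking result, mirroring claim 31.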
33. An apparatus for sentence breaking, the apparatus comprising:
the text acquisition module is used for acquiring a text of the sentence to be broken;
and the sentence break module is used for inputting the text of the sentence to be broken into a text sentence break model to obtain a sentence break result, wherein the text sentence break model is obtained by training by adopting the method for establishing the text sentence break model according to any one of claims 1 to 13.
34. The apparatus of claim 33, wherein the text acquisition module comprises:
the voice acquisition unit is used for acquiring voice data of the sentence to be broken;
and the voice recognition unit is used for carrying out voice recognition on the voice data of the sentence to be broken and taking a recognition result as a text of the sentence to be broken.
CN201710458179.9A 2017-06-16 2017-06-16 Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment Active CN107247706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710458179.9A CN107247706B (en) 2017-06-16 2017-06-16 Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment


Publications (2)

Publication Number Publication Date
CN107247706A CN107247706A (en) 2017-10-13
CN107247706B true CN107247706B (en) 2021-06-25

Family

ID=60018228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710458179.9A Active CN107247706B (en) 2017-06-16 2017-06-16 Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN107247706B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844480B (en) * 2017-10-21 2021-04-30 科大讯飞股份有限公司 Method and system for converting written text into spoken text
CN109979435B (en) * 2017-12-28 2021-10-22 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110209807A (en) 2018-07-03 2019-09-06 腾讯科技(深圳)有限公司 A kind of method of event recognition, the method for model training, equipment and storage medium
CN108831481A (en) * 2018-08-01 2018-11-16 平安科技(深圳)有限公司 Symbol adding method, device, computer equipment and storage medium in speech recognition
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing
CN111160004B (en) * 2018-11-07 2023-06-27 北京猎户星空科技有限公司 Method and device for establishing sentence-breaking model
CN109461438B (en) * 2018-12-19 2022-06-14 合肥讯飞数码科技有限公司 Voice recognition method, device, equipment and storage medium
CN109684638B (en) * 2018-12-24 2023-08-11 北京金山安全软件有限公司 Clause method and device, electronic equipment and computer readable storage medium
CN109783648B (en) * 2018-12-28 2020-12-29 北京声智科技有限公司 Method for improving ASR language model by using ASR recognition result
CN109637537B (en) * 2018-12-28 2020-06-30 北京声智科技有限公司 Method for automatically acquiring annotated data to optimize user-defined awakening model
CN110209446B (en) * 2019-04-23 2021-10-01 华为技术有限公司 Method and device for configuring combined slot in man-machine conversation system
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110619868B (en) * 2019-08-29 2021-12-17 深圳市优必选科技股份有限公司 Voice assistant optimization method, voice assistant optimization device and intelligent equipment
CN110705254B (en) * 2019-09-27 2023-04-07 科大讯飞股份有限公司 Text sentence-breaking method and device, electronic equipment and storage medium
CN111259163A (en) * 2020-01-14 2020-06-09 北京明略软件系统有限公司 Knowledge graph generation method and device and computer readable storage medium
CN111753524A (en) * 2020-07-01 2020-10-09 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN112002328B (en) * 2020-08-10 2024-04-16 中央广播电视总台 Subtitle generation method and device, computer storage medium and electronic equipment
CN112307167A (en) * 2020-10-30 2021-02-02 广州华多网络科技有限公司 Text sentence cutting method and device, computer equipment and storage medium
CN114613357A (en) * 2020-12-04 2022-06-10 广东博智林机器人有限公司 Voice processing method, system, electronic device and storage medium
CN112786023B (en) * 2020-12-23 2024-07-02 竹间智能科技(上海)有限公司 Mark model construction method and voice broadcasting system
CN113970910B (en) * 2021-09-30 2024-03-19 中国电子技术标准化研究院 Digital twin equipment construction method and system
CN115579009B (en) * 2022-12-06 2023-04-07 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020034A (en) * 2011-09-26 2013-04-03 北京大学 Chinese words segmentation method and device
CN103077164A (en) * 2012-12-27 2013-05-01 新浪网技术(中国)有限公司 Text analysis method and text analyzer
CN105244022A (en) * 2015-09-28 2016-01-13 科大讯飞股份有限公司 Audio and video subtitle generation method and apparatus
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on voice identification
CN106331893A (en) * 2016-08-31 2017-01-11 科大讯飞股份有限公司 Real-time subtitle display method and system
US9645988B1 (en) * 2016-08-25 2017-05-09 Kira Inc. System and method for identifying passages in electronic documents

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9092424B2 (en) * 2009-09-30 2015-07-28 Microsoft Technology Licensing, Llc Webpage entity extraction through joint understanding of page structures and sentences
CN104750687B (en) * 2013-12-25 2018-03-20 株式会社东芝 Improve method and device, machine translation method and the device of bilingualism corpora
CN104598510A (en) * 2014-10-16 2015-05-06 苏州大学 Event trigger word recognition method and device
CN105718586B (en) * 2016-01-26 2018-12-28 中国人民解放军国防科学技术大学 The method and device of participle
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Kaixu et al.; "A CRF-based Method for Automatic Sentence Segmentation and Punctuation of Ancient Chinese"; Journal of Tsinghua University (Science and Technology); Oct. 30, 2009; Vol. 49, No. 10; pp. 1733-1736 *


Similar Documents

Publication Publication Date Title
CN107247706B (en) Text sentence-breaking model establishing method, sentence-breaking method, device and computer equipment
CN109145282B (en) Sentence-breaking model training method, sentence-breaking device and computer equipment
Tabassum et al. A survey on text pre-processing & feature extraction techniques in natural language processing
JP5901001B1 (en) Method and device for acoustic language model training
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
US9575955B2 (en) Method of detecting grammatical error, error detecting apparatus for the method, and computer-readable recording medium storing the method
US10741092B1 (en) Application of high-dimensional linguistic and semantic feature vectors in automated scoring of examination responses
CN104484322A (en) Methods and systems for automated text correction
WO2021208460A1 (en) Sentence completion method and device, and readable storage medium
CN111104803B (en) Semantic understanding processing method, device, equipment and readable storage medium
US20240028650A1 (en) Method, apparatus, and computer-readable medium for determining a data domain associated with data
Dien et al. POS-tagger for English-Vietnamese bilingual corpus
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN115017870A (en) Closed-loop dialect expanding writing method and device, computer equipment and storage medium
KR102206781B1 (en) Method of fake news evaluation based on knowledge-based inference, recording medium and apparatus for performing the method
Chennoufi et al. Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization
CN109977391B (en) Information extraction method and device for text data
CN112487813B (en) Named entity recognition method and system, electronic equipment and storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN112733517B (en) Method for checking requirement template conformity, electronic equipment and storage medium
Hahn et al. Optimizing CRFs for SLU tasks in various languages using modified training criteria
CN114707489B (en) Method and device for acquiring annotation data set, electronic equipment and storage medium
Hamza et al. Identification of sentence context based on thematic role rules for Malay short essay assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant