CN111160004A - Method and device for establishing sentence-breaking model - Google Patents


Info

Publication number
CN111160004A
CN111160004A (application CN201811320993.5A; granted publication CN111160004B)
Authority
CN
China
Prior art keywords
sentence
word
corpus
deep learning
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811320993.5A
Other languages
Chinese (zh)
Other versions
CN111160004B (en)
Inventor
李晓普
王阳阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN201811320993.5A
Publication of CN111160004A
Application granted
Publication of CN111160004B
Legal status: Active (granted)

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for establishing a sentence-break model. The method comprises: performing word segmentation on each acquired corpus sentence to determine the words it contains; identifying rare words among those words and segmenting each rare word with a sub-word segmentation algorithm; inputting the word sequence formed by the words obtained after word segmentation and sub-word segmentation into a deep learning model for sentence-break labeling; and adjusting the parameters of the deep learning model according to the original sentence-break identifiers of each corpus sentence and the sentence-break labels the model outputs for that sentence, thereby establishing the sentence-break model. Sentences without break identifiers can then be processed with the established model, so a sentence without break identifiers is never excessively long; the readability and understandability of sentences improve, and so does the user experience.

Description

Method and device for establishing sentence-breaking model
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for establishing a sentence-break model.
Background
In recent years, with the rapid development of voice recognition technology, the application fields of voice recognition are increasing, such as sending voice messages, voice memos, simultaneous interpretation and the like.
However, a sentence obtained by recognizing speech information (whether its meaning is complete or not) carries no sentence-break identifiers, which hinders reading and understanding; when the speech is long and the recognized sentence contains many characters, reading becomes laborious. Researchers have therefore begun studying how to break sentences that lack break identifiers, so as to improve the user experience.
Disclosure of Invention
The embodiment of the application provides a method and a device for establishing a deep learning model for sentence break, which are used for carrying out sentence break on sentences without sentence break marks and improving user experience.
In a first aspect, a method for establishing a sentence-break model provided in an embodiment of the present application includes:
performing word segmentation processing on each acquired corpus sentence, and determining words contained in the corpus sentence;
determining rare words in the words contained in the corpus sentences, and segmenting the rare words by utilizing a sub-word segmentation algorithm;
inputting a word sequence formed by the words obtained after word segmentation and segmentation into a deep learning model for sentence segmentation and annotation;
and adjusting parameters of the deep learning model according to the original sentence break identification of each corpus sentence and the sentence break label corresponding to the corpus sentence output by the deep learning model, and establishing a sentence break model.
By adopting this scheme, once the sentence-break model is established, sentences without break identifiers can be broken with it, so the user no longer sees a very long sentence with no break identifiers; the readability and intelligibility of sentences improve, and so does the user experience.
In one possible embodiment, the corpus sentences are obtained according to the following steps:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence break marks;
splicing partial or all sample sentences;
and segmenting each spliced sample sentence, and determining the segmented sample sentence as the corpus sentence.
In one possible implementation, segmenting each spliced sample sentence includes:
dividing each spliced sample sentence according to a set step length; or
And randomly segmenting each spliced sample sentence.
In one possible implementation, the subword segmentation algorithm is a byte pair encoding BPE algorithm.
In one possible implementation, the deep learning model is controlled to label each word in the sequence of words according to the following steps:
analyzing the context information of the word in the word sequence;
determining a first probability of marking the punctuation mark of the word and a second probability of marking the non-punctuation mark of the word according to the context information of the word;
and selecting, from the first probability and the second probability, the mark corresponding to the larger probability to label the word.
In a possible implementation manner, after determining the first probability of marking the punctuation mark of the word and the second probability of marking the non-punctuation mark of the word according to the context information of the word, the deep learning model is further controlled to execute:
adjusting the first probability and the second probability according to the labeling condition of each labeled word in the word sequence; and
selecting, from the first probability and the second probability, the mark corresponding to the larger probability to label the word, which comprises:
selecting, from the adjusted first probability and the adjusted second probability, the mark corresponding to the larger probability to label the word.
In a possible implementation manner, adjusting parameters of the deep learning model according to an original sentence break identifier of each corpus sentence and a sentence break label corresponding to the corpus sentence output by the deep learning model includes:
and comparing the position of the original sentence break identifier of each corpus sentence with the position of the sentence break label corresponding to the corpus sentence output by the deep learning model, and adjusting the parameters of the deep learning model so that the position of the sentence break label corresponding to the corpus sentence output by the deep learning model after adjustment is the same as the position of the original sentence break identifier of the corpus sentence.
In a possible implementation, after adjusting the parameters of the deep learning model, the method further includes:
inputting at least one test sentence into the adjusted deep learning model for sentence segmentation marking;
determining the marking accuracy rate of the adjusted deep learning model according to the original sentence break identification of the test sentence and the sentence break marking corresponding to the test sentence output by the adjusted deep learning model;
if the determined marking accuracy is greater than or equal to a preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined marking accuracy is smaller than the preset accuracy, training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence break model.
In a possible implementation, after the sentence break model is established, the method further includes:
carrying out sentence-breaking processing on an input character sequence by using the sentence-breaking model, wherein the character sequence is obtained by carrying out voice recognition processing on a collected voice signal;
and outputting the character sequence after sentence break processing.
In a second aspect, an apparatus for establishing a sentence-break model provided in an embodiment of the present application includes:
the preprocessing module is used for carrying out word segmentation processing on each acquired corpus sentence and determining words contained in the corpus sentence; determining rare words in the words contained in the corpus sentences, and segmenting the rare words by utilizing a sub-word segmentation algorithm;
the marking module is used for inputting a word sequence formed by the words obtained after the word segmentation processing and the segmentation processing into the deep learning model for sentence breaking and marking;
and the adjusting module is used for adjusting parameters of the deep learning model according to the original sentence break identification of each corpus sentence and the sentence break label corresponding to the corpus sentence output by the deep learning model, and establishing a sentence break model.
In a possible implementation manner, the preprocessing module is specifically configured to obtain the corpus sentences according to the following steps:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence break marks;
splicing partial or all sample sentences;
and segmenting each spliced sample sentence, and determining the segmented sample sentence as the corpus sentence.
In a possible implementation, the preprocessing module is specifically configured to:
dividing each spliced sample sentence according to a set step length; or
And randomly segmenting each spliced sample sentence.
In one possible implementation, the subword segmentation algorithm is a byte pair encoding BPE algorithm.
In a possible implementation, the labeling module is specifically configured to control the deep learning model to label each word in the word sequence according to the following steps:
analyzing the context information of the word in the word sequence;
determining a first probability of marking the punctuation mark of the word and a second probability of marking the non-punctuation mark of the word according to the context information of the word;
and selecting, from the first probability and the second probability, the mark corresponding to the larger probability to label the word.
In a possible implementation, the annotation module further controls the deep learning model to perform:
after determining a first probability of marking a punctuation mark on the word and a second probability of marking a non-punctuation mark on the word according to the context information of the word, adjusting the first probability and the second probability according to the marking condition of each marked word in the word sequence;
and selecting, from the adjusted first probability and the adjusted second probability, the mark corresponding to the larger probability to label the word.
In a possible implementation, the adjusting module is specifically configured to:
and comparing the position of the original sentence break identifier of each corpus sentence with the position of the sentence break label corresponding to the corpus sentence output by the deep learning model, and adjusting the parameters of the deep learning model so that the position of the sentence break label corresponding to the corpus sentence output by the deep learning model after adjustment is the same as the position of the original sentence break identifier of the corpus sentence.
In a possible implementation, the system further includes a test module, configured to:
after adjusting the parameters of the deep learning model, inputting at least one test sentence into the adjusted deep learning model for sentence segmentation marking;
determining the marking accuracy rate of the adjusted deep learning model according to the original sentence break identification of the test sentence and the sentence break marking corresponding to the test sentence output by the adjusted deep learning model;
if the determined marking accuracy is greater than or equal to a preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined marking accuracy is smaller than the preset accuracy, triggering the preprocessing module, the marking module and the adjusting module, and training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence breaking model.
In a possible implementation, the system further includes a sentence-breaking module, configured to:
after a sentence-breaking model is established, sentence-breaking processing is carried out on an input character sequence by utilizing the sentence-breaking model, wherein the character sequence is obtained by carrying out voice recognition processing on a collected voice signal;
and outputting the character sequence after sentence break processing.
In a third aspect, an electronic device provided in an embodiment of the present application includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above method for building a deep learning model for sentence break.
In a fourth aspect, a computer-readable medium is provided in an embodiment of the present application, and stores computer-executable instructions, where the computer-executable instructions are used to execute the above method for building a deep learning model for sentence break.
In addition, for technical effects brought by any one of the design manners in the second aspect to the fourth aspect, reference may be made to technical effects brought by different implementation manners in the first aspect, and details are not described here.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic structural diagram of a computing device of a method for establishing a sentence break model according to an embodiment of the present application;
fig. 2 is a flowchart of a method for establishing a sentence break model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a device for establishing a sentence break model according to an embodiment of the present application.
Detailed Description
In order to perform sentence breaking on a sentence without a sentence breaking mark and improve user experience, the embodiment of the application provides a method and a device for establishing a sentence breaking model.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application, and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
To facilitate understanding of the present application, the following technical terms are first explained:
The sentence-break mark represents a symbol that breaks a sentence. It need not be a punctuation mark, e.g. "|" or "/", but may also be one, e.g. ",", "。" or "?".
The non-sentence-break mark represents a symbol that does not break a sentence and can be specified according to actual requirements, such as a space or a tab.
A word represents a phrase with a certain semantics and contains a variable number of characters, which may be one, two, three or more; for example, "I", "want" and "go to school" are each single words.
The method provided by the present application can be applied to various computing devices, and fig. 1 shows a schematic structural diagram of a computing device, where the computing device 10 shown in fig. 1 is only an example, and does not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in FIG. 1, computing device 10 is embodied in a general purpose computing apparatus, and the components of computing device 10 may include, but are not limited to: at least one processing unit 101, at least one memory unit 102, and a bus 103 that couples various system components including the memory unit 102 and the processing unit 101.
Bus 103 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 102 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)1021 and/or cache memory 1022, and may further include Read Only Memory (ROM) 1023.
Storage unit 102 may also include a program/utility 1025 having a set (at least one) of program modules 1024, such program modules 1024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 10 may also communicate with one or more external devices 104 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with computing device 10, and/or with any devices (e.g., router, modem, etc.) that enable computing device 10 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 105. Moreover, computing device 10 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 106. As shown in FIG. 1, network adapter 106 communicates with other modules for computing device 10 via bus 103. It should be understood that although not shown in FIG. 1, other hardware and/or software modules may be used in conjunction with computing device 10, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Those skilled in the art will appreciate that FIG. 1 is merely exemplary of a computing device and is not intended to be limiting and may include more or less components than those shown, or some components may be combined, or different components.
The sentence-break model established in the embodiment of the present application can be applied to any scenario that requires sentence breaking, such as interpretation or putting recognized text on screen in real time. Referring to fig. 2, a schematic flow chart of the method for establishing the sentence-break model provided in the embodiment of the present application, and taking the method as applied to the computing device 10 shown in fig. 1 as an example, the specific implementation flow is as follows:
s201: and acquiring a preset number of sample sentences, wherein the end of each sample sentence is provided with a sentence break identifier.
Here, the sample sentences may be independent of each other or may have a relationship.
S202: and splicing part or all of the sample sentences, segmenting each spliced sample sentence, and determining the segmented sample sentences as the corpus sentences.
When speech data must be recognized in real time, a complete sentence may be formed only by part of the currently recognized character sequence together with the previously recognized one; if such character sequences are broken, a break mark is likely to fall in the middle of a character sequence.
Therefore, after a preset number of sample sentences with break marks at the sentence end are obtained, some or all of them can be spliced, and each spliced sample sentence segmented, for example at a set step length or at random positions. The segmented sample sentences are then used as the corpus sentences for establishing the sentence-break model. This reduces the proportion of break marks at sentence ends and raises the proportion within sentences, better fitting the real-time scenario, so that when the established deep learning model is later applied to that scenario, its sentence-break accuracy is higher.
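The splicing-and-recutting of S202 can be sketched in Python. This is a minimal illustration only; the function name, the break mark "/", and the cut lengths are assumptions, not part of the patent:

```python
import random

def build_corpus_sentences(samples, step=None, seed=0):
    """Splice sample sentences (each ending with a break mark) into one
    stream, then cut the stream so break marks fall mid-sentence."""
    rng = random.Random(seed)
    stream = "".join(samples)             # splice some or all samples
    pieces, i = [], 0
    while i < len(stream):
        # fixed step length, or a random cut length when no step is given
        n = step if step is not None else rng.randint(4, 12)
        pieces.append(stream[i:i + n])
        i += n
    return pieces

samples = ["i want to go to school/", "the weather is nice today/"]
corpus = build_corpus_sentences(samples, step=10)
```

After recutting, break marks can appear anywhere inside a piece rather than only at its end, which is exactly the property this step aims for.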
S203: and performing word segmentation on each corpus sentence, determining words contained in the corpus sentence, and performing word segmentation on each rare word again by using a subword segmentation algorithm if the rare words contained in the corpus sentence are determined.
The rare words refer to words with low occurrence frequency in the corpus sentences, for example, words with occurrence frequency less than the set number.
In specific implementation, word segmentation is usually performed with existing segmentation tools. After determining the words contained in each corpus sentence, such tools can also report which words are rare. If rare words are found among the words of a corpus sentence, a sub-word segmentation algorithm can be used to segment each rare word again, for example the Byte Pair Encoding (BPE) algorithm; this may also be called BPE processing.
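To illustrate the BPE idea, here is a simplified sketch of the classic merge-learning loop, not the patent's implementation; the function name and the word list are invented for the example:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a word list: each word starts as a tuple of
    characters, and the most frequent adjacent symbol pair is merged
    repeatedly, yielding sub-word units for rare words."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs across the current vocabulary
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the chosen merge to every word
        new_vocab = Counter()
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```

Rare words are then represented by the learned sub-word units instead of whole-word entries, which shrinks the vocabulary the model must handle.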
S204: and inputting a word sequence formed by words obtained after the word segmentation and the segmentation of each corpus sentence into the deep learning model for sentence segmentation and labeling.
Here, each corpus sentence is subjected to word segmentation and segmentation to obtain a plurality of words, and a word sequence of the corpus sentence can be formed according to a position of each word in the corpus sentence.
For example, if the corpus sentence is "i want to go to school", and the words obtained after the corpus sentence is subjected to word segmentation and segmentation are "i", "go to school" and "want", the word sequence finally formed according to the appearance position of each word in the corpus sentence is { i, want, go to school }.
In specific implementation, after the word sequence formed from each corpus sentence is input into the deep learning model, the model can, for each word in the sequence, analyze the word's context information within the sequence, determine from that context a first probability of labeling the word with a break mark and a second probability of labeling it with a non-break mark, and then label the word with whichever mark has the larger probability, for example by appending that mark after the word. After all words are labeled, the labeled corpus sentence is output.
Optionally, there may be only one type of break mark used to label the corpus sentences, such as "/", or several, such as ",", "。" and "?". The process below is described using a single break mark as an example.
For example, suppose the word sequence formed from a corpus sentence is {word 1, word 2, word 3, word 4, word 5}, where "/" is the break mark and "□" the non-break mark. If the probability of adding "□" after word 1 is 0.7 and of adding "/" is 0.3; after word 2, 0.4 for "□" and 0.6 for "/"; after word 3, 0.6 for "□" and 0.4 for "/"; after word 4, 0.6 for "□" and 0.4 for "/"; and after word 5, 0.6 for "□" and 0.4 for "/", then the labeling result for the corpus sentence is: word 1 □, word 2 /, word 3 □, word 4 □, word 5 □, and the labeled corpus sentence can then be output.
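The per-word decision in this example reduces to an argmax over the two probabilities. A minimal sketch, reusing the "□"/"/" marks from the example (the function and variable names are invented):

```python
def label_words(word_probs, break_mark="/", non_break="□"):
    """word_probs: list of (word, p_non_break, p_break) triples.
    Append whichever mark has the higher probability after each word."""
    out = []
    for word, p_nb, p_b in word_probs:
        out.append(word + (break_mark if p_b > p_nb else non_break))
    return "".join(out)

# the probabilities from the worked example above
probs = [("w1", 0.7, 0.3), ("w2", 0.4, 0.6), ("w3", 0.6, 0.4),
         ("w4", 0.6, 0.4), ("w5", 0.6, 0.4)]
```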
In specific implementation, to make the labeling of each word in a corpus sentence more accurate, after determining the first probability (break mark) and the second probability (non-break mark) for the word from its context information, the two probabilities can be adjusted according to the labels of the already-labeled words in the word sequence, and the word is then labeled with whichever adjusted probability is larger.
For example, suppose the word sequence formed from a corpus sentence is {word 1, word 2, word 3, word 4, word 5}, where "□" has already been added after word 1 and "/" after word 2. Take word 3 as an example: after its break and non-break probabilities are determined, the labels already given to word 1 and word 2 ("□" and "/") can be analyzed. Since word 2, which immediately precedes word 3, already carries a break mark, a break mark right after word 3 is unlikely; that is, "□" is the more probable label for word 3. If the determined probability of labeling word 3 with the non-break mark "□" is somewhat low, say 0.6, it can be raised appropriately while the probability of the break mark "/" is lowered. By combining the labels of the already-labeled words in the word sequence in this way, the break marks added over the whole sequence match the actual situation better, further improving sentence-break accuracy.
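This adjustment can be sketched as a tiny transition-aware reweighting. The damping constant and the function name are purely illustrative; in the patent this role is played by the model itself rather than a fixed rule:

```python
def adjust_with_context(p_break, p_non_break, prev_label, penalty=0.3):
    """If the previous word already received a break mark "/", damp the
    break probability of the current word and renormalise the pair."""
    if prev_label == "/":
        p_break *= (1.0 - penalty)   # a break right after a break is unlikely
    total = p_break + p_non_break
    return p_break / total, p_non_break / total
```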
S205: and adjusting parameters of the deep learning model according to the original sentence break identification of each corpus sentence and the sentence break label corresponding to the corpus sentence output by the deep learning model.
In specific implementation, for each corpus sentence, whether the position of the original sentence break identifier of the corpus sentence is the same as the position of the sentence break label corresponding to the corpus sentence output by the deep learning model may be compared, and if not, the parameters of the deep learning model may be adjusted so that the position of the sentence break label corresponding to the corpus sentence output by the adjusted deep learning model is the same as the position of the original sentence break identifier of the corpus sentence.
For example, a loss function may be computed that measures the deviation between the position of the original break identifier of the corpus sentence and the position of the break label output by the deep learning model, and a gradient descent algorithm may then be used to adjust the model parameters so as to reduce this loss; adjustment continues until the position of the break label output by the adjusted model is the same as the position of the original break identifier of the corpus sentence.
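A toy stand-in for this position comparison, counting positions where the predicted break marks disagree with the original identifiers (illustrative only; the patent does not specify the loss form):

```python
def break_positions(tagged, break_mark="/"):
    """Indices of words that are followed by a break mark, given tagged
    tokens such as ["w1□", "w2/", "w3□"]."""
    return {i for i, w in enumerate(tagged) if w.endswith(break_mark)}

def position_loss(pred, gold):
    """Number of word positions where predicted and original break
    marks disagree; zero means the positions coincide exactly."""
    return len(break_positions(pred) ^ break_positions(gold))
```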
S206: and testing the adjusted deep learning model by using the test sentences, and determining the marking accuracy of the deep learning model according to the test result.
Wherein the test sentence is a sentence for which the sentence break identification position is known.
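The labeling accuracy of S206 can be computed per word position. A minimal sketch, under the assumption that accuracy is the fraction of positions whose break/non-break decision matches the known identifiers (the patent does not fix the exact metric):

```python
def labeling_accuracy(pred_breaks, gold_breaks, num_words):
    """Fraction of word positions whose break/non-break decision matches
    the test sentence's known break identifier positions."""
    wrong = len(set(pred_breaks) ^ set(gold_breaks))
    return (num_words - wrong) / num_words
```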
S207: judging whether the labeling accuracy is smaller than the preset accuracy; if so, the process proceeds to S208; if not, the process proceeds to S209.
S208: and training the adjusted deep learning model according to at least one new corpus sentence, taking the trained deep learning model as the new adjusted deep learning model, and returning to the S206.
Wherein the new corpus sentences are newly added corpus sentences, different from the corpus sentences previously used in training the sentence break model.
S209: and taking the adjusted deep learning model as the established sentence-breaking model.
S210: and performing sentence-breaking processing on the input character sequence by using the established sentence-breaking model, and outputting the character sequence after sentence-breaking processing.
The input character sequence is obtained by carrying out voice recognition processing on the collected voice signals.
Specifically, word segmentation can be performed on the input character sequence; if the resulting words include rare words, each rare word is again segmented with the sub-word segmentation algorithm. The word sequence formed by the words obtained after word segmentation and sub-word segmentation is then input into the deep learning model for sentence-break labeling, and the model outputs the character sequence after sentence-break processing.
In specific implementation, the character sequence output by the deep learning model carries several kinds of label information, such as break marks and non-break marks, and, if rare words are present, BPE labels as well. Therefore, after the sentence-broken character sequence is obtained from the model, the non-break marks in it can be filtered out, reverse word segmentation and reverse BPE processing applied, and the final sentence-broken character sequence displayed to the user.
The sentence-break model provided in this embodiment determines, for each word in a corpus sentence, the probabilities of labeling it with a break mark or a non-break mark from the word's context information, and can adjust these probabilities according to the labels of the already-labeled words in the corresponding word sequence before labeling the word with the higher-probability mark. This way of breaking sentences closely follows the characteristics of natural semantics, making the resulting breaks more reasonable.
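The display-side cleanup (filter non-break marks, undo BPE) might look like the sketch below; the "@@" continuation convention and the direct concatenation of tokens (natural for Chinese text) are assumptions for illustration, not the patent's stated format:

```python
def postprocess(tagged_tokens, non_break="□", break_mark="/", bpe_joiner="@@"):
    """Strip non-break marks, re-join BPE sub-word pieces, and keep
    break marks, producing the user-facing character sequence."""
    text, buf = [], ""
    for tok in tagged_tokens:
        mark = ""
        if tok.endswith(non_break):
            tok = tok[:-len(non_break)]          # drop non-break mark
        elif tok.endswith(break_mark):
            tok, mark = tok[:-len(break_mark)], break_mark
        if tok.endswith(bpe_joiner):             # BPE continuation piece
            buf += tok[:-len(bpe_joiner)]
            continue
        text.append(buf + tok + mark)
        buf = ""
    return "".join(text)
```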
In the embodiments of the present invention, the language in which sentence breaking can be performed depends on the language of the sample sentences used to build the sentence-breaking model: for example, a model built from English sample sentences can break English sentences, and a model built from Chinese sample sentences can break Chinese sentences, so the approach also has relatively good versatility.
In addition, the embodiments of the application also provide a network structure for the deep learning model: embedding -> bilstm -> softmax -> crf, where the arrows indicate the order of the layers. The embedding layer encodes the semantics of each word in the word sequence formed from a corpus sentence; the bilstm layer analyzes the context semantics of each word according to the semantic codes of the several words before and after it in the word sequence; the softmax layer determines, from each word's context semantics, the probabilities of labeling the word with the sentence-break mark and the non-sentence-break mark; and the crf layer adjusts the probabilities of the sentence-break mark and the non-sentence-break mark for the current word according to the labels already assigned to the preceding words in the word sequence, labels the word with whichever of the two marks has the higher adjusted probability, and then outputs the final sentence-break labeling result.
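The interaction of the softmax and crf stages can be illustrated with a minimal two-label Viterbi decode. The label names, probabilities, and transition scores below are illustrative assumptions, not values from the patent: per-word softmax probabilities are adjusted by transition scores that depend on the previous word's label, and the best overall label sequence is selected.

```python
import math

# Minimal sketch of crf-style adjustment over two labels:
# "BRK" = sentence-break mark, "O" = non-sentence-break mark.
LABELS = ("O", "BRK")

def viterbi(emissions, transition):
    """emissions: one {label: probability} dict per word (softmax output).
    transition: {(previous_label, label): additive log-space score}.
    Returns the highest-scoring label sequence."""
    best = {lab: (math.log(emissions[0][lab]), [lab]) for lab in LABELS}
    for probs in emissions[1:]:
        new_best = {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[p][0] + transition[(p, lab)])
            score = best[prev][0] + transition[(prev, lab)] + math.log(probs[lab])
            new_best[lab] = (score, best[prev][1] + [lab])
        best = new_best
    end = max(LABELS, key=lambda lab: best[lab][0])
    return best[end][1]

# Discourage two sentence breaks in a row; all other transitions are neutral.
TRANSITION = {(p, l): (-2.0 if p == l == "BRK" else 0.0)
              for p in LABELS for l in LABELS}

emissions = [
    {"O": 0.9, "BRK": 0.1},
    {"O": 0.4, "BRK": 0.6},    # softmax alone would label this word BRK ...
    {"O": 0.45, "BRK": 0.55},  # ... and this one too
    {"O": 0.9, "BRK": 0.1},
]
```

Decoding these emissions yields only one break: the transition penalty overturns the softmax's second, weaker break, which is exactly the "adjust according to already-labeled words" behavior the description attributes to the crf layer.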
When the method provided in the embodiments of the present application is implemented in software or hardware or a combination of software and hardware, a plurality of functional modules may be included in the computing device 10, and each functional module may include software, hardware or a combination of software and hardware. Specifically, referring to fig. 3, a schematic structural diagram of the sentence segmentation model establishing apparatus 30 provided in the embodiment of the present application includes a preprocessing module 301, a labeling module 302, and an adjusting module 303.
The preprocessing module 301 is configured to perform word segmentation processing on each obtained corpus sentence, and determine words included in the corpus sentence; determining rare words in the words contained in the corpus sentences, and segmenting the rare words by utilizing a sub-word segmentation algorithm;
a labeling module 302, configured to input a word sequence formed by words obtained after the word segmentation processing and the segmentation processing into the deep learning model for sentence segmentation labeling;
and the adjusting module 303 is configured to adjust parameters of the deep learning model according to the original sentence break identifier of each corpus sentence and the sentence break label corresponding to the corpus sentence output by the deep learning model, and establish a sentence break model.
In a possible implementation manner, the preprocessing module 301 is specifically configured to obtain a corpus sentence according to the following steps:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence break marks;
splicing partial or all sample sentences;
and segmenting each spliced sample sentence, and determining the segmented sample sentence as the corpus sentence.
In a possible implementation manner, the preprocessing module 301 is specifically configured to:
dividing each spliced sample sentence according to a set step length; or
randomly segmenting each spliced sample sentence.
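The corpus-construction steps handled by the preprocessing module can be sketched as follows. The function name, the `<BRK>` marker, and the cut-count heuristic are assumptions: each sample sentence gets a sentence-break identifier at its end, the samples are spliced together, and the spliced text is cut either by a set step length or at random positions, so corpus sentences need not align with sentence boundaries.

```python
import random

# Hypothetical corpus construction: splice samples (each ending in an
# original sentence-break identifier), then cut by step length or at random.

def build_corpus(samples, step=None, seed=0):
    spliced = []
    for s in samples:
        spliced.extend(s.split())
        spliced.append("<BRK>")  # original sentence-break identifier
    if step is not None:         # segment by a set step length
        return [spliced[i:i + step] for i in range(0, len(spliced), step)]
    rng = random.Random(seed)    # or segment at random positions
    cuts = sorted(rng.sample(range(1, len(spliced)),
                             k=max(1, len(spliced) // 8)))
    pieces, prev = [], 0
    for c in cuts + [len(spliced)]:
        pieces.append(spliced[prev:c])
        prev = c
    return pieces
```

Either way, every token of the spliced text lands in exactly one corpus sentence, and the `<BRK>` markers supply the supervision signal for training.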
In one possible implementation, the subword segmentation algorithm is a byte pair encoding BPE algorithm.
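The BPE algorithm named above can be illustrated with a toy merge-learning loop. This is a sketch of the classic algorithm only, not the patent's implementation; real BPE tooling works over a full corpus vocabulary with end-of-word markers.

```python
from collections import Counter

# Toy byte pair encoding: repeatedly merge the most frequent adjacent symbol
# pair, so rare words can later be segmented into frequently seen sub-words.

def learn_bpe(words, num_merges):
    vocab = Counter(tuple(w) for w in words)  # each word as a symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Applying the learned merge list to a new rare word (in learned order) reproduces the segmentation step; the more merges are learned, the coarser the sub-words become.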
In a possible implementation, the labeling module 302 is specifically configured to control the deep learning model to label each word in the word sequence according to the following steps:
analyzing the context information of the word in the word sequence;
determining a first probability of marking the punctuation mark of the word and a second probability of marking the non-punctuation mark of the word according to the context information of the word;
and selecting the identifier with the larger of the first probability and the second probability to label the word.
In a possible implementation, the labeling module 302 further controls the deep learning model to:
after determining a first probability of marking a punctuation mark on the word and a second probability of marking a non-punctuation mark on the word according to the context information of the word, adjusting the first probability and the second probability according to the marking condition of each marked word in the word sequence;
and selecting the identifier with the larger of the adjusted first probability and the adjusted second probability to label the word.
In a possible implementation manner, the adjusting module 303 is specifically configured to:
and comparing the position of the original sentence break identifier of each corpus sentence with the position of the sentence break label corresponding to the corpus sentence output by the deep learning model, and adjusting the parameters of the deep learning model so that the position of the sentence break label corresponding to the corpus sentence output by the deep learning model after adjustment is the same as the position of the original sentence break identifier of the corpus sentence.
In a possible implementation, the apparatus further includes a testing module 304, configured to:
after adjusting the parameters of the deep learning model, inputting at least one test sentence into the adjusted deep learning model for sentence segmentation marking;
determining the marking accuracy rate of the adjusted deep learning model according to the original sentence break identification of the test sentence and the sentence break marking corresponding to the test sentence output by the adjusted deep learning model;
if the determined marking accuracy is greater than or equal to a preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined marking accuracy is smaller than the preset accuracy, triggering the preprocessing module, the marking module and the adjusting module, and training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence breaking model.
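The evaluation loop run by the testing module can be sketched as below. The function names, the per-position accuracy metric, and the 0.95 threshold are illustrative assumptions: the adjusted model's sentence-break labels for test sentences are compared against the original sentence-break identifiers, and the model is accepted only if the labeling accuracy meets the preset value.

```python
# Hypothetical evaluation of the adjusted model against labeled test sentences.

def labeling_accuracy(test_sentences, model):
    """test_sentences: list of (word_sequence, gold_labels) pairs; model maps
    a word sequence to predicted labels. Accuracy is per label position."""
    correct = total = 0
    for words, gold in test_sentences:
        predicted = model(words)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0

def accept_model(test_sentences, model, threshold=0.95):
    """Keep the adjusted model only if its accuracy meets the preset value;
    otherwise it would be trained further on new corpus sentences."""
    return labeling_accuracy(test_sentences, model) >= threshold
```

A model that fails `accept_model` corresponds to the "train on at least one new corpus sentence" branch of the description.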
In a possible implementation, the apparatus further includes a sentence-breaking module 305, configured to:
after a sentence-breaking model is established, sentence-breaking processing is carried out on an input character sequence by utilizing the sentence-breaking model, wherein the character sequence is obtained by carrying out voice recognition processing on a collected voice signal;
and outputting the character sequence after sentence break processing.
The division of the modules in the embodiments of the present application is schematic and is merely a division by logical function; in actual implementation, other divisions are possible. In addition, the functional modules in the embodiments of the application may each be integrated in one processor, may exist alone physically, or two or more modules may be integrated in one module. The modules are typically coupled to each other through electrical communication interfaces, although mechanical or other forms of interface are not excluded. Thus, modules described as separate components may or may not be physically separate, and may be located in one place or distributed across different locations on the same or different devices. An integrated module may be implemented in hardware or as a software functional module.
An embodiment of the present application further provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the above embodiments.
The embodiment of the present application further provides a computer-readable storage medium that stores computer-executable instructions which, when executed by a processor, cause the processor to perform the method of any of the above embodiments.
In some possible embodiments, the aspects of the method for building a deep learning model for sentence break provided in the present application may also be implemented in the form of a program product, which includes program code for causing an electronic device to perform the steps in the method for building a deep learning model for sentence break according to various exemplary embodiments of the present application described above in this specification when the program product runs on the electronic device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for performing the building of the deep learning model of sentence fragments of the embodiments of the present application may employ a portable compact disk read only memory (CD-ROM) and include program code, and may be executable on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A sentence break model establishing method is characterized by comprising the following steps:
performing word segmentation processing on each acquired corpus sentence, and determining words contained in the corpus sentence;
determining rare words in the words contained in the corpus sentences, and segmenting the rare words by utilizing a sub-word segmentation algorithm;
inputting a word sequence formed by the words obtained after word segmentation and segmentation into a deep learning model for sentence segmentation and annotation;
and adjusting parameters of the deep learning model according to the original sentence break identification of each corpus sentence and the sentence break label corresponding to the corpus sentence output by the deep learning model, and establishing a sentence break model.
2. The method of claim 1, wherein the corpus sentences are obtained according to the following steps:
acquiring a preset number of sample sentences, wherein sentence ends of the sample sentences are provided with sentence break marks;
splicing partial or all sample sentences;
and segmenting each spliced sample sentence, and determining the segmented sample sentence as the corpus sentence.
3. The method of claim 2, wherein segmenting each stitched sample sentence comprises:
dividing each spliced sample sentence according to a set step length; or
randomly segmenting each spliced sample sentence.
4. The method of claim 1, wherein the subword slicing algorithm is a byte pair encoding BPE algorithm.
5. The method of claim 1, wherein the deep learning model is controlled to label each term in the sequence of terms according to the following steps:
analyzing the context information of the word in the word sequence;
determining a first probability of marking the punctuation mark of the word and a second probability of marking the non-punctuation mark of the word according to the context information of the word;
and selecting the identifier with the larger of the first probability and the second probability to label the word.
6. The method of claim 5, wherein after determining the first probability of labeling a punctuation mark for the term and the second probability of labeling a non-punctuation mark for the term based on the context information of the term, the deep learning model is further controlled to:
adjusting the first probability and the second probability according to the labeling condition of each labeled word in the word sequence; and
the step of selecting the identifier with the larger of the first probability and the second probability to label the word comprises:
selecting the identifier with the larger of the adjusted first probability and the adjusted second probability to label the word.
7. The method according to claim 1, wherein adjusting parameters of the deep learning model according to the original sentence break identifier of each corpus sentence and the sentence break label corresponding to the corpus sentence output by the deep learning model comprises:
and comparing the position of the original sentence break identifier of each corpus sentence with the position of the sentence break label corresponding to the corpus sentence output by the deep learning model, and adjusting the parameters of the deep learning model so that the position of the sentence break label corresponding to the corpus sentence output by the deep learning model after adjustment is the same as the position of the original sentence break identifier of the corpus sentence.
8. The method of claim 1, wherein after adjusting the parameters of the deep learning model, further comprising:
inputting at least one test sentence into the adjusted deep learning model for sentence segmentation marking;
determining the marking accuracy rate of the adjusted deep learning model according to the original sentence break identification of the test sentence and the sentence break marking corresponding to the test sentence output by the adjusted deep learning model;
if the determined marking accuracy is greater than or equal to a preset accuracy, determining the adjusted deep learning model as the sentence-breaking model;
and if the determined marking accuracy is smaller than the preset accuracy, training the adjusted deep learning model according to at least one new corpus sentence to establish the sentence break model.
9. The method of claim 1, after establishing the sentence-breaking model, further comprising:
carrying out sentence-breaking processing on an input character sequence by using the sentence-breaking model, wherein the character sequence is obtained by carrying out voice recognition processing on a collected voice signal;
and outputting the character sequence after sentence break processing.
10. An apparatus for creating a sentence-breaking model, comprising:
the preprocessing module is used for carrying out word segmentation processing on each acquired corpus sentence and determining words contained in the corpus sentence; determining rare words in the words contained in the corpus sentences, and segmenting the rare words by utilizing a sub-word segmentation algorithm;
the marking module is used for inputting a word sequence formed by the words obtained after the word segmentation processing and the segmentation processing into the deep learning model for sentence breaking and marking;
and the adjusting module is used for adjusting parameters of the deep learning model according to the original sentence break identification of each corpus sentence and the sentence break label corresponding to the corpus sentence output by the deep learning model, and establishing a sentence break model.
CN201811320993.5A 2018-11-07 2018-11-07 Method and device for establishing sentence-breaking model Active CN111160004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811320993.5A CN111160004B (en) 2018-11-07 2018-11-07 Method and device for establishing sentence-breaking model


Publications (2)

Publication Number Publication Date
CN111160004A true CN111160004A (en) 2020-05-15
CN111160004B CN111160004B (en) 2023-06-27

Family

ID=70554565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811320993.5A Active CN111160004B (en) 2018-11-07 2018-11-07 Method and device for establishing sentence-breaking model

Country Status (1)

Country Link
CN (1) CN111160004B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160117316A1 (en) * 2014-10-24 2016-04-28 Google Inc. Neural machine translation systems with rare word processing
CN107247706A (en) * 2017-06-16 2017-10-13 中国电子技术标准化研究院 Text punctuate method for establishing model, punctuate method, device and computer equipment
CN107305575A (en) * 2016-04-25 2017-10-31 北京京东尚科信息技术有限公司 The punctuate recognition methods of human-machine intelligence's question answering system and device
CN107578770A (en) * 2017-08-31 2018-01-12 百度在线网络技术(北京)有限公司 Networking telephone audio recognition method, device, computer equipment and storage medium


Non-Patent Citations (1)

Title
ZHANG He; WANG Xiaodong; YANG Jianyu; ZHOU Weidong: "A Method for Sentence Segmentation and Punctuation Tagging of Classical Chinese Texts Based on Cascaded CRFs" *

Cited By (8)

Publication number Priority date Publication date Assignee Title
CN111737991A (en) * 2020-07-01 2020-10-02 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN111753524A (en) * 2020-07-01 2020-10-09 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN111737991B (en) * 2020-07-01 2023-12-12 携程计算机技术(上海)有限公司 Text sentence breaking position identification method and system, electronic equipment and storage medium
CN112397052A (en) * 2020-11-19 2021-02-23 康键信息技术(深圳)有限公司 VAD sentence-breaking test method, VAD sentence-breaking test device, computer equipment and storage medium
CN112632988A (en) * 2020-12-29 2021-04-09 文思海辉智科科技有限公司 Sentence segmentation method and device and electronic equipment
CN113779964A (en) * 2021-09-02 2021-12-10 中联国智科技管理(北京)有限公司 Statement segmentation method and device
CN116052648A (en) * 2022-08-03 2023-05-02 荣耀终端有限公司 Training method, using method and training system of voice recognition model
CN116052648B (en) * 2022-08-03 2023-10-20 荣耀终端有限公司 Training method, using method and training system of voice recognition model

Also Published As

Publication number Publication date
CN111160004B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN107657947B (en) Speech processing method and device based on artificial intelligence
CN111160004B (en) Method and device for establishing sentence-breaking model
CN107291828B (en) Spoken language query analysis method and device based on artificial intelligence and storage medium
CN108091328B (en) Speech recognition error correction method and device based on artificial intelligence and readable medium
CN110349564B (en) Cross-language voice recognition method and device
CN111160003B (en) Sentence breaking method and sentence breaking device
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN112188311B (en) Method and apparatus for determining video material of news
CN108897869B (en) Corpus labeling method, apparatus, device and storage medium
CN110245232B (en) Text classification method, device, medium and computing equipment
CN109726397B (en) Labeling method and device for Chinese named entities, storage medium and electronic equipment
CN113743101B (en) Text error correction method, apparatus, electronic device and computer storage medium
CN111753524A (en) Text sentence break position identification method and system, electronic device and storage medium
JP2022120024A (en) Audio signal processing method, model training method, and their device, electronic apparatus, storage medium, and computer program
CN116166827B (en) Training of semantic tag extraction model and semantic tag extraction method and device
CN111276129A (en) Method, device and equipment for segmenting audio frequency of television series
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN111328416B (en) Speech patterns for fuzzy matching in natural language processing
CN115952854B (en) Training method of text desensitization model, text desensitization method and application
CN113160820A (en) Speech recognition method, and training method, device and equipment of speech recognition model
CN114429106B (en) Page information processing method and device, electronic equipment and storage medium
CN114398952B (en) Training text generation method and device, electronic equipment and storage medium
KR102553511B1 (en) Method, device, electronic equipment and storage medium for video processing
CN114118068A (en) Method and device for amplifying training text data and electronic equipment
CN110276001B (en) Checking page identification method and device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant